 Command

Sam Foreman's personal site. Vim-style keybinds for navigation; theme + font pickers below.

Theme
 Font
Keybinds
Navigation
j / ↓ Next item k / ↑ Previous item g First item in region G Last item in region zz Center focused item h / l Move left/right region ] / [ Next/previous heading } / { Next/previous block ⌃D / ⌃U Half-page down/up
Layout
<zh> / <zl> Toggle left/right sidebar <zj> / <zk> Focus main/navbar <S-h/j/k/l> Focus left/main/navbar/right ⌃H / ⌃L Focus left/right sidebar ⌃J / ⌃K Focus main/navbar ⇧C / ⇧E Collapse / expand all sections
Dialogs
⌃P / : Command palette ⌃X Theme picker / Search ? Show keybinds Esc / ⌃C Close dialog
History
⌃N Next document ⌃B Previous document ⌃O History back ⌃I History forward
 Search
about: Sam Foreman docs/test: Docs Test ideas: 💡 Ideas about/more: 🪪 More now: Now more: ➕ More posts: 📬 Posts projects: 📚 Projects talks: 🎙️ Talks webtui: Style posts/2025: 📆 2025 posts/auroragpt: 🤖 AuroraGPT posts/ai-for-physics: ⚛️ AI for Physics posts/dope-slides: 💅 How to Make Dope Slides posts/ezpz-at-alcf: 🍋 ezpz @ ALCF posts/ezpz-v1: 📝 ezpz-v1 posts/jupyter: 📗 Jupyter posts/resume: 🧑🏻‍💻 Sam Foreman’s Résumé posts/svgbob: 🫥 svgbob posts/torchtune-aurora: 🪛 Torchtune on Aurora posts/torchtune-patch-aurora: 🚑 Torchtune Patch on Aurora talks/auroragpt-siam25: AuroraGPT talks/ai-for-science-2024: Parallel Training Methods talks/aurora-gpt-fm-for-electric-grid/auroragpt-fm-for-electric-grid: AuroraGPT: Foundation Models for Science talks/hpc-user-forum/auroragpt: AuroraGPT talks/alcf-hpc-workshop-2024/alcf-hpc-workshop-2024: Deep Learning and Foundation Models at Scale talks/demo-slides: AuroraGPT: Training Foundation Models on Supercomputers talks/incite-hackathon-2025: ALCF Incite Hackathon 2025 talks/llms-at-scale: Training LLMs at Scale talks/llms-on-polaris: Training LLMs on Polaris talks/openskai25: Open SkAI2025 webtui/components/accordion: Accordion webtui/components/badge: Badge webtui/components/button: Button webtui/components/checkbox: Checkbox webtui/components/dialog: Dialog webtui/components/input: Input webtui/components/popover: Popover webtui/components/pre: Pre webtui/components/progress: Progress webtui/components/radio: Radio webtui/components/range: Range webtui/components/separator: Separator webtui/components/spinner: Spinner webtui/components/switch: Switch webtui/components/table: Table webtui/components/textarea: Textarea webtui/components/tooltip: Popover webtui/components/typography: Typography webtui/components/view: View webtui/contributing/contributing: Contributing webtui/contributing/contributing: ## Local Development webtui/contributing/contributing: ## Issues webtui/contributing/contributing: ## Pull Requests webtui/contributing/style-guide: Style Guide webtui/contributing/style-guide: ## CSS Units webtui/contributing/style-guide: ## Selectors webtui/contributing/style-guide: ## Documentation webtui/installation/astro: Astro webtui/installation/astro: ## Scoping webtui/installation/astro: ### Frontmatter Imports webtui/installation/astro: ### <style> tag webtui/installation/astro: ### Full Library Import webtui/installation/nextjs: Next.js webtui/installation/vite: Vite webtui/start/ascii-boxes: ASCII Boxes webtui/start/changelog: Changelog webtui/start/installation: Installation webtui/start/installation: ## Installation webtui/start/installation: ## Using CSS webtui/start/installation: ## Using ESM webtui/start/installation: ## Using a CDN webtui/start/installation: ## Full Library Import webtui/start/installation: ### CSS webtui/start/installation: ### ESM webtui/start/installation: ### CDN webtui/start/intro: Introduction webtui/start/intro: ## Features webtui/start/plugins: Plugins webtui/start/plugins: ## Official Plugins webtui/start/plugins: ### Themes webtui/start/plugins: ## Community Plugins webtui/start/theming: Theming webtui/start/theming: ## CSS Variables webtui/start/theming: ### Font Styles webtui/start/theming: ### Colors webtui/start/theming: ### Light & Dark webtui/start/theming: ## Theme Plugins webtui/start/theming: ### Using Multiple Theme Accents webtui/start/tuis-vs-guis: TUIs vs GUIs webtui/start/tuis-vs-guis: ## Monospace Fonts webtui/start/tuis-vs-guis: ## Character Cells webtui/plugins/plugin-nf: Nerd Font Plugin webtui/plugins/plugin-dev: Developing Plugins webtui/plugins/plugin-dev: ### Style Layers webtui/plugins/theme-catppuccin: Catppuccin Theme webtui/plugins/theme-custom: Custom Theme webtui/plugins/theme-everforest: Everforest Theme webtui/plugins/theme-gruvbox: Gruvbox Theme webtui/plugins/theme-nord: Nord Theme webtui/plugins/theme-vitesse: Vitesse Theme posts/2025/06: 06 posts/auroragpt/aurora-gpt: 🏎️ Megatron-DeepSpeed on Intel XPU posts/auroragpt/determinstic-flash-attn/deterministic-flash-attn: 🎰 Deterministic `flash-attn` posts/auroragpt/flash-attn-sunspot: 📸 `flash-attn` on Sunspot posts/auroragpt/long-sequences: 🚂 Loooooooong Sequence Lengths posts/auroragpt/checkpoints: 💾 Converting Checkpoints posts/auroragpt/spike-skipper: 🏔️ Spike Skipper posts/auroragpt/mpi4py-reproducer: 🐛 `mpi4py` bug on Sunspot posts/auroragpt/startup-times: 🐢 Starting Up Distributed Training on Aurora posts/auroragpt/startup-times: ## Response posts/auroragpt/startup-times: ### Measuring / Calculating Startup Time posts/auroragpt/startup-times: ## Minimal Working Example posts/ai-for-physics/diffusion: 🎲 MCMC + Diffusion Sampling posts/ai-for-physics/l2hmc-qcd: 🎢 L2HMC for LQCD posts/jupyter/test: 🏁 `l2hmc` Example: 2D $U(1)$ talks/auroragpt/alcf-hpc-workshop-2024/auroragpt-alcf-hands-on-hpc-workshop-2024: AuroraGPT: ANL's General Purpose Scientific LLM posts/jupyter/l2hmc-4dsu3: 🔳 `l2hmc-qcd` Example: 4D SU(3) talks/incite-hackathon-2025/auroragpt: LLMs on Aurora: Overview talks/incite-hackathon-2025/ezpz: LLMs on Aurora: Hands-On talks/openskai25/ai4science: Scientific AI at Scale: AuroraGPT posts/2025/04/28: 🔥 Building PyTorch 2.6 from Source on Aurora talks/openskai25/training: Scientific AI at Scale: Distributed Training posts/2025/05/03: 🚧 Frameworks Issue with numpy \> 2 posts/2025/06/01: 📰 Nice Headings posts/2025/10/06: 🎨 Mixing Between Distributions While Training posts/2025/06/14: 🏗️ Building PyTorch 2.8 from Source on Aurora posts/2025/09/12: 🍹 BlendCorpus + TorchTitan @ ALCF posts/2025/11/12: 🧊 Cooling Down Checkpoints: Best Practices for Model Evaluation posts/2026/01/10: 🍋 ezpz: distributed PyTorch across any hardware posts/2025/06/02: 🧜‍♀️ Mermaid posts/2025/09/17: 📊 `pbs-tui`: TUI for PBS Job Scheduler Monitoring posts/2026/05/01: Running 50k Python Processes on Aurora with ezpz yeet posts/2026/05/01: ## What it does posts/2026/05/01: ## CLI surface posts/2026/05/01: ### Choosing a local copy method posts/2026/05/01: ### Tarball source posts/2026/05/01: ### Generic (non-venv) sources posts/2026/05/01: ## How it works posts/2026/05/01: ### Local copy + patch posts/2026/05/01: ### Greedy fan-out posts/2026/05/01: ## Scaling on Aurora: 8 → 4096 nodes posts/2026/05/01: ### Two regimes posts/2026/05/01: ### Why tarball broadcast scales so much better than per-file rsync posts/2026/05/01: ## Reproducing posts/2026/05/01: ## Complete workflow posts/2026/05/01: ## See also posts/2026/01/07: 🎉 Happy New Year! posts/2026/02/28: ⏱️ Comparing Launchers on Aurora posts/2026/02/28: ## torchrun posts/2026/02/28: ## ezpz posts/2026/04/27: Pre-Training AuroraGPT with TorchTitan posts/2026/04/27: ## Two-Week Summary (Apr 12–27, 2026) posts/2026/04/27: ## Detailed Breakdown posts/2026/04/27: ### Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes posts/2026/04/27: #### Benchmarking (Apr 12–15) posts/2026/04/27: #### LR Finder (Apr 12–14) posts/2026/04/27: #### Scaling Study (Apr 12) posts/2026/04/27: #### Upstream Syncs (Apr 12–18, syncs 6–14) posts/2026/04/27: #### XPU Bug Fixes (Apr 18) posts/2026/04/27: #### RL Experiment (Apr 18) posts/2026/04/27: ### Week 1.5: Apr 18–25 — Production Readiness posts/2026/04/27: #### Torch 2.12 Benchmarks (Apr 18) posts/2026/04/27: #### LR Finder Extensions (Apr 20–21) posts/2026/04/27: #### XPU Fixes (Apr 23) posts/2026/04/27: #### Torch 2.13 Environment (Apr 25) posts/2026/04/27: #### 2B Scaling Study on Torch 2.13 (Apr 25) posts/2026/04/27: #### Production Training (Apr 25) posts/2026/04/27: ### Week 2: Apr 26–27 — Optimizer Competition posts/2026/04/27: #### RL Multi-Task Refactor (Apr 26) posts/2026/04/27: #### Docs Reorganization (Apr 26) posts/2026/04/27: #### Generic HF Dataset Streaming (Apr 26) posts/2026/04/27: #### New Optimizers (Apr 26) posts/2026/04/27: #### Architecture Tweaks (Apr 26–27) posts/2026/04/27: ## Competition Results posts/2026/04/27: ### Round 1–3: Speedrun — 2N, GBS=48, 1000 steps posts/2026/04/27: ### 10B Full Training — 8N, GBS=384, ~3,178 steps posts/2026/04/27: ### Round 4: Reproducible Speedrun — 2N, GAS=8, GBS=384, 1000 steps posts/2026/04/27: ## Key Discoveries posts/2026/04/27: ## Infrastructure Built posts/2026/04/27: ## High-Level posts/2026/04/27: ## Detailed Breakdown posts/2026/04/27: ### Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes posts/2026/04/27: #### Benchmarking (Apr 12–15) posts/2026/04/27: #### LR Finder (Apr 12–14) posts/2026/04/27: #### Scaling Study (Apr 12) posts/2026/04/27: #### Upstream Syncs (Apr 12–18, syncs 6–14) posts/2026/04/27: #### XPU Bug Fixes (Apr 18) posts/2026/04/27: #### RL Experiment (Apr 18) posts/2026/04/27: ### Week 1.5: Apr 18–25 — Production Readiness posts/2026/04/27: #### Torch 2.12 Benchmarks (Apr 18) posts/2026/04/27: #### LR Finder Extensions (Apr 20–21) posts/2026/04/27: #### XPU Fixes (Apr 23) posts/2026/04/27: #### Torch 2.13 Environment (Apr 25) posts/2026/04/27: #### 2B Scaling Study on Torch 2.13 (Apr 25) posts/2026/04/27: #### Production Training (Apr 25) posts/2026/04/27: ### Week 2: Apr 26–27 — Optimizer Competition posts/2026/04/27: #### RL Multi-Task Refactor (Apr 26) posts/2026/04/27: #### Docs Reorganization (Apr 26) posts/2026/04/27: #### Generic HF Dataset Streaming (Apr 26) posts/2026/04/27: #### New Optimizers (Apr 26) posts/2026/04/27: #### Architecture Tweaks (Apr 26–27) posts/2026/04/27: ## Competition Results posts/2026/04/27: ### Round 1–3: 1000-step speedruns, 2 nodes, GBS=48 (17 configs) posts/2026/04/27: ### Round 4 (10B full training, 8 nodes, GBS=384, 5 configs) posts/2026/04/27: ### Round 5 (2 nodes, GAS=8, GBS=384, local dataset, 8 configs — in progress) posts/2026/04/27: ## Key Discoveries posts/2026/04/27: ## Infrastructure Built posts/ai-for-physics/l2hmc-qcd/2du1: 🎢 l2hmc-qcd Example: 2D U(1) posts/jupyter/l2hmc/4dsu3: 🔳 l2hmc-qcd Example: 4D SU(3) talks/2025/10/08: AERIS: Argonne's Earth Systems Model posts/ai-for-physics/l2hmc-qcd/4dsu3nb/index-broken: 🕸️ l2hmc-qcd Example: 4D SU(3) talks/2025/10/15: Training Foundation Models on Supercomputers talks/2025/09/24: Training Foundation Models on Supercomputers talks/2025/10/24: Training Foundation Models on Supercomputers talks/2026/06/03: Production Pre-Training at Scale: The Good, the Bad, and the Restarts talks/2025/12/16: AuroraGPT: Training Foundation Models on Supercomputers posts/drafts/2025/09/22: 📝 2025 Annual Report
 Theme Current: Light j/k or ↑/↓ + Enter

🎉 Happy New Year!

A New Year update summarizing ongoing projects including AuroraGPT, AERIS, and other involvements at Argonne.

Sam Foreman 2026-01-07

I’d like to try and post more this year.

Ideally these would be less-polished, more-frequent updates on what I’m thinking about / working on.

Ongoing Projects

  • AuroraGPT: Large Language Models for Scientific Applications on leadership-class supercomputers1.

    Additional details can be found in some of my recent talks:

  • AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions (Hatanpää et al. (2025)).

    Foundation models for Earth system science, pushing on coupled modeling, uncertainty, and long-horizon prediction. Additional details can be found in some of my recent talks:

  • 🍋 ezpz: A growing collection of utilities for launching, instrumenting, and debugging distributed jobs on real HPC systems.

    This started as glue code and turned into infrastructure. A dedicated post is coming.

  • Genesis Project

    • The American Science Cloud (AmSC): A Platform for Transformative Science

      More info Cornerstone of the Genesis Mission’s Platform infrastructure, hosting and distributing AI models and scientific data to the broader research community. AmSC will enable the National Labs, industry, and research partners to curate and apply DOE’s extensive AI-ready scientific data.

    • The Transformational AI Models Consortium (ModCon): Cornerstone of the Genesis Mission’s AI models and data efforts,

      More info Cornerstone of the Genesis Mission’s AI models and data efforts, will build and deploy self-improving AI models that advance science, engineering, and energy missions by harnessing DOE’s unique data, facilities, and expertise. Selected teams will develop foundational capabilities needed across multiple scientific and engineering domains.

Additional Involvements

  • DeepSpeed Technical Steering Committee2

    More info The DeepSpeed Technical Steering Committee (TSC) is responsible for guiding the overall direction and development of the DeepSpeed project.

  • CPSC: Member of the Coordinating Panel for Software and Computing.

    More info

    The Coordinating Panel for Software and Computing (CPSC) serves as a forum for the U.S. high energy physics (HEP) community to address shared challenges in scientific computing.

    Hosted by the Division of Particles and Fields (DPF) of the American Physical Society(APS), the panel brings together researchers, developers, institutions, and industry partners to strengthen the software and computing ecosystem that underpins modern HEP research. Through coordination, advocacy, and community-building, CPSC works to foster innovation, support career development, and ensure the computing infrastructure evolves to meet the demands of current and future experiments.

Hatanpää, Väinö, Eugene Ku, Jason Stock, et al. 2025. AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions. https://arxiv.org/abs/2509.13523.

Footnotes

  1. More on this soon!

  2. Roadmap discussion

NORMAL  main  sam.onl/ posts/2026/01/07/index.mdx · Top 1:1