 Command

Terminal-style knowledge base. Use the keybinds below to move around, then customize the look with the theme picker.

Theme
 Font
Keybinds
Navigation
j / ↓ Next item k / ↑ Previous item g First item in region G Last item in region zz Center focused item h / l Move left/right region ] / [ Next/previous heading } / { Next/previous block ⌃D / ⌃U Half-page down/up
Layout
<zh> / <zl> Toggle left/right sidebar <zj> / <zk> Focus main/navbar <S-h/j/k/l> Focus left/main/navbar/right ⌃H / ⌃L Focus left/right sidebar ⌃J / ⌃K Focus main/navbar ⇧C / ⇧E Collapse / expand all sections
Dialogs
⌃P / : Command palette ⌃X Theme picker / Search ? Show keybinds Esc / ⌃C Close dialog
History
⌃N Next document ⌃B Previous document ⌃O History back ⌃I History forward
 Search
about: Sam Foreman about/more: 🪪 More ideas: 💡 Ideas docs/test: Docs Test more: ➕ More now: Now posts: 📬 Posts projects: 📚 Projects webtui: Style talks: 🎙️ Talks posts/2025: 📆 2025 posts/auroragpt: 🤖 AuroraGPT posts/ai-for-physics: ⚛️ AI for Physics posts/dope-slides: 💅 How to Make Dope Slides posts/ezpz-at-alcf: 🍋 ezpz @ ALCF posts/ezpz-v1: 📝 ezpz-v1 posts/jupyter: 📗 Jupyter posts/resume: 🧑🏻‍💻 Sam Foreman’s Résumé posts/svgbob: 🫥 svgbob posts/torchtune-aurora: 🪛 Torchtune on Aurora posts/torchtune-patch-aurora: 🚑 Torchtune Patch on Aurora talks/auroragpt-siam25: AuroraGPT talks/ai-for-science-2024: Parallel Training Methods talks/alcf-hpc-workshop-2024/alcf-hpc-workshop-2024: Deep Learning and Foundation Models at Scale talks/aurora-gpt-fm-for-electric-grid/auroragpt-fm-for-electric-grid: AuroraGPT: Foundation Models for Science talks/demo-slides: AuroraGPT: Training Foundation Models on Supercomputers talks/incite-hackathon-2025: ALCF Incite Hackathon 2025 talks/hpc-user-forum/auroragpt: AuroraGPT talks/llms-at-scale: Training LLMs at Scale talks/llms-on-polaris: Training LLMs on Polaris talks/openskai25: Open SkAI2025 webtui/components/accordion: Accordion webtui/components/badge: Badge webtui/components/checkbox: Checkbox webtui/components/button: Button webtui/components/dialog: Dialog webtui/components/input: Input webtui/components/popover: Popover webtui/components/pre: Pre webtui/components/progress: Progress webtui/components/radio: Radio webtui/components/range: Range webtui/components/separator: Separator webtui/components/spinner: Spinner webtui/components/switch: Switch webtui/components/textarea: Textarea webtui/components/table: Table webtui/components/tooltip: Popover webtui/components/typography: Typography webtui/components/view: View webtui/contributing/contributing: Contributing webtui/contributing/contributing: ## Local Development webtui/contributing/contributing: ## Issues webtui/contributing/contributing: ## Pull Requests webtui/contributing/style-guide: Style Guide webtui/contributing/style-guide: ## CSS Units webtui/contributing/style-guide: ## Selectors webtui/contributing/style-guide: ## Documentation webtui/installation/astro: Astro webtui/installation/astro: ## Scoping webtui/installation/astro: ### Frontmatter Imports webtui/installation/astro: ### <style> tag webtui/installation/astro: ### Full Library Import webtui/installation/nextjs: Next.js webtui/installation/vite: Vite webtui/plugins/plugin-dev: Developing Plugins webtui/plugins/plugin-dev: ### Style Layers webtui/plugins/plugin-nf: Nerd Font Plugin webtui/plugins/theme-catppuccin: Catppuccin Theme webtui/plugins/theme-custom: Custom Theme webtui/plugins/theme-everforest: Everforest Theme webtui/plugins/theme-gruvbox: Gruvbox Theme webtui/plugins/theme-nord: Nord Theme webtui/plugins/theme-vitesse: Vitesse Theme webtui/start/ascii-boxes: ASCII Boxes webtui/start/changelog: Changelog webtui/start/installation: Installation webtui/start/installation: ## Installation webtui/start/installation: ## Using CSS webtui/start/installation: ## Using ESM webtui/start/installation: ## Using a CDN webtui/start/installation: ## Full Library Import webtui/start/installation: ### CSS webtui/start/installation: ### ESM webtui/start/installation: ### CDN webtui/start/intro: Introduction webtui/start/intro: ## Features webtui/start/plugins: Plugins webtui/start/plugins: ## Official Plugins webtui/start/plugins: ### Themes webtui/start/plugins: ## Community Plugins webtui/start/theming: Theming webtui/start/theming: ## CSS Variables webtui/start/theming: ### Font Styles webtui/start/theming: ### Colors webtui/start/theming: ### Light & Dark webtui/start/theming: ## Theme Plugins webtui/start/theming: ### Using Multiple Theme Accents webtui/start/tuis-vs-guis: TUIs vs GUIs webtui/start/tuis-vs-guis: ## Monospace Fonts webtui/start/tuis-vs-guis: ## Character Cells posts/2025/06: 06 posts/auroragpt/aurora-gpt: 🏎️ Megatron-DeepSpeed on Intel XPU posts/auroragpt/determinstic-flash-attn/deterministic-flash-attn: 🎰 Deterministic `flash-attn` posts/auroragpt/flash-attn-sunspot: 📸 `flash-attn` on Sunspot posts/auroragpt/checkpoints: 💾 Converting Checkpoints posts/auroragpt/long-sequences: 🚂 Loooooooong Sequence Lengths posts/auroragpt/spike-skipper: 🏔️ Spike Skipper posts/auroragpt/mpi4py-reproducer: 🐛 `mpi4py` bug on Sunspot posts/auroragpt/startup-times: 🐢 Starting Up Distributed Training on Aurora posts/auroragpt/startup-times: ## Response posts/auroragpt/startup-times: ### Measuring / Calculating Startup Time posts/auroragpt/startup-times: ## Minimal Working Example posts/ai-for-physics/diffusion: 🎲 MCMC + Diffusion Sampling posts/ai-for-physics/l2hmc-qcd: 🎢 L2HMC for LQCD posts/jupyter/l2hmc-4dsu3: 🔳 `l2hmc-qcd` Example: 4D SU(3) posts/jupyter/test: 🏁 `l2hmc` Example: 2D $U(1)$ talks/auroragpt/alcf-hpc-workshop-2024/auroragpt-alcf-hands-on-hpc-workshop-2024: AuroraGPT: ANL's General Purpose Scientific LLM talks/incite-hackathon-2025/ezpz: LLMs on Aurora: Hands-On talks/incite-hackathon-2025/auroragpt: LLMs on Aurora: Overview talks/openskai25/ai4science: Scientific AI at Scale: AuroraGPT talks/openskai25/training: Scientific AI at Scale: Distributed Training posts/2025/04/28: 🔥 Building PyTorch 2.6 from Source on Aurora posts/2025/05/03: 🚧 Frameworks Issue with numpy \> 2 posts/2025/06/01: 📰 Nice Headings posts/2025/06/14: 🏗️ Building PyTorch 2.8 from Source on Aurora posts/2025/06/02: 🧜‍♀️ Mermaid posts/2025/09/12: 🍹 BlendCorpus + TorchTitan @ ALCF posts/2025/09/17: 📊 `pbs-tui`: TUI for PBS Job Scheduler Monitoring posts/2025/10/06: 🎨 Mixing Between Distributions While Training posts/2025/11/12: 🧊 Cooling Down Checkpoints: Best Practices for Model Evaluation posts/2026/01/07: 🎉 Happy New Year! posts/2026/01/10: 🍋 ezpz: distributed PyTorch across any hardware posts/2026/04/27: Pre-Training AuroraGPT with TorchTitan posts/2026/04/27: ## Two-Week Summary (Apr 12–27, 2026) posts/2026/04/27: ## Detailed Breakdown posts/2026/04/27: ### Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes posts/2026/04/27: #### Benchmarking (Apr 12–15) posts/2026/04/27: #### LR Finder (Apr 12–14) posts/2026/04/27: #### Scaling Study (Apr 12) posts/2026/04/27: #### Upstream Syncs (Apr 12–18, syncs 6–14) posts/2026/04/27: #### XPU Bug Fixes (Apr 18) posts/2026/04/27: #### RL Experiment (Apr 18) posts/2026/04/27: ### Week 1.5: Apr 18–25 — Production Readiness posts/2026/04/27: #### Torch 2.12 Benchmarks (Apr 18) posts/2026/04/27: #### LR Finder Extensions (Apr 20–21) posts/2026/04/27: #### XPU Fixes (Apr 23) posts/2026/04/27: #### Torch 2.13 Environment (Apr 25) posts/2026/04/27: #### 2B Scaling Study on Torch 2.13 (Apr 25) posts/2026/04/27: #### Production Training (Apr 25) posts/2026/04/27: ### Week 2: Apr 26–27 — Optimizer Competition posts/2026/04/27: #### RL Multi-Task Refactor (Apr 26) posts/2026/04/27: #### Docs Reorganization (Apr 26) posts/2026/04/27: #### Generic HF Dataset Streaming (Apr 26) posts/2026/04/27: #### New Optimizers (Apr 26) posts/2026/04/27: #### Architecture Tweaks (Apr 26–27) posts/2026/04/27: ## Competition Results posts/2026/04/27: ### Round 1–3: Speedrun — 2N, GBS=48, 1000 steps posts/2026/04/27: ### 10B Full Training — 8N, GBS=384, ~3,178 steps posts/2026/04/27: ### Round 4: Reproducible Speedrun — 2N, GAS=8, GBS=384, 1000 steps posts/2026/04/27: ## Key Discoveries posts/2026/04/27: ## Infrastructure Built posts/2026/04/27: ## High-Level posts/2026/04/27: ## Detailed Breakdown posts/2026/04/27: ### Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes posts/2026/04/27: #### Benchmarking (Apr 12–15) posts/2026/04/27: #### LR Finder (Apr 12–14) posts/2026/04/27: #### Scaling Study (Apr 12) posts/2026/04/27: #### Upstream Syncs (Apr 12–18, syncs 6–14) posts/2026/04/27: #### XPU Bug Fixes (Apr 18) posts/2026/04/27: #### RL Experiment (Apr 18) posts/2026/04/27: ### Week 1.5: Apr 18–25 — Production Readiness posts/2026/04/27: #### Torch 2.12 Benchmarks (Apr 18) posts/2026/04/27: #### LR Finder Extensions (Apr 20–21) posts/2026/04/27: #### XPU Fixes (Apr 23) posts/2026/04/27: #### Torch 2.13 Environment (Apr 25) posts/2026/04/27: #### 2B Scaling Study on Torch 2.13 (Apr 25) posts/2026/04/27: #### Production Training (Apr 25) posts/2026/04/27: ### Week 2: Apr 26–27 — Optimizer Competition posts/2026/04/27: #### RL Multi-Task Refactor (Apr 26) posts/2026/04/27: #### Docs Reorganization (Apr 26) posts/2026/04/27: #### Generic HF Dataset Streaming (Apr 26) posts/2026/04/27: #### New Optimizers (Apr 26) posts/2026/04/27: #### Architecture Tweaks (Apr 26–27) posts/2026/04/27: ## Competition Results posts/2026/04/27: ### Round 1–3: 1000-step speedruns, 2 nodes, GBS=48 (17 configs) posts/2026/04/27: ### Round 4 (10B full training, 8 nodes, GBS=384, 5 configs) posts/2026/04/27: ### Round 5 (2 nodes, GAS=8, GBS=384, local dataset, 8 configs — in progress) posts/2026/04/27: ## Key Discoveries posts/2026/04/27: ## Infrastructure Built posts/2026/02/28: ⏱️ Comparing Launchers on Aurora posts/2026/02/28: ## torchrun posts/2026/02/28: ## ezpz posts/2026/05/01: Running 50k Python Processes on Aurora with ezpz yeet posts/2026/05/01: ## What it does posts/2026/05/01: ## CLI surface posts/2026/05/01: ### Choosing a local copy method posts/2026/05/01: ### Tarball source posts/2026/05/01: ### Generic (non-venv) sources posts/2026/05/01: ## How it works posts/2026/05/01: ### Local copy + patch posts/2026/05/01: ### Greedy fan-out posts/2026/05/01: ## Scaling on Aurora: 8 → 4096 nodes posts/2026/05/01: ### Two regimes posts/2026/05/01: ### Why tarball broadcast scales so much better than per-file rsync posts/2026/05/01: ## Reproducing posts/2026/05/01: ## Complete workflow posts/2026/05/01: ## See also posts/ai-for-physics/l2hmc-qcd/2du1: 🎢 l2hmc-qcd Example: 2D U(1) posts/ai-for-physics/l2hmc-qcd/4dsu3nb/index-broken: 🕸️ l2hmc-qcd Example: 4D SU(3) posts/jupyter/l2hmc/4dsu3: 🔳 l2hmc-qcd Example: 4D SU(3) talks/2025/09/24: Training Foundation Models on Supercomputers talks/2025/10/08: AERIS: Argonne's Earth Systems Model talks/2025/10/15: Training Foundation Models on Supercomputers talks/2025/10/24: Training Foundation Models on Supercomputers talks/2025/12/16: AuroraGPT: Training Foundation Models on Supercomputers posts/drafts/2025/09/22: 📝 2025 Annual Report
 Theme Current: Light j/k or ↑/↓ + Enter
about Computational Scientist · Argonne National Laboratory (ALCF)

I'm a Computational Scientist in the AI / ML Group at the ALCF, where I work on training large models for scientific applications.

Generally, I am interested in developing novel algorithms and techniques to enable the training of large language models on scientific data, and in applying these models to accelerate scientific discovery across a wide range of domains.

Recent projects

Some recent projects include AERIS, a large language model for materials science that was a finalist for the 2025 Gordon Bell Prize; MProt-DPO, a large language model for protein design that was a finalist for the 2024 Gordon Bell Prize; and GenSLMs, a family of large language models for genomics that won the 2023 ACM Gordon Bell Special Prize for High Performance Computing-Based COVID-19 Research.

[m]ore →
 [ l]ast.fm Loading... --:-- [
]
work 2025 AERIS: Argonne Earth Systems Model ✨ Gordon Bell Finalist 2025 Aurora: Argonne’s First Exascale Supercomputer 2025 HiPerRAG: High-Performance RAG for Scientific Insights 2025 Automated Tuning for HMC Mass Ratios 2025 MOFA: Materials for Carbon Capture with GenAI 2024 MProt-DPO: Multimodal Protein Design 🌟 Gordon Bell Finalist 2024 Intro to HPC Bootcamp: Energy Justice Projects 2024 Large Transformer Model Training At-Scale 2023 MLMC: Machine Learning Monte Carlo 2023 Protein Generation via Genome-scale Language Models 2023 DeepSpeed4Science Initiative 2023 LLMs on Novel AI Accelerators 2023 GenSLMs: Genome-scale Language Models 🏆 Gordon Bell Special Prize 2022 Lattice QCD and Particle Physics 2022 Applications of ML to Lattice QFT 2021 LeapFrogLayers: Trainable Framework for Sampling 2021 HMC with Normalizing Flows 2021 Deep Learning Hamiltonian Monte Carlo 2020 ML and Neural Networks for Field Theory posts 26/05/01 Running 50k Python Processes on Aurora with ezpz yeet 26/04/27 Pre-Training AuroraGPT with TorchTitan 26/02/28 Comparing Launchers on Aurora 26/01/10 ezpz: distributed PyTorch across any hardware 26/01/07 Happy New Year! 25/11/12 Cooling Down Checkpoints: Best Practices for Model Evaluation 25/10/06 Mixing Between Distributions While Training 25/09/17 `pbs-tui`: TUI for PBS Job Scheduler Monitoring 25/09/12 BlendCorpus + TorchTitan @ ALCF 25/06/14 Building PyTorch 2.8 from Source on Aurora 25/06/02 Mermaid 25/06/01 Nice Headings 25/05/03 Frameworks Issue with numpy \> 2 25/04/28 Building PyTorch 2.6 from Source on Aurora 25/04/26 Sam Foreman’s Résumé 25/03/23 Torchtune on Aurora 25/03/23 Torchtune Patch on Aurora 24/11/15 svgbob 24/10/17 Converting Checkpoints 24/09/17 Spike Skipper talks 25/12/16 AuroraGPT: Training Foundation Models on Supercomputers 25/12/16 AuroraGPT: Training Foundation Models on Supercomputers 25/10/24 Training Foundation Models on Supercomputers 25/10/15 Training Foundation Models on Supercomputers 25/10/08 AERIS: Argonne's Earth Systems Model 25/09/24 Training Foundation Models on Supercomputers 25/09/02 Open SkAI2025 25/09/02 Scientific AI at Scale: AuroraGPT 25/09/02 Scientific AI at Scale: Distributed Training 25/07/31 AuroraGPT 25/05/21 LLMs on Aurora: Overview 25/05/07 ALCF Incite Hackathon 2025 25/05/07 LLMs on Aurora: Hands-On 25/02/12 AuroraGPT: Foundation Models for Science 24/11/05 Parallel Training Methods 24/10/30 AuroraGPT: ANL's General Purpose Scientific LLM 24/10/29 Deep Learning and Foundation Models at Scale 24/09/04 AuroraGPT 24/08/09 Training LLMs at Scale 24/07/17 Training LLMs on Polaris
 Command

Terminal-style knowledge base. Use the keybinds below to move around, then customize the look with the theme picker.

Theme
 Font
Keybinds
Navigation
j / ↓ Next item k / ↑ Previous item g First item in region G Last item in region zz Center focused item h / l Move left/right region ] / [ Next/previous heading } / { Next/previous block ⌃D / ⌃U Half-page down/up
Layout
<zh> / <zl> Toggle left/right sidebar <zj> / <zk> Focus main/navbar <S-h/j/k/l> Focus left/main/navbar/right ⌃H / ⌃L Focus left/right sidebar ⌃J / ⌃K Focus main/navbar ⇧C / ⇧E Collapse / expand all sections
Dialogs
⌃P / : Command palette ⌃X Theme picker / Search ? Show keybinds Esc / ⌃C Close dialog
History
⌃N Next document ⌃B Previous document ⌃O History back ⌃I History forward
 Search
about: Sam Foreman about/more: 🪪 More ideas: 💡 Ideas docs/test: Docs Test more: ➕ More now: Now posts: 📬 Posts projects: 📚 Projects webtui: Style talks: 🎙️ Talks posts/2025: 📆 2025 posts/auroragpt: 🤖 AuroraGPT posts/ai-for-physics: ⚛️ AI for Physics posts/dope-slides: 💅 How to Make Dope Slides posts/ezpz-at-alcf: 🍋 ezpz @ ALCF posts/ezpz-v1: 📝 ezpz-v1 posts/jupyter: 📗 Jupyter posts/resume: 🧑🏻‍💻 Sam Foreman’s Résumé posts/svgbob: 🫥 svgbob posts/torchtune-aurora: 🪛 Torchtune on Aurora posts/torchtune-patch-aurora: 🚑 Torchtune Patch on Aurora talks/auroragpt-siam25: AuroraGPT talks/ai-for-science-2024: Parallel Training Methods talks/alcf-hpc-workshop-2024/alcf-hpc-workshop-2024: Deep Learning and Foundation Models at Scale talks/aurora-gpt-fm-for-electric-grid/auroragpt-fm-for-electric-grid: AuroraGPT: Foundation Models for Science talks/demo-slides: AuroraGPT: Training Foundation Models on Supercomputers talks/incite-hackathon-2025: ALCF Incite Hackathon 2025 talks/hpc-user-forum/auroragpt: AuroraGPT talks/llms-at-scale: Training LLMs at Scale talks/llms-on-polaris: Training LLMs on Polaris talks/openskai25: Open SkAI2025 webtui/components/accordion: Accordion webtui/components/badge: Badge webtui/components/checkbox: Checkbox webtui/components/button: Button webtui/components/dialog: Dialog webtui/components/input: Input webtui/components/popover: Popover webtui/components/pre: Pre webtui/components/progress: Progress webtui/components/radio: Radio webtui/components/range: Range webtui/components/separator: Separator webtui/components/spinner: Spinner webtui/components/switch: Switch webtui/components/textarea: Textarea webtui/components/table: Table webtui/components/tooltip: Popover webtui/components/typography: Typography webtui/components/view: View webtui/contributing/contributing: Contributing webtui/contributing/contributing: ## Local Development webtui/contributing/contributing: ## Issues webtui/contributing/contributing: ## Pull Requests webtui/contributing/style-guide: Style Guide webtui/contributing/style-guide: ## CSS Units webtui/contributing/style-guide: ## Selectors webtui/contributing/style-guide: ## Documentation webtui/installation/astro: Astro webtui/installation/astro: ## Scoping webtui/installation/astro: ### Frontmatter Imports webtui/installation/astro: ### <style> tag webtui/installation/astro: ### Full Library Import webtui/installation/nextjs: Next.js webtui/installation/vite: Vite webtui/plugins/plugin-dev: Developing Plugins webtui/plugins/plugin-dev: ### Style Layers webtui/plugins/plugin-nf: Nerd Font Plugin webtui/plugins/theme-catppuccin: Catppuccin Theme webtui/plugins/theme-custom: Custom Theme webtui/plugins/theme-everforest: Everforest Theme webtui/plugins/theme-gruvbox: Gruvbox Theme webtui/plugins/theme-nord: Nord Theme webtui/plugins/theme-vitesse: Vitesse Theme webtui/start/ascii-boxes: ASCII Boxes webtui/start/changelog: Changelog webtui/start/installation: Installation webtui/start/installation: ## Installation webtui/start/installation: ## Using CSS webtui/start/installation: ## Using ESM webtui/start/installation: ## Using a CDN webtui/start/installation: ## Full Library Import webtui/start/installation: ### CSS webtui/start/installation: ### ESM webtui/start/installation: ### CDN webtui/start/intro: Introduction webtui/start/intro: ## Features webtui/start/plugins: Plugins webtui/start/plugins: ## Official Plugins webtui/start/plugins: ### Themes webtui/start/plugins: ## Community Plugins webtui/start/theming: Theming webtui/start/theming: ## CSS Variables webtui/start/theming: ### Font Styles webtui/start/theming: ### Colors webtui/start/theming: ### Light & Dark webtui/start/theming: ## Theme Plugins webtui/start/theming: ### Using Multiple Theme Accents webtui/start/tuis-vs-guis: TUIs vs GUIs webtui/start/tuis-vs-guis: ## Monospace Fonts webtui/start/tuis-vs-guis: ## Character Cells posts/2025/06: 06 posts/auroragpt/aurora-gpt: 🏎️ Megatron-DeepSpeed on Intel XPU posts/auroragpt/determinstic-flash-attn/deterministic-flash-attn: 🎰 Deterministic `flash-attn` posts/auroragpt/flash-attn-sunspot: 📸 `flash-attn` on Sunspot posts/auroragpt/checkpoints: 💾 Converting Checkpoints posts/auroragpt/long-sequences: 🚂 Loooooooong Sequence Lengths posts/auroragpt/spike-skipper: 🏔️ Spike Skipper posts/auroragpt/mpi4py-reproducer: 🐛 `mpi4py` bug on Sunspot posts/auroragpt/startup-times: 🐢 Starting Up Distributed Training on Aurora posts/auroragpt/startup-times: ## Response posts/auroragpt/startup-times: ### Measuring / Calculating Startup Time posts/auroragpt/startup-times: ## Minimal Working Example posts/ai-for-physics/diffusion: 🎲 MCMC + Diffusion Sampling posts/ai-for-physics/l2hmc-qcd: 🎢 L2HMC for LQCD posts/jupyter/l2hmc-4dsu3: 🔳 `l2hmc-qcd` Example: 4D SU(3) posts/jupyter/test: 🏁 `l2hmc` Example: 2D $U(1)$ talks/auroragpt/alcf-hpc-workshop-2024/auroragpt-alcf-hands-on-hpc-workshop-2024: AuroraGPT: ANL's General Purpose Scientific LLM talks/incite-hackathon-2025/ezpz: LLMs on Aurora: Hands-On talks/incite-hackathon-2025/auroragpt: LLMs on Aurora: Overview talks/openskai25/ai4science: Scientific AI at Scale: AuroraGPT talks/openskai25/training: Scientific AI at Scale: Distributed Training posts/2025/04/28: 🔥 Building PyTorch 2.6 from Source on Aurora posts/2025/05/03: 🚧 Frameworks Issue with numpy \> 2 posts/2025/06/01: 📰 Nice Headings posts/2025/06/14: 🏗️ Building PyTorch 2.8 from Source on Aurora posts/2025/06/02: 🧜‍♀️ Mermaid posts/2025/09/12: 🍹 BlendCorpus + TorchTitan @ ALCF posts/2025/09/17: 📊 `pbs-tui`: TUI for PBS Job Scheduler Monitoring posts/2025/10/06: 🎨 Mixing Between Distributions While Training posts/2025/11/12: 🧊 Cooling Down Checkpoints: Best Practices for Model Evaluation posts/2026/01/07: 🎉 Happy New Year! posts/2026/01/10: 🍋 ezpz: distributed PyTorch across any hardware posts/2026/04/27: Pre-Training AuroraGPT with TorchTitan posts/2026/04/27: ## Two-Week Summary (Apr 12–27, 2026) posts/2026/04/27: ## Detailed Breakdown posts/2026/04/27: ### Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes posts/2026/04/27: #### Benchmarking (Apr 12–15) posts/2026/04/27: #### LR Finder (Apr 12–14) posts/2026/04/27: #### Scaling Study (Apr 12) posts/2026/04/27: #### Upstream Syncs (Apr 12–18, syncs 6–14) posts/2026/04/27: #### XPU Bug Fixes (Apr 18) posts/2026/04/27: #### RL Experiment (Apr 18) posts/2026/04/27: ### Week 1.5: Apr 18–25 — Production Readiness posts/2026/04/27: #### Torch 2.12 Benchmarks (Apr 18) posts/2026/04/27: #### LR Finder Extensions (Apr 20–21) posts/2026/04/27: #### XPU Fixes (Apr 23) posts/2026/04/27: #### Torch 2.13 Environment (Apr 25) posts/2026/04/27: #### 2B Scaling Study on Torch 2.13 (Apr 25) posts/2026/04/27: #### Production Training (Apr 25) posts/2026/04/27: ### Week 2: Apr 26–27 — Optimizer Competition posts/2026/04/27: #### RL Multi-Task Refactor (Apr 26) posts/2026/04/27: #### Docs Reorganization (Apr 26) posts/2026/04/27: #### Generic HF Dataset Streaming (Apr 26) posts/2026/04/27: #### New Optimizers (Apr 26) posts/2026/04/27: #### Architecture Tweaks (Apr 26–27) posts/2026/04/27: ## Competition Results posts/2026/04/27: ### Round 1–3: Speedrun — 2N, GBS=48, 1000 steps posts/2026/04/27: ### 10B Full Training — 8N, GBS=384, ~3,178 steps posts/2026/04/27: ### Round 4: Reproducible Speedrun — 2N, GAS=8, GBS=384, 1000 steps posts/2026/04/27: ## Key Discoveries posts/2026/04/27: ## Infrastructure Built posts/2026/04/27: ## High-Level posts/2026/04/27: ## Detailed Breakdown posts/2026/04/27: ### Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes posts/2026/04/27: #### Benchmarking (Apr 12–15) posts/2026/04/27: #### LR Finder (Apr 12–14) posts/2026/04/27: #### Scaling Study (Apr 12) posts/2026/04/27: #### Upstream Syncs (Apr 12–18, syncs 6–14) posts/2026/04/27: #### XPU Bug Fixes (Apr 18) posts/2026/04/27: #### RL Experiment (Apr 18) posts/2026/04/27: ### Week 1.5: Apr 18–25 — Production Readiness posts/2026/04/27: #### Torch 2.12 Benchmarks (Apr 18) posts/2026/04/27: #### LR Finder Extensions (Apr 20–21) posts/2026/04/27: #### XPU Fixes (Apr 23) posts/2026/04/27: #### Torch 2.13 Environment (Apr 25) posts/2026/04/27: #### 2B Scaling Study on Torch 2.13 (Apr 25) posts/2026/04/27: #### Production Training (Apr 25) posts/2026/04/27: ### Week 2: Apr 26–27 — Optimizer Competition posts/2026/04/27: #### RL Multi-Task Refactor (Apr 26) posts/2026/04/27: #### Docs Reorganization (Apr 26) posts/2026/04/27: #### Generic HF Dataset Streaming (Apr 26) posts/2026/04/27: #### New Optimizers (Apr 26) posts/2026/04/27: #### Architecture Tweaks (Apr 26–27) posts/2026/04/27: ## Competition Results posts/2026/04/27: ### Round 1–3: 1000-step speedruns, 2 nodes, GBS=48 (17 configs) posts/2026/04/27: ### Round 4 (10B full training, 8 nodes, GBS=384, 5 configs) posts/2026/04/27: ### Round 5 (2 nodes, GAS=8, GBS=384, local dataset, 8 configs — in progress) posts/2026/04/27: ## Key Discoveries posts/2026/04/27: ## Infrastructure Built posts/2026/02/28: ⏱️ Comparing Launchers on Aurora posts/2026/02/28: ## torchrun posts/2026/02/28: ## ezpz posts/2026/05/01: Running 50k Python Processes on Aurora with ezpz yeet posts/2026/05/01: ## What it does posts/2026/05/01: ## CLI surface posts/2026/05/01: ### Choosing a local copy method posts/2026/05/01: ### Tarball source posts/2026/05/01: ### Generic (non-venv) sources posts/2026/05/01: ## How it works posts/2026/05/01: ### Local copy + patch posts/2026/05/01: ### Greedy fan-out posts/2026/05/01: ## Scaling on Aurora: 8 → 4096 nodes posts/2026/05/01: ### Two regimes posts/2026/05/01: ### Why tarball broadcast scales so much better than per-file rsync posts/2026/05/01: ## Reproducing posts/2026/05/01: ## Complete workflow posts/2026/05/01: ## See also posts/ai-for-physics/l2hmc-qcd/2du1: 🎢 l2hmc-qcd Example: 2D U(1) posts/ai-for-physics/l2hmc-qcd/4dsu3nb/index-broken: 🕸️ l2hmc-qcd Example: 4D SU(3) posts/jupyter/l2hmc/4dsu3: 🔳 l2hmc-qcd Example: 4D SU(3) talks/2025/09/24: Training Foundation Models on Supercomputers talks/2025/10/08: AERIS: Argonne's Earth Systems Model talks/2025/10/15: Training Foundation Models on Supercomputers talks/2025/10/24: Training Foundation Models on Supercomputers talks/2025/12/16: AuroraGPT: Training Foundation Models on Supercomputers posts/drafts/2025/09/22: 📝 2025 Annual Report
 Theme Current: Light j/k or ↑/↓ + Enter
NORMAL  main Sam Foreman sam.onl Top 1:1