 Command

Sam Foreman's personal site. Vim-style keybinds for navigation; theme + font pickers below.

Theme
 Font
Keybinds
Navigation
j / ↓ Next item k / ↑ Previous item g First item in region G Last item in region zz Center focused item h / l Move left/right region ] / [ Next/previous heading } / { Next/previous block ⌃D / ⌃U Half-page down/up
Layout
<zh> / <zl> Toggle left/right sidebar <zj> / <zk> Focus main/navbar <S-h/j/k/l> Focus left/main/navbar/right ⌃H / ⌃L Focus left/right sidebar ⌃J / ⌃K Focus main/navbar ⇧C / ⇧E Collapse / expand all sections
Dialogs
⌃P / : Command palette ⌃X Theme picker / Search ? Show keybinds Esc / ⌃C Close dialog
History
⌃N Next document ⌃B Previous document ⌃O History back ⌃I History forward
 Search
about: Sam Foreman docs/test: Docs Test ideas: 💡 Ideas about/more: 🪪 More now: Now more: ➕ More posts: 📬 Posts projects: 📚 Projects talks: 🎙️ Talks webtui: Style posts/2025: 📆 2025 posts/auroragpt: 🤖 AuroraGPT posts/ai-for-physics: ⚛️ AI for Physics posts/dope-slides: 💅 How to Make Dope Slides posts/ezpz-at-alcf: 🍋 ezpz @ ALCF posts/ezpz-v1: 📝 ezpz-v1 posts/jupyter: 📗 Jupyter posts/resume: 🧑🏻‍💻 Sam Foreman’s Résumé posts/svgbob: 🫥 svgbob posts/torchtune-aurora: 🪛 Torchtune on Aurora posts/torchtune-patch-aurora: 🚑 Torchtune Patch on Aurora talks/auroragpt-siam25: AuroraGPT talks/ai-for-science-2024: Parallel Training Methods talks/aurora-gpt-fm-for-electric-grid/auroragpt-fm-for-electric-grid: AuroraGPT: Foundation Models for Science talks/hpc-user-forum/auroragpt: AuroraGPT talks/alcf-hpc-workshop-2024/alcf-hpc-workshop-2024: Deep Learning and Foundation Models at Scale talks/demo-slides: AuroraGPT: Training Foundation Models on Supercomputers talks/incite-hackathon-2025: ALCF Incite Hackathon 2025 talks/llms-at-scale: Training LLMs at Scale talks/llms-on-polaris: Training LLMs on Polaris talks/openskai25: Open SkAI2025 webtui/components/accordion: Accordion webtui/components/badge: Badge webtui/components/button: Button webtui/components/checkbox: Checkbox webtui/components/dialog: Dialog webtui/components/input: Input webtui/components/popover: Popover webtui/components/pre: Pre webtui/components/progress: Progress webtui/components/radio: Radio webtui/components/range: Range webtui/components/separator: Separator webtui/components/spinner: Spinner webtui/components/switch: Switch webtui/components/table: Table webtui/components/textarea: Textarea webtui/components/tooltip: Popover webtui/components/typography: Typography webtui/components/view: View webtui/contributing/contributing: Contributing webtui/contributing/contributing: ## Local Development webtui/contributing/contributing: ## Issues webtui/contributing/contributing: ## Pull Requests webtui/contributing/style-guide: Style Guide webtui/contributing/style-guide: ## CSS Units webtui/contributing/style-guide: ## Selectors webtui/contributing/style-guide: ## Documentation webtui/installation/astro: Astro webtui/installation/astro: ## Scoping webtui/installation/astro: ### Frontmatter Imports webtui/installation/astro: ### <style> tag webtui/installation/astro: ### Full Library Import webtui/installation/nextjs: Next.js webtui/installation/vite: Vite webtui/start/ascii-boxes: ASCII Boxes webtui/start/changelog: Changelog webtui/start/installation: Installation webtui/start/installation: ## Installation webtui/start/installation: ## Using CSS webtui/start/installation: ## Using ESM webtui/start/installation: ## Using a CDN webtui/start/installation: ## Full Library Import webtui/start/installation: ### CSS webtui/start/installation: ### ESM webtui/start/installation: ### CDN webtui/start/intro: Introduction webtui/start/intro: ## Features webtui/start/plugins: Plugins webtui/start/plugins: ## Official Plugins webtui/start/plugins: ### Themes webtui/start/plugins: ## Community Plugins webtui/start/theming: Theming webtui/start/theming: ## CSS Variables webtui/start/theming: ### Font Styles webtui/start/theming: ### Colors webtui/start/theming: ### Light & Dark webtui/start/theming: ## Theme Plugins webtui/start/theming: ### Using Multiple Theme Accents webtui/start/tuis-vs-guis: TUIs vs GUIs webtui/start/tuis-vs-guis: ## Monospace Fonts webtui/start/tuis-vs-guis: ## Character Cells webtui/plugins/plugin-nf: Nerd Font Plugin webtui/plugins/plugin-dev: Developing Plugins webtui/plugins/plugin-dev: ### Style Layers webtui/plugins/theme-catppuccin: Catppuccin Theme webtui/plugins/theme-custom: Custom Theme webtui/plugins/theme-everforest: Everforest Theme webtui/plugins/theme-gruvbox: Gruvbox Theme webtui/plugins/theme-nord: Nord Theme webtui/plugins/theme-vitesse: Vitesse Theme posts/2025/06: 06 posts/auroragpt/aurora-gpt: 🏎️ Megatron-DeepSpeed on Intel XPU posts/auroragpt/determinstic-flash-attn/deterministic-flash-attn: 🎰 Deterministic `flash-attn` posts/auroragpt/flash-attn-sunspot: 📸 `flash-attn` on Sunspot posts/auroragpt/long-sequences: 🚂 Loooooooong Sequence Lengths posts/auroragpt/checkpoints: 💾 Converting Checkpoints posts/auroragpt/spike-skipper: 🏔️ Spike Skipper posts/auroragpt/mpi4py-reproducer: 🐛 `mpi4py` bug on Sunspot posts/auroragpt/startup-times: 🐢 Starting Up Distributed Training on Aurora posts/auroragpt/startup-times: ## Response posts/auroragpt/startup-times: ### Measuring / Calculating Startup Time posts/auroragpt/startup-times: ## Minimal Working Example posts/ai-for-physics/diffusion: 🎲 MCMC + Diffusion Sampling posts/ai-for-physics/l2hmc-qcd: 🎢 L2HMC for LQCD posts/jupyter/test: 🏁 `l2hmc` Example: 2D $U(1)$ talks/auroragpt/alcf-hpc-workshop-2024/auroragpt-alcf-hands-on-hpc-workshop-2024: AuroraGPT: ANL's General Purpose Scientific LLM posts/jupyter/l2hmc-4dsu3: 🔳 `l2hmc-qcd` Example: 4D SU(3) talks/incite-hackathon-2025/auroragpt: LLMs on Aurora: Overview talks/incite-hackathon-2025/ezpz: LLMs on Aurora: Hands-On talks/openskai25/ai4science: Scientific AI at Scale: AuroraGPT posts/2025/04/28: 🔥 Building PyTorch 2.6 from Source on Aurora talks/openskai25/training: Scientific AI at Scale: Distributed Training posts/2025/05/03: 🚧 Frameworks Issue with numpy \> 2 posts/2025/06/01: 📰 Nice Headings posts/2025/10/06: 🎨 Mixing Between Distributions While Training posts/2025/06/14: 🏗️ Building PyTorch 2.8 from Source on Aurora posts/2025/09/12: 🍹 BlendCorpus + TorchTitan @ ALCF posts/2025/11/12: 🧊 Cooling Down Checkpoints: Best Practices for Model Evaluation posts/2026/01/10: 🍋 ezpz: distributed PyTorch across any hardware posts/2025/06/02: 🧜‍♀️ Mermaid posts/2025/09/17: 📊 `pbs-tui`: TUI for PBS Job Scheduler Monitoring posts/2026/05/01: Running 50k Python Processes on Aurora with ezpz yeet posts/2026/05/01: ## What it does posts/2026/05/01: ## CLI surface posts/2026/05/01: ### Choosing a local copy method posts/2026/05/01: ### Tarball source posts/2026/05/01: ### Generic (non-venv) sources posts/2026/05/01: ## How it works posts/2026/05/01: ### Local copy + patch posts/2026/05/01: ### Greedy fan-out posts/2026/05/01: ## Scaling on Aurora: 8 → 4096 nodes posts/2026/05/01: ### Two regimes posts/2026/05/01: ### Why tarball broadcast scales so much better than per-file rsync posts/2026/05/01: ## Reproducing posts/2026/05/01: ## Complete workflow posts/2026/05/01: ## See also posts/2026/01/07: 🎉 Happy New Year! posts/2026/02/28: ⏱️ Comparing Launchers on Aurora posts/2026/02/28: ## torchrun posts/2026/02/28: ## ezpz posts/2026/04/27: Pre-Training AuroraGPT with TorchTitan posts/2026/04/27: ## Two-Week Summary (Apr 12–27, 2026) posts/2026/04/27: ## Detailed Breakdown posts/2026/04/27: ### Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes posts/2026/04/27: #### Benchmarking (Apr 12–15) posts/2026/04/27: #### LR Finder (Apr 12–14) posts/2026/04/27: #### Scaling Study (Apr 12) posts/2026/04/27: #### Upstream Syncs (Apr 12–18, syncs 6–14) posts/2026/04/27: #### XPU Bug Fixes (Apr 18) posts/2026/04/27: #### RL Experiment (Apr 18) posts/2026/04/27: ### Week 1.5: Apr 18–25 — Production Readiness posts/2026/04/27: #### Torch 2.12 Benchmarks (Apr 18) posts/2026/04/27: #### LR Finder Extensions (Apr 20–21) posts/2026/04/27: #### XPU Fixes (Apr 23) posts/2026/04/27: #### Torch 2.13 Environment (Apr 25) posts/2026/04/27: #### 2B Scaling Study on Torch 2.13 (Apr 25) posts/2026/04/27: #### Production Training (Apr 25) posts/2026/04/27: ### Week 2: Apr 26–27 — Optimizer Competition posts/2026/04/27: #### RL Multi-Task Refactor (Apr 26) posts/2026/04/27: #### Docs Reorganization (Apr 26) posts/2026/04/27: #### Generic HF Dataset Streaming (Apr 26) posts/2026/04/27: #### New Optimizers (Apr 26) posts/2026/04/27: #### Architecture Tweaks (Apr 26–27) posts/2026/04/27: ## Competition Results posts/2026/04/27: ### Round 1–3: Speedrun — 2N, GBS=48, 1000 steps posts/2026/04/27: ### 10B Full Training — 8N, GBS=384, ~3,178 steps posts/2026/04/27: ### Round 4: Reproducible Speedrun — 2N, GAS=8, GBS=384, 1000 steps posts/2026/04/27: ## Key Discoveries posts/2026/04/27: ## Infrastructure Built posts/2026/04/27: ## High-Level posts/2026/04/27: ## Detailed Breakdown posts/2026/04/27: ### Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes posts/2026/04/27: #### Benchmarking (Apr 12–15) posts/2026/04/27: #### LR Finder (Apr 12–14) posts/2026/04/27: #### Scaling Study (Apr 12) posts/2026/04/27: #### Upstream Syncs (Apr 12–18, syncs 6–14) posts/2026/04/27: #### XPU Bug Fixes (Apr 18) posts/2026/04/27: #### RL Experiment (Apr 18) posts/2026/04/27: ### Week 1.5: Apr 18–25 — Production Readiness posts/2026/04/27: #### Torch 2.12 Benchmarks (Apr 18) posts/2026/04/27: #### LR Finder Extensions (Apr 20–21) posts/2026/04/27: #### XPU Fixes (Apr 23) posts/2026/04/27: #### Torch 2.13 Environment (Apr 25) posts/2026/04/27: #### 2B Scaling Study on Torch 2.13 (Apr 25) posts/2026/04/27: #### Production Training (Apr 25) posts/2026/04/27: ### Week 2: Apr 26–27 — Optimizer Competition posts/2026/04/27: #### RL Multi-Task Refactor (Apr 26) posts/2026/04/27: #### Docs Reorganization (Apr 26) posts/2026/04/27: #### Generic HF Dataset Streaming (Apr 26) posts/2026/04/27: #### New Optimizers (Apr 26) posts/2026/04/27: #### Architecture Tweaks (Apr 26–27) posts/2026/04/27: ## Competition Results posts/2026/04/27: ### Round 1–3: 1000-step speedruns, 2 nodes, GBS=48 (17 configs) posts/2026/04/27: ### Round 4 (10B full training, 8 nodes, GBS=384, 5 configs) posts/2026/04/27: ### Round 5 (2 nodes, GAS=8, GBS=384, local dataset, 8 configs — in progress) posts/2026/04/27: ## Key Discoveries posts/2026/04/27: ## Infrastructure Built posts/ai-for-physics/l2hmc-qcd/2du1: 🎢 l2hmc-qcd Example: 2D U(1) posts/jupyter/l2hmc/4dsu3: 🔳 l2hmc-qcd Example: 4D SU(3) talks/2025/10/08: AERIS: Argonne's Earth Systems Model posts/ai-for-physics/l2hmc-qcd/4dsu3nb/index-broken: 🕸️ l2hmc-qcd Example: 4D SU(3) talks/2025/10/15: Training Foundation Models on Supercomputers talks/2025/09/24: Training Foundation Models on Supercomputers talks/2025/10/24: Training Foundation Models on Supercomputers talks/2026/06/03: Production Pre-Training at Scale: The Good, the Bad, and the Restarts talks/2025/12/16: AuroraGPT: Training Foundation Models on Supercomputers posts/drafts/2025/09/22: 📝 2025 Annual Report
 Theme Current: Light j/k or ↑/↓ + Enter

Training Foundation Models on Supercomputers

Sam Foreman 2025-10-15

🌐 Distributed Training

🚀 Scaling: Overview

In this talk, we will explore the intricacies of training foundation models on supercomputers. We will discuss the architecture of these models, the computational requirements, and the strategies employed to optimize training processes. Attendees will gain insights into the latest advancements in hardware and software that facilitate efficient model training at scale.

🐢 Training on a Single Device

flowchart LR subgraph G0["`GPU0`"] subgraph N0["`Network`"] end L0("`Loss`") end subgraph D["`Data`"] x("`x0`") x1("`x1`") x2("`x2`") end x --> N0 N0 --> L0 L0 --> N0 classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383 classDef eblock fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383 classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000 classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000 classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000 classDef grey fill:#cccccc,stroke:#333,stroke-width:1px,color:#000 class x,L0 red class x1 green class x2 blue class x3 grey class N0,G0,n0 block class D eblock

Figure 1: SLOW !! model size limited by GPU memory

🕸️ Parallelism Strategies

  • Data Parallelism
    • Split data across workers
    • Easiest to implement
    • No changes to model
  • Model Parallelism
    • Split model across workers
  • Hybrid Parallelism
    • Combine data + model parallelism
    • More complex to implement
    • Requires changes to model

👬 Training on Multiple GPUs: Data Parallelism

flowchart LR subgraph D["`Data`"] direction TB x2("`x2`") x1("`x1`") x("`x0`") end direction LR subgraph G0["`GPU0`"] direction LR subgraph N0["`NN`"] end %%y0("`y₀`") L0["`Loss`"] end subgraph G1["`GPU1`"] direction LR subgraph N1["`NN`"] end L1["`Loss`"] end subgraph G2["`GPU2`"] direction LR subgraph N2["`NN`"] end L2["`Loss`"] end x --> N0 x1 --> N1 x2 --> N2 N0 --> L0 N1 --> L1 N2 --> L2 classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383 classDef eblock fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383 classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383 classDef grey fill:#cccccc,stroke:#333,stroke-width:1px,color:#000 classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000 classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000 classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000 classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000 classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000 classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000 classDef eblock fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383 class x,y0,L0 red class x1,L1 green class x2,L2 blue class x3,ar grey class N0,N1,N2,G0,G1,G2,GU block class D eblock class AR block class bc text

Figure 2: Each GPU receives unique data at each step

▶️ Data Parallel: Forward Pass

flowchart LR subgraph D["`Data`"] direction TB x("`x0`") x1("`x1`") x2("`x2`") end direction LR subgraph G0["`GPU0`"] direction LR subgraph N0["`NN`"] end L0["`Loss`"] end subgraph G1["`GPU1`"] direction LR subgraph N1["`NN`"] end L1["`Loss`"] end subgraph G2["`GPU2`"] direction LR subgraph N2["`NN`"] end L2["`Loss`"] end ar("`Avg. Grads (∑ₙgₙ)/N`") x --> G0 x1 --> G1 x2 --> G2 N0 --> L0 N1 --> L1 N2 --> L2 L0 -.-> ar L1 -.-> ar L2 -.-> ar classDef eblock fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383 classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383 classDef grey fill:#cccccc,stroke:#333,stroke-width:1px,color:#000 classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000 classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000 classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000 classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000 classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000 classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000 classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383 class x,y0,L0 red class x1,L1 green class x2,L2 blue class x3,ar grey class N0,N1,N2,G0,G1,G2,GU block class D eblock class AR block class bc text

Figure 3: Average gradients across all GPUs

◀️ Data Parallel: Backward Pass

flowchart RL subgraph D["`Data`"] direction TB x("`x0`") x1("`x1`") x2("`x1`") end subgraph G0["`GPU0`"] direction RL subgraph N0["`NN`"] end L0["`Loss`"] end subgraph G1["`GPU1`"] direction RL subgraph N1["`NN`"] end L1["`Loss`"] end subgraph G2["`GPU2`"] direction RL subgraph N2["`NN`"] end L2["`Loss`"] end subgraph BC["`Send Updates`"] direction TB end BC -.-> G0 BC -.-> G1 BC -.-> G2 L0 ~~~ N0 L1 ~~~ N1 L2 ~~~ N2 G0 ~~~ x G1 ~~~ x1 G2 ~~~ x2 classDef grey fill:#cccccc,stroke:#333,stroke-width:1px,color:#000 classDef eblock fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383 classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383 classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000 classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000 classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000 classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000 classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000 classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000 classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383 class x,y0,L0 red class x1,L1 green class x2,L2 blue class x3,ar grey class N0,N1,N2,G0,G1,G2,GU block class BC block class bc text class D eblock

Figure 4: Send global updates back to each GPU. See: PyTorch / Distributed Data Parallel

🔄 Collective Communication

  • Broadcast: Send data from one node to all other nodes
  • Reduce: Aggregate data from all nodes to one node
    • AllReduce: Aggregate data from all nodes to all nodes
  • Gather: Collect data from all nodes to one node
    • AllGather: Collect data from all nodes to all nodes
  • Scatter: Distribute data from one node to all other nodes

Reduce

  • Perform a reduction on data across ranks, send to individual
flowchart TD subgraph R0["`0`"] x0("`x0`") end subgraph R1["`1`"] x1("`x1`") end subgraph R2["`2`"] x2("`x2`") end subgraph R3["`3`"] x3("`x3`") end subgraph AR["`Reduce`"] xp["`z=reduce(x, 2, SUM)`"] end subgraph AR3["`3`"] end subgraph AR2["`2`"] xp2("`z`") end subgraph AR1["`1`"] end subgraph AR0["`0`"] end x0 --> AR x1 --> AR x2 --> AR x3 --> AR AR --> AR3 AR --> xp2 AR --> AR1 AR --> AR0 classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383 classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000 classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000 classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000 classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000 classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000 classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000 classDef pink fill:#E599F7,stroke:#333,stroke-width:1px,color:#000 class R0,R1,R2,R3,AR,AR0,AR1,AR2,AR3 block class xp,xp2 purple class x0 red class x1 green class x2 blue class x3 yellow

Figure 5: Reduce operation: one rank receives the reduction of input values across ranks

🐣 Getting Started: In Practice

  • 🧠 Memory Management:
    • FSDP vs. ZeRO
    • Activation Checkpointing
    • Mixed Precision Training
    • Gradient Accumulation
    • Offloading to CPU/NVMe
🔄 Keeping things in Sync

Computation stalls during communication !!

Keeping the communication to computation ratio small is important for effective scaling.

📝 Plan of Attack

flowchart TB A{"Model Perfect?"} A -- no --> M{"Available Memory?"} A -- yes --> AD["Done"] M -- yes --> MY["Make Model Larger"] M -- no --> ZMP["<b>Free Up Memory</b>"] MY --> A ZMP --> MY A:::block M:::block AD:::block MY:::block ZMP:::sblock classDef text fill:#CCCCCC02,stroke:#838383,stroke-width:0px,color:#838383 classDef sblock fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383,white-space:collapse classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383

Figure 6: General strategy for scaling model training

🚀 Going Beyond Data Parallelism

Going beyond Data Parallelism: DeepSpeed + ZeRO

  • Depending on the ZeRO stage (1, 2, 3), we can offload:
    1. Stage 1: optimizer states (Pos)\left(P_{\mathrm{os}}\right)
    2. Stage 2: gradients + opt. states (Pos+g)\left(P_{\mathrm{os}+\mathrm{g}}\right)
    3. Stage 3: model params + grads + opt. states (Pos+g+p)\left(P_{\mathrm{os}+\mathrm{g}+\mathrm{p}}\right)

Figure 7: DeepSpeed + ZeRO

🕸️ Additional Parallelism Strategies

Pipeline Parallelism (PP)

flowchart TB subgraph G0["`GPU 0`"] direction LR a0("`Layer 0`") b0("`Layer 1`") end subgraph G1["`GPU 1`"] direction LR a1("`Layer 2`") b1("`Layer 3`") end a0 -.-> b0 b0 --> a1 a1 -.-> b1 classDef block fill:#CCCCCC02,stroke:#838383,stroke-width:1px,color:#838383 classDef red fill:#ff8181,stroke:#333,stroke-width:1px,color:#000 classDef orange fill:#FFC47F,stroke:#333,stroke-width:1px,color:#000 classDef yellow fill:#FFFF7F,stroke:#333,stroke-width:1px,color:#000 classDef green fill:#98E6A5,stroke:#333,stroke-width:1px,color:#000 classDef blue fill:#7DCAFF,stroke:#333,stroke-width:1px,color:#000 classDef purple fill:#FFCBE6,stroke:#333,stroke-width:1px,color:#000 class G0,G1 block class a0 red class b0 green class a1 blue class b1 yellow

Figure 8: Pipeline Parallelism

Tensor Parallel (TP)

Tensor Parallel (TP)

  • Split up network over multiple workers
  • Each receives disjoint subset
  • All communication associated with subsets are distributed
  • Communication whenever dataflow between two subsets
  • Typically more complicated to implement than data parallel training
  • Suitable when the model is too large to fit onto a single device (CPU / GPU)

Tensor (/ Model) Parallel Training: Example

Want to compute: y=ixiWi=x0W0+x1W1+x2W2y = \sum_{i} x_{i} W_{i} = x_0 * W_0 + x_1 * W_1 + x_2 * W_2
where each GPU has only its portion of the full weights as shown below

  1. Compute: y0=x0W0y_{0} = x_{0} * W_{0}\rightarrow GPU1
  2. Compute: y1=y0+x1W1y_{1} = y_{0} + x_{1} * W_{1}\rightarrow GPU2
  3. Compute: y=y1+x2W2=ixiWiy = y_{1} + x_{2} * W_{2} = \sum_{i} x_{i} W_{i}
flowchart LR subgraph X0["`GPU0`"] direction LR a("`W0`") end subgraph X1["`GPU1`"] direction LR b("`W1`") end subgraph X2["`GPU2`"] direction LR c("`W2`") end t0("`x0`")-->X0 X0 -->|"`x0 W0`"|X1 X1 -->|"`x0 W0 + x1 W1`"|X2 t1("`x1`") --> X1 t2("`x1`") --> X2

Figure 11

🔭 AI-for-Science source (@tenderizzation)


ChatGPT: explain this image

Reverse Diffusion ProcessForward Diffusion Process (\pi\rightarrow \mathcal{N})

🌀 Sequence-Window-Pipeline Parallelism SWiPe

  • SWiPe is a novel parallelism strategy for Swin-based Transformers
  • Hybrid 3D Parallelism strategy, combining:
    • Sequence parallelism (SP)
    • Window parallelism (WP)
    • Pipeline parallelism (PP)

Figure 23

Figure 24: SWiPe Communication Patterns

🚀 AERIS: Scaling Results

Figure 25: AERIS: Scaling Results

  • 10 EFLOPs (sustained) @ 120,960 GPUs
  • See (Hatanpää et al. (2025)) for additional details
  • arXiv:2509.13523

🌪️ Hurricane Laura

Figure 26: Hurricane Laura tracks (top) and intensity (bottom). Initialized 7(a), 5(b) and 3(c) days prior to 2020-08-28T00z.

📓 References

Hatanpää, Väinö, Eugene Ku, Jason Stock, et al. 2025. AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions. https://arxiv.org/abs/2509.13523.

Price, Ilan, Alvaro Sanchez-Gonzalez, Ferran Alet, et al. 2024. GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather. https://arxiv.org/abs/2312.15796.

Song, Shuaiwen Leon, Bonnie Kruft, Minjia Zhang, et al. 2023. DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery Through Sophisticated AI System Technologies. https://arxiv.org/abs/2310.04610.

❤️ Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

NORMAL  main  sam.onl/ talks/2025/10/15/index.mdx · Top 1:1