Sam Foreman's personal site.

Pre-Training AuroraGPT with TorchTitan

Pre-training AuroraGPT with TorchTitan and ezpz: Last Two Weeks (Apr 12–27, 2026)

Two-Week Summary (Apr 12–27, 2026)

Over ~2 weeks and 291 commits, we built a comprehensive training optimization research platform on Intel XPU (Sunspot), going from basic benchmarking to a systematic optimizer/architecture competition with live experiment tracking.

Three major arcs:

  1. Benchmarking & Infrastructure (Apr 12–18): Built LR finder, scaling study tooling, and benchmark scripts across 3 ALCF machines. Established baselines for agpt_2b/20b/80b and MoE models. Fixed critical XPU bugs (XCCL barriers, 80B TP=2 OOM). Merged upstream fused QKV, token dispatcher, and expert parallelism.

  2. Production Readiness (Apr 18–25): Created production training scripts, set up torch 2.13 venv (+23% throughput), ran 2B weak scaling study (2–64 nodes, 94% efficiency at 64N), and started production training runs with checkpointing and W&B tracking.

  3. Optimizer Competition (Apr 26–27): Launched a systematic optimizer/architecture competition. Implemented 3 new optimizers (Mano, SPAM, TorchMuon), 3 architecture tweaks (QK-Norm, logit softcapping, ReLU²), and ran 40+ experiments across 4 rounds. Built generic HF dataset streaming, local dataset caching, and live loss curve plotting.

Detailed Breakdown

Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes

Benchmarking (Apr 12–15)

LR Finder (Apr 12–14)

  • Implemented exponential LR sweep with derivative-based blow-up detection
  • Ran sweeps for 2B/20B × {AdamW, Muon, SophiaG} across 3 machines
  • Added model-size-aware weight init std = sqrt(2/(5*d)) — increased Muon’s usable LR range by 3x
  • Key results: AdamW 1.3e-3, Muon 2.4e-3, SophiaG 3.1e-4
  • Reports: LR finder README, Sunspot results, Aurora results, Polaris results
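
The sweep and blow-up detection described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repo's LR finder: the function names `lr_schedule` and `find_blowup`, the slope threshold, and the idea of measuring the slope against log(LR) are all hypothetical choices made for this example.

```python
import math


def lr_schedule(lr_min: float, lr_max: float, num_steps: int) -> list[float]:
    """Exponentially spaced learning rates from lr_min to lr_max (num_steps >= 2)."""
    ratio = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    return [lr_min * ratio**i for i in range(num_steps)]


def find_blowup(lrs: list[float], losses: list[float], slope_threshold: float = 1.0):
    """Return the first index where the loss is non-finite or where
    d(loss)/d(log lr) exceeds slope_threshold; return None if neither happens."""
    for i in range(1, len(losses)):
        if not math.isfinite(losses[i]):
            return i
        dlog_lr = math.log(lrs[i]) - math.log(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / dlog_lr
        if slope > slope_threshold:
            return i
    return None
```

In use, one would record the training loss at each LR in the schedule and treat the LR just before the returned index as an upper bound on the usable range.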

Scaling Study (Apr 12)

  • Built multi-node scaling study tooling
  • Ran weak scaling 1–64 nodes on Sunspot (torch 2.10)
  • 2B: 74% efficiency at 64N, 20B: 87% efficiency
  • Report: scaling/

Upstream Syncs (Apr 12–18, syncs 6–14)

  • Merged 8 upstream batches including fused QKV GQAttention, token dispatcher, expert parallelism, GraphTrainer improvements
  • Replayed llama3/agpt/ and deepseek_v3/moe/ changes
  • Updated attention backend and profiling APIs for upstream compatibility
  • Full log: upstream-sync.md

XPU Bug Fixes (Apr 18)

RL Experiment (Apr 18)

  • Added TRL-based GRPO training for sum-of-digits task on XPU — 525d0540
  • No vLLM dependency — uses HF .generate() for portability
  • Docs: rl/README.md
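
For readers unfamiliar with GRPO-style training, the reward for a task like this can be a plain Python function that scores each completion. The sketch below is an assumption about what such a reward might look like: the task definition (sum the decimal digits of a number), the function names, and the "last integer in the completion is the answer" rule are all hypothetical and not taken from the repo.

```python
import re


def digit_sum(n: int) -> int:
    """Sum of the decimal digits of n (sign ignored)."""
    return sum(int(ch) for ch in str(abs(n)))


def sum_of_digits_reward(completions: list[str], numbers: list[int]) -> list[float]:
    """Return 1.0 for each completion whose last integer equals the digit sum
    of the corresponding input number, else 0.0."""
    rewards = []
    for text, number in zip(completions, numbers):
        found = re.findall(r"-?\d+", text)
        # Treat the last integer in the completion as the model's final answer.
        correct = bool(found) and int(found[-1]) == digit_sum(number)
        rewards.append(1.0 if correct else 0.0)
    return rewards
```

A binary reward like this needs only generated text, which fits the choice of HF .generate() over a vLLM dependency.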

Week 1.5: Apr 18–25 — Production Readiness

Torch 2.12 Benchmarks (Apr 18)

  • Ran comparison benchmarks on Sunspot
  • 2B: +11% TPS, -49% memory; 20B: +29% TPS
  • Discovered 80B AC+TP regression (DeviceMesh in autograd graph)
  • First successful EP (expert parallelism) runs for MoE
  • Report: torch 2.12 benchmark

LR Finder Extensions (Apr 20–21)

  • Added 80B LR finder (AdamW only — Muon/SophiaG overflow at dim=9216) — 80B LR finder report
  • Added MoE LR finder (5 configs × 3 optimizers) — MoE LR finder report
  • Added GAS sweep to test gradient accumulation effects
  • Documented Muon bf16 overflow limitation for large models

XPU Fixes (Apr 23)

  • Fixed torch 2.10 XCCL barrier hangs (gloo fallback) — 312045b3
  • Reverted DTensor TP (use_local_output=False) — shape mismatches on XPU
  • Merged 3 upstream sync batches (full DTensor TP, GraphTrainer SAC+bucketing)

Torch 2.13 Environment (Apr 25)

  • Set up .venv/ with PyTorch 2.13 built from source for XPU
  • Created ezpz yeet-env integration for copying to compute nodes
  • Created venv-based training scripts for agpt 2B/20B — 392863f9
  • Guide: running-with-newer-pytorch.md

2B Scaling Study on Torch 2.13 (Apr 25)

  • Ran weak scaling 2–64 nodes on Sunspot
  • 7,142 TPS/GPU at 2N (27.6% MFU) — +23% over torch 2.10
  • Near-perfect scaling to 8N, 94% efficiency at 64N
  • Report: scaling/agpt-2b.md
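
To make the efficiency figure concrete, here is a small sketch of the arithmetic. It assumes weak-scaling efficiency is defined as per-GPU throughput relative to the 2-node baseline; the post does not state the definition, and the 64N throughput below is derived from the quoted numbers rather than read from the report.

```python
def weak_scaling_efficiency(tps_per_gpu: float, baseline_tps_per_gpu: float) -> float:
    """Per-GPU throughput at scale divided by per-GPU throughput at the baseline."""
    return tps_per_gpu / baseline_tps_per_gpu


def implied_tps_per_gpu(baseline_tps_per_gpu: float, efficiency: float) -> float:
    """Invert the definition: per-GPU throughput implied by a given efficiency."""
    return baseline_tps_per_gpu * efficiency


# Quoted numbers: 7,142 TPS/GPU at 2N and 94% efficiency at 64N.
# Under the assumed definition this implies roughly 6,713 TPS/GPU at 64N.
tps_64n = implied_tps_per_gpu(7142.0, 0.94)
```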

Production Training (Apr 25)

  • Created production training scripts with checkpoint management
  • Started production runs for 2B (SophiaG) and 20B
  • Added per-model tracking with loss curve plots and W&B integration
  • Tracking: production/

Week 2: Apr 26–27 — Optimizer Competition

RL Multi-Task Refactor (Apr 26)

  • Refactored from hardcoded sum-of-digits to pluggable task registry — 7eab3332
  • RLTask dataclass + register_task() / get_task()
  • Added 3 new tasks: multiply, word_sort, countdown
  • Code: rl/tasks/
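
A pluggable registry of this kind might look roughly like the sketch below. Only the names RLTask, register_task() and get_task() come from the post; the dataclass fields, the error handling, and the example `multiply` task body are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RLTask:
    name: str
    make_prompt: Callable[[int, int], str]  # hypothetical field
    answer: Callable[[int, int], int]       # hypothetical field


_TASKS: dict[str, RLTask] = {}


def register_task(task: RLTask) -> RLTask:
    """Add a task to the registry, rejecting duplicate names."""
    if task.name in _TASKS:
        raise ValueError(f"task {task.name!r} is already registered")
    _TASKS[task.name] = task
    return task


def get_task(name: str) -> RLTask:
    """Look up a task by name, listing the known tasks on failure."""
    try:
        return _TASKS[name]
    except KeyError:
        raise KeyError(f"unknown task {name!r}; registered: {sorted(_TASKS)}") from None


register_task(RLTask(
    name="multiply",
    make_prompt=lambda a, b: f"What is {a} * {b}?",
    answer=lambda a, b: a * b,
))
```

With this shape, adding a task is a single register_task() call and the training script only needs the task name.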

Docs Reorganization (Apr 26)

  • Restructured 12 root-level docs into configs/, guides/, scaling/, production/ — commits 2f47a46a, 68130bac, ce4d079e, cb132a54
  • Fixed all 31 internal cross-references (link checker: 0 broken) — f650e3eb

Generic HF Dataset Streaming (Apr 26)

  • Created datasets.py with register_hf_dataset() + auto-fallback for arbitrary HF paths. Commit: 329bb5cb
  • --dataloader.dataset stanfordnlp/imdb just works, no code changes
  • Pre-registered 7 common datasets
  • Downloaded FineWeb-Edu 100BT locally (267 GB, /lus/tegu/projects/datasets/datasets/fineweb-edu-100BT/) — 5cb32036
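
The registration-plus-fallback behavior can be sketched as below. The real signature of register_hf_dataset() is not shown in the post, so the arguments, the `HFDatasetSpec` dataclass, `resolve_dataset()` and `stream()` are assumptions made for illustration. The only third-party call used is `datasets.load_dataset(..., streaming=True)`, imported lazily so the registry logic runs without the library installed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HFDatasetSpec:
    path: str
    split: str = "train"
    text_field: str = "text"


_REGISTRY: dict[str, HFDatasetSpec] = {}


def register_hf_dataset(name: str, path: str, split: str = "train",
                        text_field: str = "text") -> HFDatasetSpec:
    """Register a short name for a Hugging Face dataset path."""
    spec = HFDatasetSpec(path=path, split=split, text_field=text_field)
    _REGISTRY[name] = spec
    return spec


def resolve_dataset(name: str) -> HFDatasetSpec:
    """Registered names win; anything that looks like 'org/dataset' falls back
    to a default spec so new datasets need no code changes."""
    if name in _REGISTRY:
        return _REGISTRY[name]
    if "/" in name:
        return HFDatasetSpec(path=name)
    raise KeyError(f"unknown dataset {name!r}")


def stream(spec: HFDatasetSpec):
    """Open the dataset in streaming mode (requires the `datasets` package)."""
    from datasets import load_dataset
    return load_dataset(spec.path, split=spec.split, streaming=True)


register_hf_dataset("imdb", "stanfordnlp/imdb")
```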

New Optimizers (Apr 26)

Architecture Tweaks (Apr 26–27)

  • QK-Norm — RMSNorm on Q,K before attention. New 2B_qknorm variant. Commit: e6e666cb
  • Logit Softcapping — FlexAttention score_mod with tanh cap at 30.0. Falls back to eager on XPU (4x slower — Triton-XPU can’t codegen tanh). Commit: 01b88cfe, fix: 9d9e7529
  • ReLU² — Subclassed FeedForward with relu(x)². Didn’t help (3.92 vs 3.80 baseline). Commit: 01b88cfe
  • WSM — Checkpoint merging utility (eval/merge_checkpoints.py). Commit: 01b88cfe
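
For reference, tanh softcapping is usually written as cap * tanh(score / cap). The scalar sketch below shows only that formula, assuming the repo uses this common form with the stated cap of 30.0; in FlexAttention the same expression would live inside a score_mod callable operating on tensor scores.

```python
import math


def softcap(score: float, cap: float = 30.0) -> float:
    """Squash an attention logit smoothly into the open interval (-cap, cap).

    Near zero the function is close to the identity; large logits saturate at +/-cap.
    """
    return cap * math.tanh(score / cap)
```

Small logits pass through almost unchanged while outliers are bounded, which is why the tanh has to be evaluated for every attention score.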

Competition Results

Round 1–3: Speedrun — 2N, GBS=48, 1000 steps

  • W&B: aurora_gpt/torchtitan.ezpz.train
  • Docs: competitions/agpt2b-n2-1000steps/
  • Jobs: 12465381–85 (completed), 12465386–87 (Muon resubmit, 3h walltime)
  • Configs: competition/configs.py

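| Rank | Config | Loss | TPS |
|------|--------|------|-----|
| 1 | Muon | 3.557 | 4,556 |
| 2 | AdamW+QK-Norm | 3.569 | 7,178 |
| 3 | Mano+QK-Norm | 3.604 | 6,980 |
| 4 | Mano | 3.631 | 7,048 |
| 5 | AdamW | 3.801 | 7,245 |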

10B Full Training — 8N, GBS=384, ~3,178 steps

  • W&B: aurora_gpt/torchtitan.ezpz.train
  • Docs: competitions/agpt2b-n8-10BT/
  • Jobs: 12465430–34 (8 nodes each, 6h walltime)

| Rank | Config | Loss | TPS |
|------|--------|------|-----|
| 1 | AdamW | 2.711 | 7,354 |
| 2 | AdamW+QK-Norm | 2.720 | 7,480 |
| 3 | Mano+QK-Norm | 2.854 | 7,346 |
| 4 | Mano | 2.875 | 7,429 |
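
The step count in this round's heading is consistent with a 10B-token budget if the sequence length is 8192. That sequence length is an assumption, since the post does not state it; the sketch below only shows the arithmetic.

```python
def steps_for_token_budget(total_tokens: int, global_batch_size: int, seq_len: int) -> int:
    """Number of full optimizer steps that fit in a token budget."""
    tokens_per_step = global_batch_size * seq_len
    return total_tokens // tokens_per_step


# GBS=384 with an assumed seq_len of 8192 gives 3,145,728 tokens per step,
# so a 10B-token budget covers 3,178 full steps.
steps = steps_for_token_budget(10_000_000_000, 384, 8192)
```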

Round 4: Reproducible Speedrun — 2N, GAS=8, GBS=384, 1000 steps

  • W&B: aurora_gpt/torchtitan.ezpz.train
  • Docs: competitions/agpt2b-n2-gas8-1000steps/
  • Jobs: 12465460–67 (2 nodes each, 6h walltime)
  • Dataset: Local FineWeb-Edu 100BT (/lus/tegu/projects/datasets/datasets/fineweb-edu-100BT/)

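| Rank | Config | Loss | TPS |
|------|--------|------|-----|
| 1 | AdamW+QK-Norm | 3.205 | 7,428 |
| 2 | AdamW | 3.220 | 7,397 |
| 3 | Mano | 3.294 | 7,397 |
| 4 | Mano+QK-Norm | 3.307 | 7,423 |
| 5 | Mano (8.5e-4) | 3.328 | 7,348 |
| 6 | AdamW (3.7e-3) | 5.884 | 7,603 |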

Key Discoveries

  • Muon’s 35% speed penalty is inherent to Newton-Schulz on XPU, not fixable by implementation — confirmed by torch.optim.Muon matching custom Muon TPS. Commit: a382213e
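  • QK-Norm helps massively in short runs (loss 3.569 vs 3.801, a 0.23 improvement) but washes out at 10B, where it ends 0.009 behind plain AdamW (2.720 vs 2.711). It helps again in the round 4 decay phase (3.205 vs 3.220, a 0.015 improvement)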
  • Rankings flip between short/long training: Muon/Mano win speedruns, AdamW wins in cosine decay phase at scale
  • FlexAttention on XPU falls back to eager (4x slower) because TorchInductor cannot generate code for some operators there (e.g. the tanh in the softcap score_mod)
  • HF streaming data shuffle causes ~1.3 loss variance between runs — commit: 16aa4d46
  • sqrt LR scaling rule too aggressive — AdamW at 3.7e-3 diverged (confirmed in round 4)
  • ReLU² hurts this architecture (3.92 vs 3.80 baseline)
  • Local dataset loader + softcap has a data sharding bug causing memorization at small GBS
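
The diverging 3.7e-3 learning rate matches what square-root scaling gives when going from GBS=48 to GBS=384, if the base rate is the LR-finder value of 1.3e-3. That the finder value was the GBS=48 baseline is an assumption; the sketch below shows only the arithmetic.

```python
import math


def sqrt_scaled_lr(base_lr: float, base_gbs: int, new_gbs: int) -> float:
    """Square-root LR scaling: multiply the LR by sqrt(new_gbs / base_gbs)."""
    return base_lr * math.sqrt(new_gbs / base_gbs)


# 384 / 48 = 8, so the multiplier is sqrt(8), about 2.83.
# 1.3e-3 scaled this way is about 3.68e-3, which rounds to the 3.7e-3 that diverged.
scaled = sqrt_scaled_lr(1.3e-3, 48, 384)
```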

Infrastructure Built

| Feature | File | Commit |
|---------|------|--------|
| Generic HF datasets | datasets.py | 329bb5cb |
| Competition configs | competition/configs.py | 9970fec8 |
| PBS submit script | competition/submit_run.sh | 9970fec8 |
| WSM checkpoint merging | eval/merge_checkpoints.py | 01b88cfe |
| Training plot utility | eval/plot_training.py | 7b9ff388 |
| Per-experiment tracking | docs/competitions/ | e96ba34c |
| Development journal | docs/journal.md | fcca79d8 |
| Project rules | .claude/CLAUDE.md | 47dc5e8b |
| RL task registry | rl/tasks/ | 7eab3332 |
| Upstream sync log | docs/upstream-sync.md | 20 entries |

First Draft

High-Level

Over ~2 weeks and 291 commits, we built a comprehensive training optimization research platform on Intel XPU (Sunspot), going from basic benchmarking to a systematic optimizer/architecture competition with live experiment tracking.

Three major arcs:

  1. Benchmarking & Infrastructure (Apr 12–18): Built LR finder, scaling study tooling, and benchmark scripts across 3 ALCF machines. Established baselines for agpt_2b/20b/80b and MoE models. Fixed critical XPU bugs (XCCL barriers, 80B TP=2 OOM). Merged upstream fused QKV, token dispatcher, and expert parallelism.
  2. Production Readiness (Apr 18–25): Created production training scripts, set up torch 2.13 venv (+23% throughput), ran 2B weak scaling study (2–64 nodes, 94% efficiency at 64N), and started production training runs with checkpointing and W&B tracking.
  3. Optimizer Competition (Apr 26–27): Launched a systematic optimizer/architecture competition. Implemented 3 new optimizers (Mano, SPAM, TorchMuon), 3 architecture tweaks (QK-Norm, logit softcapping, ReLU²), and ran 40+ experiments across 4 rounds. Built generic HF dataset streaming, local dataset caching, and live loss curve plotting.

Detailed Breakdown

Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes

Benchmarking (Apr 12–15)

  • Built run_benchmarks.sh with configurable model sweeps
  • Ran 2-node benchmarks across all 11 agpt configs and 7 MoE configs on Sunspot, Aurora, and Polaris
  • Built comprehensive 80B throughput leaderboard (18 experiments, 5 model variants × TP sweeps)
  • Best 80B: 80B_wide TP=4 compile = 68 TPS, 11.8% MFU

LR Finder (Apr 12–14)

  • Implemented exponential LR sweep with derivative-based blow-up detection
  • Ran sweeps for 2B/20B × {AdamW, Muon, SophiaG}
  • Added model-size-aware weight init (std = sqrt(2/(5*d))) — increased Muon’s usable LR range by 3x
  • Key results: AdamW 1.3e-3, Muon 2.4e-3, SophiaG 3.x
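
The model-size-aware init is simple enough to state as code. The sketch below implements the formula from the bullet above; the function name is hypothetical, and the dim=9216 example reuses the 80B dimension mentioned elsewhere in the post.

```python
import math


def init_std(d_model: int) -> float:
    """Weight-init standard deviation std = sqrt(2 / (5 * d)) for hidden size d."""
    return math.sqrt(2.0 / (5.0 * d_model))


# The std shrinks as the model gets wider, unlike a fixed constant such as 0.02.
std_wide = init_std(9216)    # about 0.0066
std_narrow = init_std(2048)  # about 0.0140
```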

Scaling Study (Apr 12)

  • Built multi-node scaling study tooling
  • Ran weak scaling 1–64 nodes on Sunspot (torch 2.10)
  • 2B: 74% efficiency at 64N, 20B: 87% efficiency

Upstream Syncs (Apr 12–18, syncs 6–14)

  • Merged 8 upstream batches including fused QKV GQA, token dispatcher patcher, expert parallelism, GraphTrainer improvements
  • Replayed llama3 → agpt and deepseek_v3 → moe changes
  • Updated attention backend and profiling APIs for upstream compatibility

XPU Bug Fixes (Apr 18)

  • Fixed SDPA set_priority=True dynamo bug (EzpzScaleOn)
  • Fixed IPEX import causing 80B TP=2 OOM on Aurora (+60 MiB overhead)
  • Restored 80B TP=2 on Aurora: 88 TPS, 16% MFU
  • Enabled MoE expert parallelism on torch 2.12+

RL Experiment (Apr 18)

  • Added TRL-based GRPO training for sum-of-digits task
  • No vLLM dependency — uses HF .generate() for portability

Week 1.5: Apr 18–25 — Production Readiness

Torch 2.12 Benchmarks (Apr 18)

  • Ran comparison benchmarks on Sunspot
  • 2B: +11% TPS, -49% memory; 20B: +29% TPS
  • Discovered 80B AC+TP regression (DeviceMesh in autocast)
  • First successful EP (expert parallelism) runs for MoE

LR Finder Extensions (Apr 20–21)

  • Added 80B LR finder (AdamW only — Muon/SophiaG overflow)
  • Added MoE LR finder (5 configs × 3 optimizers)
  • Added GAS sweep to test gradient accumulation effects
  • Documented Muon bf16 overflow limitation for large models

XPU Fixes (Apr 23)

  • Fixed torch 2.10 XCCL barrier hangs (gloo fallback)
  • Reverted DTensor TP (use_local_output=False) — shape mismatches on XPU
  • Merged 3 upstream sync batches (full DTensor TP, GQA bucketing)

Torch 2.13 Environment (Apr 25)

  • Set up .venv/ with PyTorch 2.13 built from source for XPU
  • Created ezpz yeet-env integration for copying to compute nodes
  • Created venv-based training scripts for agpt 2B/20B

2B Scaling Study on Torch 2.13 (Apr 25)

  • Ran weak scaling 2–64 nodes on Sunspot
  • 7,142 TPS/GPU at 2N (27.6% MFU) — +23% over torch 2.10
  • Near-perfect scaling to 8N, 94% efficiency at 64N

Production Training (Apr 25)

  • Created production training scripts with checkpoint management
  • Started production runs for 2B (SophiaG) and 20B
  • Added per-model tracking with loss curve plots and W&B integration

Week 2: Apr 26–27 — Optimizer Competition

RL Multi-Task Refactor (Apr 26)

  • Refactored from hardcoded sum-of-digits to pluggable RLTask dataclass + register_task() / get_task()
  • Added 3 new tasks: multiply, word_sort, countdown
  • Added CLI args to train_grpo.py

Docs Reorganization (Apr 26)

  • Restructured 12 root-level docs into configs/, guides/, production/
  • Renamed production-training/ to production/
  • Consolidated 4 scaling/benchmark files into per-model structure
  • Fixed all 31 internal cross-references (link checker: 0 broken)

Generic HF Dataset Streaming (Apr 26)

  • Created datasets.py with register_hf_dataset() + auto-registry for arbitrary HF paths
  • --dataloader.dataset stanfordnlp/imdb just works, no code changes
  • Pre-registered 7 common datasets
  • Downloaded FineWeb-Edu 100BT locally (267 GB) for reproducible runs

New Optimizers (Apr 26)

  • Mano (optimizer/mano.py) — manifold-normalized, uses Cayley transform instead of Newton-Schulz. Runs at AdamW speed (~7,200 TPS) vs Muon’s 4,700. Based on arxiv 2601.23000.
  • SPAM (optimizer/spam.py) — spike-aware gradient clipping with momentum reset. Based on arxiv 2501.06842.
  • TorchMuon: _CompositeOptimizer wrapper for torch.optim.Muon (built-in since PyTorch 2.9). Confirmed same speed as custom Muon on XPU, so the Newton-Schulz overhead is algorithmic rather than an implementation issue.
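
To see why the Newton-Schulz cost is algorithmic, it helps to look at the iteration itself. The sketch below uses the textbook cubic form X ← 1.5·X − 0.5·(X Xᵀ)·X in pure Python for readability; Muon implementations typically use a tuned higher-order polynomial on accelerator tensors, so treat this as an illustration of the structure, not of the repo's code. Mano's Cayley-transform approach is not shown.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def newton_schulz_orthogonalize(g: Matrix, steps: int = 5) -> Matrix:
    """Approximate the orthogonal polar factor of g.

    Normalizing by the Frobenius norm keeps all singular values in (0, 1],
    where the cubic iteration pushes each of them toward 1.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-7
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt = matmul(x, transpose(x))  # first extra matmul per iteration
        xxtx = matmul(xxt, x)          # second extra matmul per iteration
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x
```

Every optimizer step repeats these matmuls for each weight matrix, on top of the usual momentum update, which is extra work that AdamW never does.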

Architecture Tweaks (Apr 26–27)

  • QK-Norm — RMSNorm on Q,K before attention. New 2B_qknorm variant.
  • Logit Softcapping — FlexAttention score_mod with tanh; falls back to eager on XPU (4x slower — Triton-XPU can’t codegen tanh).
  • ReLU² — Subclassed FeedForward with relu(x)². Didn’t help (3.92 vs 3.80 baseline).
  • WSM — Checkpoint merging utility (eval/merge_checkpoints.py).
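
As a rough illustration of QK-Norm, the sketch below applies RMSNorm to a query and a key vector before the scaled dot product. It is a pure-Python toy without the learnable gain that RMSNorm layers normally carry, and the function names are hypothetical.

```python
import math


def rms_norm(x: list[float], eps: float = 1e-6) -> list[float]:
    """Divide x by its root-mean-square (no learnable gain in this sketch)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def qk_norm_score(q: list[float], k: list[float]) -> float:
    """Attention logit with RMSNorm applied to q and k before the dot product."""
    qn, kn = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(qn, kn)) / math.sqrt(len(q))
```

Because both vectors are normalized, rescaling q or k leaves the logit essentially unchanged and its magnitude cannot exceed sqrt(d), which keeps attention logits bounded.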

Competition Results

Round 1–3: 1000-step speedruns, 2 nodes, GBS=48 (17 configs)

RankConfigLossTPS
1Muon3.5574,556
2AdamW+QK-Norm3.5697,178
3Mano+QK-Norm3.6046,980
4Mano3.6317,048
5AdamW3.8017,245

Round 4 (10B full training, 8 nodes, GBS=384, 5 configs)

RankConfigLossTPS
1AdamW2.7117,354
2AdamW+QK-Norm2.7207,480
3Mano+QK-Norm2.8547,346
4Mano2.8757,429

Round 5 (2 nodes, GAS=8, GBS=384, local dataset, 8 configs — in progress)

  • Mano leading fast configs at step ~600 (3.71 vs AdamW)
  • Softcap showing extreme data efficiency but likely memorizing

Key Discoveries

  • Muon’s 35% speed penalty is inherent to Newton-Schulz, not fixable by implementation
  • QK-Norm helps massively in short runs (+0.23) but washes out at 10B (+0.009)
  • Rankings flip between short/long training: Muon/Mano win short, AdamW wins at scale
  • FlexAttention on XPU falls back to eager (4x slower) — Triton-XPU can’t codegen complex ops
  • HF streaming data shuffle causes ~1.3 loss variance
  • sqrt LR scaling rule is too aggressive — confirmed by divergence at 3.7e-3
  • ReLU² activation hurts this architecture (3.92 vs 3.80 baseline)

Infrastructure Built

  • datasets.py — generic HF dataset streaming + local parquet loading
  • competition/ — configs, submit script, PBS job management
  • eval/merge_checkpoints.py — WSM checkpoint merging
  • docs/competitions/ — per-experiment tracking with loss curves
  • docs/journal.md — session-by-session development log
  • .claude/CLAUDE.md — project rules that travel with the repo
  • 20 upstream syncs tracked in upstream-sync.md
  • Auto-updating loss curve plots via cron