Pre-Training AuroraGPT with TorchTitan
Pre-training AuroraGPT with TorchTitan and ezpz: Last Two Weeks (Apr 12–27, 2026)
Two-Week Summary (Apr 12–27, 2026)
Over ~2 weeks and 291 commits, we built a comprehensive training optimization research platform on Intel XPU (Sunspot), going from basic benchmarking to a systematic optimizer/architecture competition with live experiment tracking.
Three major arcs:
- Benchmarking & Infrastructure (Apr 12–18): Built LR finder, scaling study tooling, and benchmark scripts across 3 ALCF machines. Established baselines for agpt_2b/20b/80b and MoE models. Fixed critical XPU bugs (XCCL barriers, 80B TP=2 OOM). Merged upstream fused QKV, token dispatcher, and expert parallelism.
- Production Readiness (Apr 18–25): Created production training scripts, set up torch 2.13 venv (+23% throughput), ran 2B weak scaling study (2–64 nodes, 94% efficiency at 64N), and started production training runs with checkpointing and W&B tracking.
- Optimizer Competition (Apr 26–27): Launched a systematic optimizer/architecture competition. Implemented 3 new optimizers (Mano, SPAM, TorchMuon), 3 architecture tweaks (QK-Norm, logit softcapping, ReLU²), and ran 40+ experiments across 4 rounds. Built generic HF dataset streaming, local dataset caching, and live loss curve plotting.
Detailed Breakdown
Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes
Benchmarking (Apr 12–15)
- Built `run_benchmarks.sh` with configurable model sweeps
- Ran 2-node benchmarks across all 11 agpt configs and 7 MoE configs on Sunspot, Aurora, and Polaris
- Built comprehensive 80B throughput leaderboard (18 experiments, 5 model variants × TP sweeps)
- Best 80B: 80B_wide TP=4 compile = 68 TPS, 11.8% MFU
- Reports: agpt experiments, MoE experiments, Polaris smoke tests
LR Finder (Apr 12–14)
- Implemented exponential LR sweep with derivative-based blow-up detection
- Ran sweeps for 2B/20B × {AdamW, Muon, SophiaG} across 3 machines
- Added model-size-aware weight init `std = sqrt(2/(5*d))`, which increased Muon’s usable LR range by 3x
- Key results: AdamW 1.3e-3, Muon 2.4e-3, SophiaG 3.1e-4
- Reports: LR finder README, Sunspot results, Aurora results, Polaris results
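
The exponential sweep with derivative-based blow-up detection described above can be sketched roughly as follows. This is a minimal illustration, not the project's LR finder: the function names, the EMA smoothing factor, the slope threshold, and the synthetic loss curve are all assumptions made for the example.

```python
import math


def exponential_lrs(lr_min, lr_max, num_steps):
    """Learning rates spaced evenly in log space from lr_min to lr_max."""
    ratio = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    return [lr_min * ratio**i for i in range(num_steps)]


def find_blowup(lrs, losses, smooth=0.9, slope_threshold=0.5):
    """Return the first step where the smoothed d(loss)/d(log lr) exceeds
    slope_threshold, or None if the loss never blows up.

    smooth and slope_threshold are illustrative defaults (assumptions).
    """
    ema = losses[0]
    for i in range(1, len(losses)):
        prev = ema
        ema = smooth * ema + (1.0 - smooth) * losses[i]
        slope = (ema - prev) / (math.log(lrs[i]) - math.log(lrs[i - 1]))
        if slope > slope_threshold:
            return i
    return None


# Synthetic curve (hypothetical): loss falls slowly for 40 steps, then explodes.
lrs = exponential_lrs(1e-6, 1.0, 61)
losses = [10.0 - 0.1 * i if i < 40 else 6.0 + 5.0 * (i - 40) for i in range(61)]
blowup = find_blowup(lrs, losses)
print(f"blow-up detected at step {blowup}, lr={lrs[blowup]:.2e}")
```

In practice the usable learning rate would be picked some margin below the detected blow-up point; the size of that margin is a tuning choice not covered here.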
Scaling Study (Apr 12)
- Built multi-node scaling study tooling
- Ran weak scaling 1–64 nodes on Sunspot (torch 2.10)
- 2B: 74% efficiency at 64N, 20B: 87% efficiency
- Report: scaling/
Upstream Syncs (Apr 12–18, syncs 6–14)
- Merged 8 upstream batches including fused QKV GQAttention, token dispatcher, expert parallelism, GraphTrainer improvements
- Replayed `llama3/` → `agpt/` and `deepseek_v3/` → `moe/` changes
- Updated attention backend and profiling APIs for upstream compatibility
- Full log: upstream-sync.md
XPU Bug Fixes (Apr 18)
- Fixed SDPA `set_priority=True` dynamo bug (`EzpzScaledDotProductAttention`)
- Fixed IPEX import causing 80B TP=2 OOM on Aurora (+60 MiB overhead): `994efcc1`
- Restored 80B TP=2 on Aurora: 88 TPS, 16% MFU (restoration report)
- Enabled MoE expert parallelism on torch 2.12+: `4e338774`
RL Experiment (Apr 18)
- Added TRL-based GRPO training for sum-of-digits task on XPU: `525d0540`
- No vLLM dependency; uses HF `.generate()` for portability
- Docs: rl/README.md
Week 1.5: Apr 18–25 — Production Readiness
Torch 2.12 Benchmarks (Apr 18)
- Ran comparison benchmarks on Sunspot
- 2B: +11% TPS, -49% memory; 20B: +29% TPS
- Discovered 80B AC+TP regression (DeviceMesh in autograd graph)
- First successful EP (expert parallelism) runs for MoE
- Report: torch 2.12 benchmark
LR Finder Extensions (Apr 20–21)
- Added 80B LR finder (AdamW only — Muon/SophiaG overflow at dim=9216) — 80B LR finder report
- Added MoE LR finder (5 configs × 3 optimizers) — MoE LR finder report
- Added GAS sweep to test gradient accumulation effects
- Documented Muon bf16 overflow limitation for large models
XPU Fixes (Apr 23)
- Fixed torch 2.10 XCCL barrier hangs (gloo fallback): `312045b3`
- Reverted DTensor TP (`use_local_output=False`) because of shape mismatches on XPU
- Merged 3 upstream sync batches (full DTensor TP, GraphTrainer SAC+bucketing)
Torch 2.13 Environment (Apr 25)
- Set up `.venv/` with PyTorch 2.13 built from source for XPU
- Created `ezpz yeet-env` integration for copying to compute nodes
- Created venv-based training scripts for agpt 2B/20B: `392863f9`
- Guide: running-with-newer-pytorch.md
2B Scaling Study on Torch 2.13 (Apr 25)
- Ran weak scaling 2–64 nodes on Sunspot
- 7,142 TPS/GPU at 2N (27.6% MFU) — +23% over torch 2.10
- Near-perfect scaling to 8N, 94% efficiency at 64N
- Report: scaling/agpt-2b.md
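
For readers who want to relate the efficiency figures to throughput, a small sketch follows. It assumes weak-scaling efficiency is defined as per-GPU throughput at N nodes divided by per-GPU throughput at the 2-node baseline; the report does not spell out the definition, so treat that as an assumption. The 64-node value computed here is implied by the quoted 94%, not a separately measured number.

```python
def weak_scaling_efficiency(tps_per_gpu, baseline_tps_per_gpu):
    """Per-GPU throughput relative to the baseline run (assumed definition)."""
    return tps_per_gpu / baseline_tps_per_gpu


def implied_tps_per_gpu(baseline_tps_per_gpu, efficiency):
    """Invert the definition: the per-GPU throughput a given efficiency implies."""
    return baseline_tps_per_gpu * efficiency


BASELINE_2N = 7142.0  # TPS/GPU at 2 nodes, from the report
implied_64n = implied_tps_per_gpu(BASELINE_2N, 0.94)  # derived, not measured
print(f"implied 64N throughput: {implied_64n:.0f} TPS/GPU")
```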
Production Training (Apr 25)
- Created production training scripts with checkpoint management
- Started production runs for 2B (SophiaG) and 20B
- Added per-model tracking with loss curve plots and W&B integration
- Tracking: production/
Week 2: Apr 26–27 — Optimizer Competition
RL Multi-Task Refactor (Apr 26)
- Refactored from hardcoded sum-of-digits to pluggable task registry: `7eab3332`
- `RLTask` dataclass + `register_task()`/`get_task()`
- Added 3 new tasks: multiply, word_sort, countdown
- Code: `rl/tasks/`
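
A pluggable task registry of this shape might look roughly like the sketch below. Only the names `RLTask`, `register_task()`, and `get_task()` come from the notes above; the dataclass fields, the function signatures, and the reward function are hypothetical and chosen only to make the example self-contained.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class RLTask:
    """One RL task: how to build a prompt and how to score a completion.
    Field names are illustrative assumptions, not the project's schema."""
    name: str
    make_example: Callable[[random.Random], Tuple[str, str]]  # -> (prompt, answer)
    reward: Callable[[str, str], float]  # (completion, answer) -> reward


_TASKS: Dict[str, RLTask] = {}


def register_task(task: RLTask) -> RLTask:
    """Add a task to the registry, refusing duplicate names."""
    if task.name in _TASKS:
        raise ValueError(f"task {task.name!r} is already registered")
    _TASKS[task.name] = task
    return task


def get_task(name: str) -> RLTask:
    """Look up a task by name with a helpful error for typos."""
    try:
        return _TASKS[name]
    except KeyError:
        raise KeyError(f"unknown task {name!r}; registered: {sorted(_TASKS)}") from None


def _sum_of_digits_example(rng: random.Random) -> Tuple[str, str]:
    n = rng.randint(10, 99999)
    return f"What is the sum of the digits of {n}?", str(sum(int(c) for c in str(n)))


def _exact_match_reward(completion: str, answer: str) -> float:
    return 1.0 if completion.strip() == answer else 0.0


register_task(RLTask("sum_of_digits", _sum_of_digits_example, _exact_match_reward))
```

With this layout, a new task such as multiply or word_sort only needs its own example generator and reward function plus one `register_task()` call; the training loop can stay unchanged and select the task by name.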
Docs Reorganization (Apr 26)
- Restructured 12 root-level docs into `configs/`, `guides/`, `scaling/`, `production/` (commits `2f47a46a`, `68130bac`, `ce4d079e`, `cb132a54`)
- Fixed all 31 internal cross-references (link checker: 0 broken): `f650e3eb`
Generic HF Dataset Streaming (Apr 26)
- Created `datasets.py`: `register_hf_dataset()` + auto-fallback for arbitrary HF paths (`329bb5cb`)
- `--dataloader.dataset stanfordnlp/imdb` just works, no code changes
- Pre-registered 7 common datasets
- Downloaded FineWeb-Edu 100BT locally (267 GB, `/lus/tegu/projects/datasets/datasets/fineweb-edu-100BT/`): `5cb32036`
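
The registry-plus-fallback idea can be illustrated with a small sketch. Only the name `register_hf_dataset()` comes from the notes above; the `HFDatasetSpec` fields, the `resolve_dataset()` helper, the alias, and the fallback rule (treat any name containing `/` as a Hub repo id) are assumptions for illustration and may differ from the real `datasets.py`.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class HFDatasetSpec:
    """What is needed to open a Hugging Face dataset (hypothetical schema)."""
    path: str                    # Hub repo id, e.g. "org/name"
    name: Optional[str] = None   # dataset config / subset name, if any
    split: str = "train"
    text_column: str = "text"


_REGISTRY: Dict[str, HFDatasetSpec] = {}


def register_hf_dataset(alias: str, spec: HFDatasetSpec) -> None:
    """Pre-register a dataset under a short alias."""
    _REGISTRY[alias] = spec


def resolve_dataset(name: str) -> HFDatasetSpec:
    """Registered aliases win; otherwise fall back to treating `name`
    as an arbitrary Hub path with default settings."""
    if name in _REGISTRY:
        return _REGISTRY[name]
    if "/" in name:
        return HFDatasetSpec(path=name)
    raise KeyError(f"{name!r} is neither a registered alias nor a Hub path")


# Illustrative alias; the seven pre-registered datasets are not listed in the notes.
register_hf_dataset(
    "fineweb-edu",
    HFDatasetSpec(path="HuggingFaceFW/fineweb-edu", name="sample-100BT"),
)

spec = resolve_dataset("stanfordnlp/imdb")
# A spec like this could then be opened with the `datasets` library, e.g.
#   load_dataset(spec.path, spec.name, split=spec.split, streaming=True)
```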
New Optimizers (Apr 26)
- Mano (`optimizer/mano.py`): manifold-normalized, vector-norm ops instead of Newton-Schulz. Runs at AdamW speed (~7,200 TPS) vs Muon’s 4,700. Based on arxiv 2601.23000. Commit: `03af1057`
- SPAM (`optimizer/spam.py`): spike-aware gradient clipping + periodic momentum reset. Based on arxiv 2501.06842. Commit: `03af1057`
- TorchMuon: `_CompositeOptimizer` wrapper for `torch.optim.Muon` (built-in since PyTorch 2.9). Confirmed same speed as custom Muon on XPU; Newton-Schulz overhead is algorithmic, not implementation. Commit: `a382213e`
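
To make the "algorithmic, not implementation" point concrete, here is a minimal NumPy sketch of Newton-Schulz orthogonalization. It uses the classical cubic iteration rather than the tuned polynomial that Muon implementations typically use, and the function name, step count, and test matrix are assumptions. What it shows is that every step costs extra matrix multiplies on top of the usual optimizer update, whichever library runs them.

```python
import numpy as np


def newton_schulz_orthogonalize(G, steps=20):
    """Approximate the orthogonal polar factor of G with the classical
    cubic Newton-Schulz iteration X <- 1.5*X - 0.5*(X X^T) X.

    Dividing by the Frobenius norm first keeps all singular values <= 1,
    which is inside the iteration's convergence region.
    """
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X  # two matmuls per step
    return X


# Small, well-conditioned example matrix (hypothetical "gradient").
G = np.array([[2.0, -1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
Q = newton_schulz_orthogonalize(G)
print(np.round(Q @ Q.T, 6))  # close to the identity matrix
```

An optimizer that replaces this loop with per-vector norm operations, as Mano is described above, avoids those matmuls, which is consistent with it running at roughly AdamW speed.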
Architecture Tweaks (Apr 26–27)
- QK-Norm: RMSNorm on Q,K before attention. New `2B_qknorm` variant. Commit: `e6e666cb`
- Logit Softcapping: FlexAttention `score_mod` with tanh cap at 30.0. Falls back to eager on XPU (4x slower; Triton-XPU can’t codegen `tanh`). Commit: `01b88cfe`, fix: `9d9e7529`
- ReLU²: Subclassed FeedForward with `relu(x)²`. Didn’t help (3.92 vs 3.80 baseline). Commit: `01b88cfe`
- WSM: Checkpoint merging utility (`eval/merge_checkpoints.py`). Commit: `01b88cfe`
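
The three elementwise ideas behind these tweaks are simple enough to write as scalar math. The sketch below is framework-free and purely illustrative: the function names are assumptions, the RMS normalization omits the learned gain a real RMSNorm layer would have, and the actual implementations operate on tensors inside the attention and feed-forward modules.

```python
import math


def softcap(score, cap=30.0):
    """Logit softcapping: squash a score smoothly into (-cap, cap) with tanh.
    Near zero it is almost the identity; large scores saturate at +/-cap."""
    return cap * math.tanh(score / cap)


def relu_squared(x):
    """ReLU^2 activation: relu(x) ** 2."""
    return max(x, 0.0) ** 2


def rms_norm(vec, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain here).
    QK-Norm applies this kind of normalization to queries and keys."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


print(softcap(1.0), softcap(1000.0))
print(relu_squared(-2.0), relu_squared(3.0))
print(rms_norm([3.0, 4.0]))
```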
Competition Results
Round 1–3: Speedrun — 2N, GBS=48, 1000 steps
W&B: aurora_gpt/torchtitan.ezpz.train
Docs: competitions/agpt2b-n2-1000steps/
Jobs: 12465381–85 (completed), 12465386–87 (Muon resubmit, 3h walltime)
Configs: competition/configs.py
| Rank | Config | Loss | TPS |
|---|---|---|---|
| 1 | Muon | 3.557 | 4,556 |
| 2 | AdamW+QK-Norm | 3.569 | 7,178 |
| 3 | Mano+QK-Norm | 3.604 | 6,980 |
| 4 | Mano | 3.631 | 7,048 |
| 5 | AdamW | 3.801 | 7,245 |
10B Full Training — 8N, GBS=384, ~3,178 steps
W&B: aurora_gpt/torchtitan.ezpz.train
Docs: competitions/agpt2b-n8-10BT/
Jobs: 12465430–34 (8 nodes each, 6h walltime)
| Rank | Config | Loss | TPS |
|---|---|---|---|
| 1 | AdamW | 2.711 | 7,354 |
| 2 | AdamW+QK-Norm | 2.720 | 7,480 |
| 3 | Mano+QK-Norm | 2.854 | 7,346 |
| 4 | Mano | 2.875 | 7,429 |
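
The ~3,178-step figure for this round follows from the token budget and the batch size. The sketch below reproduces it under one assumption that the notes do not state: a sequence length of 8,192 tokens, chosen here because it matches the quoted step count.

```python
def steps_for_token_budget(total_tokens, global_batch_size, seq_len):
    """Whole optimizer steps that fit in a token budget."""
    tokens_per_step = global_batch_size * seq_len
    return total_tokens // tokens_per_step


TOTAL_TOKENS = 10_000_000_000  # "10B" training budget
GBS = 384                      # global batch size in sequences, from the heading
SEQ_LEN = 8192                 # assumption: not stated in the notes

steps = steps_for_token_budget(TOTAL_TOKENS, GBS, SEQ_LEN)
print(steps)
```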
Round 4: Reproducible Speedrun — 2N, GAS=8, GBS=384, 1000 steps
W&B: aurora_gpt/torchtitan.ezpz.train
Docs: competitions/agpt2b-n2-gas8-1000steps/
Jobs: 12465460–67 (2 nodes each, 6h walltime)
Dataset: Local FineWeb-Edu 100BT (/lus/tegu/projects/datasets/datasets/fineweb-edu-100BT/)
| Rank | Config | Loss | TPS |
|---|---|---|---|
| 1 | AdamW+QK-Norm | 3.205 | 7,428 |
| 2 | AdamW | 3.220 | 7,397 |
| 3 | Mano | 3.294 | 7,397 |
| 4 | Mano+QK-Norm | 3.307 | 7,423 |
| 5 | Mano (8.5e-4) | 3.328 | 7,348 |
| 6 | AdamW (3.7e-3) | 5.884 | 7,603 |
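
The jump from GBS=48 in the earlier 2-node rounds to GBS=384 in this round matches what 8 gradient-accumulation steps would give on the same hardware. A short sketch of that arithmetic follows; the micro-batch size and data-parallel rank count are placeholders picked only so that they multiply to 48, not values taken from the run configs.

```python
def global_batch_size(micro_batch, data_parallel_ranks, grad_accum_steps=1):
    """Sequences consumed per optimizer step."""
    return micro_batch * data_parallel_ranks * grad_accum_steps


MICRO_BATCH = 2   # hypothetical per-rank micro-batch
DP_RANKS = 24     # hypothetical data-parallel rank count on 2 nodes

gbs_no_accum = global_batch_size(MICRO_BATCH, DP_RANKS)
gbs_gas8 = global_batch_size(MICRO_BATCH, DP_RANKS, grad_accum_steps=8)
print(gbs_no_accum, gbs_gas8)
```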
Key Discoveries
- Muon’s 35% speed penalty is inherent to Newton-Schulz on XPU, not fixable by implementation; confirmed by `torch.optim.Muon` matching custom Muon TPS. Commit: `a382213e`
- QK-Norm lowers loss by 0.23 in short runs (3.569 vs 3.801) but washes out at 10B, where plain AdamW is ahead by 0.009 (2.711 vs 2.720). It helps again in the round 4 decay phase (0.015 lower: 3.205 vs 3.220)
- Rankings flip between short/long training: Muon/Mano win speedruns, AdamW wins in cosine decay phase at scale
- FlexAttention on XPU falls back to eager (4x slower): TorchInductor does not support code generation for complex operators
- HF streaming data shuffle causes ~1.3 loss variance between runs. Commit: `16aa4d46`
- sqrt LR scaling rule is too aggressive: AdamW at 3.7e-3 diverged (confirmed in round 4, final loss 5.884)
- ReLU² hurts this architecture (3.92 vs 3.80 baseline)
- Local dataset loader + softcap has a data sharding bug causing memorization at small GBS
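
For context on the sqrt LR scaling finding, the rule scales the learning rate by the square root of the batch-size ratio. The sketch below assumes the base point is the LR-finder AdamW value of 1.3e-3 at GBS=48; the notes do not state which batch size the finder used, so that pairing is an assumption. Under it, the rule lands on roughly the 3.7e-3 that diverged.

```python
import math


def sqrt_scaled_lr(base_lr, base_gbs, new_gbs):
    """Square-root LR scaling: lr grows with sqrt(new_gbs / base_gbs)."""
    return base_lr * math.sqrt(new_gbs / base_gbs)


BASE_LR = 1.3e-3  # AdamW LR-finder result from the notes
BASE_GBS = 48     # assumption: batch size the base LR corresponds to
NEW_GBS = 384     # round 4 global batch size

scaled = sqrt_scaled_lr(BASE_LR, BASE_GBS, NEW_GBS)
print(f"{scaled:.2e}")
```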
Infrastructure Built
| Feature | File | Commit |
|---|---|---|
| Generic HF datasets | datasets.py | 329bb5cb |
| Competition configs | competition/configs.py | 9970fec8 |
| PBS submit script | competition/submit_run.sh | 9970fec8 |
| WSM checkpoint merging | eval/merge_checkpoints.py | 01b88cfe |
| Training plot utility | eval/plot_training.py | 7b9ff388 |
| Per-experiment tracking | docs/competitions/ | e96ba34c |
| Development journal | docs/journal.md | fcca79d8 |
| Project rules | .claude/CLAUDE.md | 47dc5e8b |
| RL task registry | rl/tasks/ | 7eab3332 |
| Upstream sync log | docs/upstream-sync.md | 20 entries |
First Draft
High-Level
Over ~2 weeks and 291 commits, we built a comprehensive training optimization research platform on Intel XPU (Sunspot), going from basic benchmarking to a systematic optimizer/architecture competition with live experiment tracking.
Three major arcs:
- Benchmarking & Infrastructure (Apr 12–18): Built LR finder, scaling study tooling, and benchmark scripts across 3 ALCF machines. Established baselines for agpt_2b/20b/80b and MoE models. Fixed critical XPU bugs (XCCL barriers, 80B TP=2 OOM). Merged upstream fused QKV, token dispatcher, and expert parallelism.
- Production Readiness (Apr 18–25): Created production training scripts, set up torch 2.13 venv (+23% throughput), ran 2B weak scaling study (2–64 nodes, 94% efficiency at 64N), and started production training runs with checkpointing and W&B tracking.
- Optimizer Competition (Apr 26–27): Launched a systematic optimizer/architecture competition. Implemented 3 new optimizers (Mano, SPAM, TorchMuon), 3 architecture tweaks (QK-Norm, logit softcapping, ReLU²), and ran 40+ experiments across 4 rounds. Built generic HF dataset streaming, local dataset caching, and live loss curve plotting.
Detailed Breakdown
Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes
Benchmarking (Apr 12–15)
- Built `run_benchmarks.sh` with configurable model sweeps
- Ran 2-node benchmarks across all 11 agpt configs and 7 MoE configs on Sunspot, Aurora, and Polaris
- Built comprehensive 80B throughput leaderboard (18 experiments, 5 model variants × TP sweeps)
- Best 80B: `80B_wide TP=4 compile` = 68 TPS, 11.8% MFU
LR Finder (Apr 12–14)
- Implemented exponential LR sweep with derivative-based blow-up detection
- Ran sweeps for 2B/20B × {AdamW, Muon, SophiaG}
- Added model-size-aware weight init (`std = sqrt(2/(5*d))`), which increased Muon’s usable LR range by 3x
- Key results: AdamW 1.3e-3, Muon 2.4e-3, SophiaG 3.x
Scaling Study (Apr 12)
- Built multi-node scaling study tooling
- Ran weak scaling 1–64 nodes on Sunspot (torch 2.10)
- 2B: 74% efficiency at 64N, 20B: 87% efficiency
Upstream Syncs (Apr 12–18, syncs 6–14)
- Merged 8 upstream batches including fused QKV GQA, token dispatcher, expert parallelism, GraphTrainer improvements
- Replayed llama3 → agpt and deepseek_v3 → moe changes
- Updated attention backend and profiling APIs for upstream compatibility
XPU Bug Fixes (Apr 18)
- Fixed SDPA `set_priority=True` dynamo bug (`EzpzScaleOn`)
- Fixed IPEX import causing 80B TP=2 OOM on Aurora (+60 MiB overhead)
- Restored 80B TP=2 on Aurora: 88 TPS, 16% MFU
- Enabled MoE expert parallelism on torch 2.12+
RL Experiment (Apr 18)
- Added TRL-based GRPO training for sum-of-digits task
- No vLLM dependency; uses HF `.generate()` for portability
Week 1.5: Apr 18–25 — Production Readiness
Torch 2.12 Benchmarks (Apr 18)
- Ran comparison benchmarks on Sunspot
- 2B: +11% TPS, -49% memory; 20B: +29% TPS
- Discovered 80B AC+TP regression (DeviceMesh in autocast)
- First successful EP (expert parallelism) runs for MoE
LR Finder Extensions (Apr 20–21)
- Added 80B LR finder (AdamW only — Muon/SophiaG overflow)
- Added MoE LR finder (5 configs × 3 optimizers)
- Added GAS sweep to test gradient accumulation effects
- Documented Muon bf16 overflow limitation for large models
XPU Fixes (Apr 23)
- Fixed torch 2.10 XCCL barrier hangs (gloo fallback)
- Reverted DTensor TP (`use_local_output=False`) because of shape mismatches on XPU
- Merged 3 upstream sync batches (full DTensor TP, GQA bucketing)
Torch 2.13 Environment (Apr 25)
- Set up `.venv/` with PyTorch 2.13 built from source for XPU
- Created `ezpz` yeet-env integration for copying to compute nodes
- Created venv-based training scripts for agpt 2B/20B
2B Scaling Study on Torch 2.13 (Apr 25)
- Ran weak scaling 2–64 nodes on Sunspot
- 7,142 TPS/GPU at 2N (27.6% MFU) — +23% over torch 2.10
- Near-perfect scaling to 8N, 94% efficiency at 64N
Production Training (Apr 25)
- Created production training scripts with checkpoint management
- Started production runs for 2B (SophiaG) and 20B
- Added per-model tracking with loss curve plots and W&B integration
Week 2: Apr 26–27 — Optimizer Competition
RL Multi-Task Refactor (Apr 26)
- Refactored from hardcoded sum-of-digits to pluggable `RLTask` dataclass + `register_task()`/`get_task()`
- Added 3 new tasks: multiply, word_sort, countdown
- Added CLI args to `train_grpo.py`
Docs Reorganization (Apr 26)
- Restructured 12 root-level docs into `configs/`, `guides/`, `production/`
- Renamed `production-training/` → `production/`
- Consolidated 4 scaling/benchmark files into per-model structure
- Fixed all 31 internal cross-references (link checker: 0 broken)
Generic HF Dataset Streaming (Apr 26)
- Created `datasets.py`: `register_hf_dataset()` + auto-registry for arbitrary HF paths
- `--dataloader.dataset stanfordnlp/imdb` just works, no code changes
- Pre-registered 7 common datasets
- Downloaded FineWeb-Edu 100BT locally (267 GB) for reproducible runs
New Optimizers (Apr 26)
- Mano (`optimizer/mano.py`): manifold-normalized, uses Cayley transform instead of Newton-Schulz. Runs at AdamW speed (~7,200 TPS) vs Muon’s 4,700. Based on arxiv 2601.23000.
- SPAM (`optimizer/spam.py`): spike-aware gradient clipping with momentum reset. Based on arxiv 2501.06842.
- TorchMuon: `_CompositeOptimizer` wrapper for `torch.optim.Muon` (built-in since PyTorch 2.9). Confirmed same speed as custom Muon on XPU; Newton-Schulz overhead is algorithmic, not implementation.
Architecture Tweaks (Apr 26–27)
- QK-Norm: RMSNorm on Q,K before attention. New `2B_qknorm` variant.
- Logit Softcapping: FlexAttention `score_mod` with tanh; falls back to eager on XPU (4x slower; Triton-XPU can’t codegen tanh).
- ReLU²: Subclassed FeedForward with `relu(x)²`. Didn’t help (3.92 vs 3.80 baseline).
- WSM: Checkpoint merging utility (`eval/merge_checkpoints.py`).
Competition Results
Round 1–3: 1000-step speedruns, 2 nodes, GBS=48 (17 configs)
| Rank | Config | Loss | TPS |
|---|---|---|---|
| 1 | Muon | 3.557 | 4,556 |
| 2 | AdamW+QK-Norm | 3.569 | 7,178 |
| 3 | Mano+QK-Norm | 3.604 | 6,980 |
| 4 | Mano | 3.631 | 7,048 |
| 5 | AdamW | 3.801 | 7,245 |
Round 4 (10B full training, 8 nodes, GBS=384, 5 configs)
| Rank | Config | Loss | TPS |
|---|---|---|---|
| 1 | AdamW | 2.711 | 7,354 |
| 2 | AdamW+QK-Norm | 2.720 | 7,480 |
| 3 | Mano+QK-Norm | 2.854 | 7,346 |
| 4 | Mano | 2.875 | 7,429 |
Round 5 (2 nodes, GAS=8, GBS=384, local dataset, 8 configs — in progress)
- Mano leading fast configs at step ~600 (3.71 vs AdamW)
- Softcap showing extreme data efficiency but likely memorizing
Key Discoveries
- Muon’s 35% speed penalty is inherent to Newton-Schulz, not fixable by implementation
- QK-Norm helps massively in short runs (+0.23) but washes out at 10B (+0.009)
- Rankings flip between short/long training: Muon/Mano win short, AdamW wins at scale
- FlexAttention on XPU falls back to eager (4x slower) — Triton-XPU can’t codegen complex ops
- HF streaming data shuffle causes ~1.3 loss variance
- `sqrt` LR scaling rule is too aggressive; confirmed by divergence at 3.7e-3
- ReLU² activation hurts this architecture (3.92 vs 3.80 baseline)
Infrastructure Built
- `datasets.py`: generic HF dataset streaming + local parquet loading
- `competition/`: configs, submit script, PBS job management
- `eval/merge_checkpoints.py`: WSM checkpoint merging
- `docs/competitions/`: per-experiment tracking with loss curves
- `docs/journal.md`: session-by-session development log
- `.claude/CLAUDE.md`: project rules that travel with the repo
- 20 upstream syncs tracked in `upstream-sync.md`
- Auto-updating loss curve plots via cron