Pre-Training AuroraGPT with TorchTitan

Sam Foreman, Apr 27, 2026

Pre-training AuroraGPT with TorchTitan and ezpz: Last Two Weeks (Apr 12–27, 2026)

Two-Week Summary (Apr 12–27, 2026)

Over roughly two weeks and 291 commits, we built a training-optimization research platform on Intel XPU (Sunspot), going from basic benchmarking to a systematic optimizer/architecture competition with live experiment tracking.

Three major arcs:

Benchmarking & Infrastructure (Apr 12–18): Built LR finder, scaling study tooling, and benchmark scripts across 3 ALCF machines. Established baselines for agpt_2b/20b/80b and MoE models. Fixed critical XPU bugs (XCCL barriers, 80B TP=2 OOM). Merged upstream fused QKV, token dispatcher, and expert parallelism.
Production Readiness (Apr 18–25): Created production training scripts, set up torch 2.13 venv (+23% throughput), ran 2B weak scaling study (2–64 nodes, 94% efficiency at 64N), and started production training runs with checkpointing and W&B tracking.
Optimizer Competition (Apr 26–27): Launched a systematic optimizer/architecture competition. Implemented 3 new optimizers (Mano, SPAM, TorchMuon), 3 architecture tweaks (QK-Norm, logit softcapping, ReLU²), and ran 40+ experiments across 4 rounds. Built generic HF dataset streaming, local dataset caching, and live loss curve plotting.


Detailed Breakdown

Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes

Benchmarking (Apr 12–15)

Built run_benchmarks.sh with configurable model sweeps
Ran 2-node benchmarks across all 11 agpt configs and 7 MoE configs on Sunspot, Aurora, and Polaris
Built comprehensive 80B throughput leaderboard (18 experiments, 5 model variants × TP sweeps)
Best 80B: 80B_wide TP=4 compile = 68 TPS, 11.8% MFU
Reports: agpt experiments, MoE experiments, Polaris smoke tests

LR Finder (Apr 12–14)

Implemented exponential LR sweep with derivative-based blow-up detection
Ran sweeps for 2B/20B × {AdamW, Muon, SophiaG} across 3 machines
Added model-size-aware weight init std = sqrt(2/(5*d)) — increased Muon’s usable LR range by 3x
Key results: AdamW 1.3e-3, Muon 2.4e-3, SophiaG 3.1e-4
Reports: LR finder README, Sunspot results, Aurora results, Polaris results
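
To make the idea of an exponential sweep with derivative-based blow-up detection concrete, here is a minimal sketch. The function names, the smoothing factor, the slope threshold, and the safety factor are all illustrative assumptions; this is not the repository's LR finder.

```python
import math


def exponential_lr_schedule(lr_min, lr_max, num_steps):
    """Learning rates spaced evenly in log-space from lr_min to lr_max."""
    ratio = lr_max / lr_min
    return [lr_min * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


def find_blowup_step(lrs, losses, beta=0.8, slope_threshold=1.0):
    """First step where d(smoothed loss) / d(log10 lr) exceeds the threshold.

    Returns None if the smoothed loss never rises that steeply.
    """
    smoothed = []
    avg = losses[0]
    for loss in losses:
        avg = beta * avg + (1.0 - beta) * loss  # exponential moving average
        smoothed.append(avg)
    for i in range(1, len(lrs)):
        d_log_lr = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (smoothed[i] - smoothed[i - 1]) / d_log_lr
        if slope > slope_threshold:
            return i
    return None


def suggest_lr(lrs, losses, safety_factor=0.1):
    """Back off from the blow-up point by a (hypothetical) safety factor."""
    step = find_blowup_step(lrs, losses)
    if step is None:
        return lrs[-1] * safety_factor
    return lrs[step] * safety_factor
```

In a real sweep, `losses` would come from one training step per learning rate; the smoothing keeps a single noisy batch from triggering the detector.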

Scaling Study (Apr 12)

Built multi-node scaling study tooling
Ran weak scaling 1–64 nodes on Sunspot (torch 2.10)
2B: 74% efficiency at 64N, 20B: 87% efficiency
Report: scaling/

Upstream Syncs (Apr 12–18, syncs 6–14)

Merged 8 upstream batches including fused QKV GQAttention, token dispatcher, expert parallelism, GraphTrainer improvements
Replayed llama3/ → agpt/ and deepseek_v3/ → moe/ changes
Updated attention backend and profiling APIs for upstream compatibility
Full log: upstream-sync.md

XPU Bug Fixes (Apr 18)

Fixed SDPA set_priority=True dynamo bug — EzpzScaledDotProductAttention
Fixed IPEX import causing 80B TP=2 OOM on Aurora (+60 MiB overhead) — 994efcc1
Restored 80B TP=2 on Aurora: 88 TPS, 16% MFU — restoration report
Enabled MoE expert parallelism on torch 2.12+ — 4e338774

RL Experiment (Apr 18)

Added TRL-based GRPO training for sum-of-digits task on XPU — 525d0540
No vLLM dependency — uses HF .generate() for portability
Docs: rl/README.md

Week 1.5: Apr 18–25 — Production Readiness

Torch 2.12 Benchmarks (Apr 18)

Ran comparison benchmarks on Sunspot
2B: +11% TPS, -49% memory; 20B: +29% TPS
Discovered 80B AC+TP regression (DeviceMesh in autograd graph)
First successful EP (expert parallelism) runs for MoE
Report: torch 2.12 benchmark

LR Finder Extensions (Apr 20–21)

Added 80B LR finder (AdamW only — Muon/SophiaG overflow at dim=9216) — 80B LR finder report
Added MoE LR finder (5 configs × 3 optimizers) — MoE LR finder report
Added GAS sweep to test gradient accumulation effects
Documented Muon bf16 overflow limitation for large models

XPU Fixes (Apr 23)

Fixed torch 2.10 XCCL barrier hangs (gloo fallback) — 312045b3
Reverted DTensor TP (use_local_output=False) — shape mismatches on XPU
Merged 3 upstream sync batches (full DTensor TP, GraphTrainer SAC+bucketing)

Torch 2.13 Environment (Apr 25)

Set up .venv/ with PyTorch 2.13 built from source for XPU
Created ezpz yeet-env integration for copying to compute nodes
Created venv-based training scripts for agpt 2B/20B — 392863f9
Guide: running-with-newer-pytorch.md

2B Scaling Study on Torch 2.13 (Apr 25)

Ran weak scaling 2–64 nodes on Sunspot
7,142 TPS/GPU at 2N (27.6% MFU) — +23% over torch 2.10
Near-perfect scaling to 8N, 94% efficiency at 64N
Report: scaling/agpt-2b.md
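
For reference, weak-scaling efficiency is taken here to mean per-GPU throughput at N nodes divided by per-GPU throughput at the smallest node count. This definition is an assumption about how the report computes it. The sketch below uses the 7,142 TPS/GPU figure from above; the 64-node throughput is not a measurement but the value implied by the reported 94% efficiency.

```python
def weak_scaling_efficiency(tps_per_gpu, base_nodes=None):
    """Efficiency per node count, relative to the base node count.

    With weak scaling the per-GPU workload is fixed, so ideal scaling keeps
    per-GPU throughput constant: efficiency(N) = tps(N) / tps(base).
    """
    if base_nodes is None:
        base_nodes = min(tps_per_gpu)
    base = tps_per_gpu[base_nodes]
    return {n: tps / base for n, tps in sorted(tps_per_gpu.items())}


base_tps = 7142.0              # TPS/GPU at 2 nodes (reported above)
implied_64n = 0.94 * base_tps  # derived from the reported 94%, not measured

for nodes, eff in weak_scaling_efficiency({2: base_tps, 64: implied_64n}).items():
    print(f"{nodes:>3} nodes: {eff:.1%} efficiency")
```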

Production Training (Apr 25)

Created production training scripts with checkpoint management
Started production runs for 2B (SophiaG) and 20B
Added per-model tracking with loss curve plots and W&B integration
Tracking: production/

Week 2: Apr 26–27 — Optimizer Competition

RL Multi-Task Refactor (Apr 26)

Refactored from hardcoded sum-of-digits to pluggable task registry — 7eab3332
RLTask dataclass + register_task() / get_task()
Added 3 new tasks: multiply, word_sort, countdown
Code: rl/tasks/
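
A pluggable registry of this kind might look like the sketch below. Only the names `RLTask`, `register_task()`, and `get_task()` come from the notes above; the dataclass fields, the signatures, and the sum-of-digits prompt format are hypothetical.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RLTask:
    """One RL task: how to build a prompt and how to score a completion."""
    name: str
    make_prompt: Callable[[int], str]    # seed -> prompt text (hypothetical field)
    reward: Callable[[str, str], float]  # (prompt, completion) -> reward (hypothetical field)


_TASKS = {}  # task name -> RLTask


def register_task(task):
    if task.name in _TASKS:
        raise ValueError(f"task {task.name!r} is already registered")
    _TASKS[task.name] = task
    return task


def get_task(name):
    try:
        return _TASKS[name]
    except KeyError:
        raise KeyError(f"unknown task {name!r}; available: {sorted(_TASKS)}") from None


def _sum_digits_prompt(seed):
    number = random.Random(seed).randint(10, 99999)
    return f"What is the sum of the digits of {number}?"


def _sum_digits_reward(prompt, completion):
    number = re.search(r"\d+", prompt).group()
    expected = sum(int(ch) for ch in number)
    answers = re.findall(r"-?\d+", completion)
    # Score the last integer in the completion as the model's answer.
    return 1.0 if answers and int(answers[-1]) == expected else 0.0


register_task(RLTask("sum_of_digits", _sum_digits_prompt, _sum_digits_reward))
```

With this shape, adding a task such as `multiply` or `word_sort` means registering one more `RLTask` rather than editing the training loop.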

Docs Reorganization (Apr 26)

Restructured 12 root-level docs into configs/, guides/, scaling/, production/ — commits 2f47a46a, 68130bac, ce4d079e, cb132a54
Fixed all 31 internal cross-references (link checker: 0 broken) — f650e3eb

Generic HF Dataset Streaming (Apr 26)

Created datasets.py — register_hf_dataset() + auto-fallback for arbitrary HF paths — 329bb5cb
Passing --dataloader.dataset stanfordnlp/imdb works with no code changes
Pre-registered 7 common datasets
Downloaded FineWeb-Edu 100BT locally (267 GB, /lus/tegu/projects/datasets/datasets/fineweb-edu-100BT/) — 5cb32036
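
The register-then-fall-back behavior can be sketched as follows. Only `register_hf_dataset()` is named above; its signature here, the `HFDatasetSpec` dataclass, `resolve_dataset()`, `stream_texts()`, and the `fineweb-edu` alias are assumptions for illustration. Loading uses the Hugging Face `datasets.load_dataset` call in streaming mode.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HFDatasetSpec:
    path: str                     # Hugging Face hub path
    text_key: str = "text"        # column holding the raw text
    split: str = "train"
    config: Optional[str] = None  # dataset config / subset name


_REGISTRY = {}  # short name -> HFDatasetSpec


def register_hf_dataset(name, path, *, text_key="text", split="train", config=None):
    spec = HFDatasetSpec(path=path, text_key=text_key, split=split, config=config)
    _REGISTRY[name] = spec
    return spec


def resolve_dataset(name_or_path):
    """Use a registered spec if present; otherwise treat the argument as a hub path."""
    if name_or_path in _REGISTRY:
        return _REGISTRY[name_or_path]
    return HFDatasetSpec(path=name_or_path)  # auto-fallback with default settings


def stream_texts(name_or_path):
    """Yield raw text rows without downloading the full dataset."""
    from datasets import load_dataset  # imported lazily so the registry works without it

    spec = resolve_dataset(name_or_path)
    rows = load_dataset(spec.path, name=spec.config, split=spec.split, streaming=True)
    for row in rows:
        yield row[spec.text_key]


# Hypothetical alias; the seven pre-registered names are not listed in this post.
register_hf_dataset("fineweb-edu", "HuggingFaceFW/fineweb-edu", config="sample-100BT")
```

The fallback assumes a `train` split and a `text` column, which is why an unregistered path like `stanfordnlp/imdb` can work while datasets with other layouts would still need an explicit registration.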

New Optimizers (Apr 26)

Mano (optimizer/mano.py) — manifold-normalized, vector-norm ops instead of Newton-Schulz. Runs at AdamW speed (~7,200 TPS) vs Muon’s 4,700. Based on arxiv 2601.23000. Commit: 03af1057
SPAM (optimizer/spam.py) — spike-aware gradient clipping + periodic momentum reset. Based on arxiv 2501.06842. Commit: 03af1057
TorchMuon — _CompositeOptimizer wrapper for torch.optim.Muon (built-in since PyTorch 2.9). Confirmed same speed as custom Muon on XPU — Newton-Schulz overhead is algorithmic, not implementation. Commit: a382213e
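
As a rough picture of what "spike-aware gradient clipping + periodic momentum reset" means, here is a simplified pure-Python sketch built on a plain Adam update. It is based only on the one-line description above and is not the code in `optimizer/spam.py`. The class name, the threshold, and the reset interval are illustrative assumptions, and it omits details such as any learning-rate warmup after a reset.

```python
import math


def spike_aware_clip(grads, second_moment, threshold):
    """Clip entries whose squared gradient exceeds threshold * second-moment estimate."""
    clipped = []
    for g, v in zip(grads, second_moment):
        if v > 0.0 and g * g > threshold * v:
            # Keep the sign, shrink the magnitude to sqrt(threshold * v).
            g = math.copysign(math.sqrt(threshold * v), g)
        clipped.append(g)
    return clipped


class SpamLikeAdam:
    """Adam over a flat list of floats, with spike clipping and periodic moment reset."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 spike_threshold=5000.0, reset_interval=500):
        self.params = list(params)
        self.lr, self.betas, self.eps = lr, betas, eps
        self.spike_threshold = spike_threshold  # illustrative value
        self.reset_interval = reset_interval    # illustrative value
        self.m = [0.0] * len(self.params)
        self.v = [0.0] * len(self.params)
        self.t = 0            # steps since the last reset (for bias correction)
        self.total_steps = 0

    def step(self, grads):
        if self.total_steps > 0 and self.total_steps % self.reset_interval == 0:
            # Periodic momentum reset: drop both moment estimates.
            self.m = [0.0] * len(self.params)
            self.v = [0.0] * len(self.params)
            self.t = 0
        self.total_steps += 1
        self.t += 1
        grads = spike_aware_clip(grads, self.v, self.spike_threshold)
        b1, b2 = self.betas
        for i, g in enumerate(grads):
            self.m[i] = b1 * self.m[i] + (1.0 - b1) * g
            self.v[i] = b2 * self.v[i] + (1.0 - b2) * g * g
            m_hat = self.m[i] / (1.0 - b1 ** self.t)
            v_hat = self.v[i] / (1.0 - b2 ** self.t)
            self.params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
        return self.params
```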

Architecture Tweaks (Apr 26–27)

QK-Norm — RMSNorm on Q,K before attention. New 2B_qknorm variant. Commit: e6e666cb
Logit Softcapping — FlexAttention score_mod with tanh cap at 30.0. Falls back to eager on XPU (4x slower — Triton-XPU can’t codegen tanh). Commit: 01b88cfe, fix: 9d9e7529
ReLU² — Subclassed FeedForward with relu(x)². Didn’t help (3.92 vs 3.80 baseline). Commit: 01b88cfe
WSM — Checkpoint merging utility (eval/merge_checkpoints.py). Commit: 01b88cfe
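
The three tweaks are small enough to write out as scalar math. The sketch below is a plain-Python illustration rather than the repository's modules: the RMSNorm has no learned scale, and the function names are hypothetical. The cap of 30.0 is the value quoted above.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale: x / sqrt(mean(x^2) + eps)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms for v in x]


def softcap(score, cap=30.0):
    """Logit softcapping: squashes a score smoothly into [-cap, cap]."""
    return cap * math.tanh(score / cap)


def relu_squared(x):
    """ReLU² activation: relu(x) ** 2."""
    r = max(x, 0.0)
    return r * r


def attention_score(q, k, use_qk_norm=True, cap=30.0):
    """Scaled dot-product score for one (q, k) pair with QK-Norm and softcapping."""
    if use_qk_norm:
        q, k = rms_norm(q), rms_norm(k)  # normalize Q and K before the dot product
    score = sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
    return softcap(score, cap)
```

For small scores `softcap` is close to the identity, so it only changes behavior when logits grow large.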

Competition Results

Round 1–3: Speedrun — 2N, GBS=48, 1000 steps

W&B: aurora_gpt/torchtitan.ezpz.train
Docs: competitions/agpt2b-n2-1000steps/
Jobs: 12465381–85 (completed), 12465386–87 (Muon resubmit, 3h walltime)
Configs: competition/configs.py

Rank	Config	Loss	TPS
1	Muon	3.557	4,556
2	AdamW+QK-Norm	3.569	7,178
3	Mano+QK-Norm	3.604	6,980
4	Mano	3.631	7,048
5	AdamW	3.801	7,245

10B Full Training — 8N, GBS=384, ~3,178 steps

W&B: aurora_gpt/torchtitan.ezpz.train
Docs: competitions/agpt2b-n8-10BT/
Jobs: 12465430–34 (8 nodes each, 6h walltime)

Rank	Config	Loss	TPS
1	AdamW	2.711	7,354
2	AdamW+QK-Norm	2.720	7,480
3	Mano+QK-Norm	2.854	7,346
4	Mano	2.875	7,429
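
The step count for this run follows from the token budget and the tokens consumed per optimizer step. In the sketch below, the sequence length of 8192 is an assumption: it is not stated in this post, but it is the value that reproduces the ~3,178 steps for a 10B-token budget at GBS=384.

```python
def tokens_per_step(global_batch_size, seq_len):
    """Tokens consumed by one optimizer step."""
    return global_batch_size * seq_len


def steps_for_token_budget(token_budget, global_batch_size, seq_len):
    """Number of full optimizer steps that fit in the token budget."""
    return token_budget // tokens_per_step(global_batch_size, seq_len)


SEQ_LEN = 8192  # assumed, see the note above
steps = steps_for_token_budget(10_000_000_000, global_batch_size=384, seq_len=SEQ_LEN)
print(f"{tokens_per_step(384, SEQ_LEN):,} tokens/step, {steps:,} steps")
```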

Round 4: Reproducible Speedrun — 2N, GAS=8, GBS=384, 1000 steps

W&B: aurora_gpt/torchtitan.ezpz.train
Docs: competitions/agpt2b-n2-gas8-1000steps/
Jobs: 12465460–67 (2 nodes each, 6h walltime)
Dataset: Local FineWeb-Edu 100BT (/lus/tegu/projects/datasets/datasets/fineweb-edu-100BT/)

Rank	Config	Loss	TPS
1	AdamW+QK-Norm	3.205	7,428
2	AdamW	3.220	7,397
3	Mano	3.294	7,397
4	Mano+QK-Norm	3.307	7,423
5	Mano (8.5e-4)	3.328	7,348
6	AdamW (3.7e-3)	5.884	7,603
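
The jump from GBS=48 to GBS=384 on the same two nodes comes from gradient accumulation. The sketch below shows the arithmetic under stated assumptions: 12 ranks per node (6 GPUs with 2 tiles each), a micro-batch of 2 sequences per rank, and pure data parallelism. None of these are given in this post; they are chosen because they reproduce both batch sizes.

```python
def global_batch_size(nodes, ranks_per_node, micro_batch, grad_accum_steps=1):
    """Sequences per optimizer step under pure data parallelism."""
    return nodes * ranks_per_node * micro_batch * grad_accum_steps


def grad_accum_for_target(target_gbs, nodes, ranks_per_node, micro_batch):
    """Gradient accumulation steps needed to reach a target global batch size."""
    per_micro_step = nodes * ranks_per_node * micro_batch
    if target_gbs % per_micro_step != 0:
        raise ValueError(
            f"GBS={target_gbs} is not a multiple of {per_micro_step} sequences per micro-step"
        )
    return target_gbs // per_micro_step


RANKS_PER_NODE = 12  # assumed
MICRO_BATCH = 2      # assumed

print(global_batch_size(2, RANKS_PER_NODE, MICRO_BATCH))                      # rounds 1-3 setup
print(global_batch_size(2, RANKS_PER_NODE, MICRO_BATCH, grad_accum_steps=8))  # round 4 setup
```

Holding GBS at 384 on 2 nodes makes the speedrun's optimization settings directly comparable to the 8-node run, at the cost of 8x more wall-clock time per step.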

Key Discoveries

Muon’s 35% speed penalty is inherent to Newton-Schulz on XPU, not fixable by implementation — confirmed by torch.optim.Muon matching custom Muon TPS. Commit: a382213e
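QK-Norm helps substantially in short runs (loss 3.801 → 3.569, a 0.23 improvement), but the gain disappears at 10B tokens, where it is 0.009 worse (2.720 vs 2.711). It reappears in the round 4 decay phase (3.205 vs 3.220, a 0.015 improvement)
Rankings flip between short and long training: Muon and Mano beat plain AdamW in the GBS=48 speedruns, while AdamW wins in the cosine decay phase at scale
FlexAttention on XPU falls back to eager (4x slower) because TorchInductor does not support code generation for complex operators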
HF streaming data shuffle causes ~1.3 loss variance between runs — commit: 16aa4d46
The sqrt LR scaling rule is too aggressive: AdamW at 3.7e-3 diverged (loss 5.884 in round 4)
ReLU² hurts this architecture (3.92 vs 3.80 baseline)
Local dataset loader + softcap has a data sharding bug causing memorization at small GBS
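
For context on the sqrt LR scaling finding, the rule multiplies the base learning rate by the square root of the batch-size ratio. The sketch below assumes the AdamW base LR was the 1.3e-3 LR-finder result at GBS=48, which this post does not state outright. Under that assumption, scaling to GBS=384 gives roughly the 3.7e-3 that diverged.

```python
import math


def scale_lr(base_lr, base_gbs, new_gbs, rule="sqrt"):
    """Scale a learning rate when the global batch size changes."""
    ratio = new_gbs / base_gbs
    if rule == "sqrt":
        return base_lr * math.sqrt(ratio)
    if rule == "linear":
        return base_lr * ratio
    raise ValueError(f"unknown scaling rule: {rule!r}")


# Assumed base: AdamW at 1.3e-3 with GBS=48; target: GBS=384 (an 8x larger batch).
print(f"sqrt rule:   {scale_lr(1.3e-3, 48, 384):.2e}")
print(f"linear rule: {scale_lr(1.3e-3, 48, 384, rule='linear'):.2e}")
```

The linear rule would push the learning rate even higher, so a divergence under sqrt scaling suggests a more conservative adjustment when increasing GBS through gradient accumulation.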

Infrastructure Built

Feature	File	Commit
Generic HF datasets	`datasets.py`	`329bb5cb`
Competition configs	`competition/configs.py`	`9970fec8`
PBS submit script	`competition/submit_run.sh`	`9970fec8`
WSM checkpoint merging	`eval/merge_checkpoints.py`	`01b88cfe`
Training plot utility	`eval/plot_training.py`	`7b9ff388`
Per-experiment tracking	`docs/competitions/`	`e96ba34c`
Development journal	`docs/journal.md`	`fcca79d8`
Project rules	`.claude/CLAUDE.md`	`47dc5e8b`
RL task registry	`rl/tasks/`	`7eab3332`
Upstream sync log	`docs/upstream-sync.md`	20 entries
