AuroraGPT: Training Foundation Models on Supercomputers
Sam Foreman — Argonne National Laboratory 2025-12-16
samforeman.me/talks/2025/12/16/slides
AuroraGPT: Toolbox
- Datasets and data pipelines (how do we deal with scientific data?)
- Software infrastructure and workflows (scalable, robust, extensible)
- Evaluation of state-of-the-art LLMs (how do they perform on scientific tasks?)
Team Leads
- Planning: Rick Stevens (lead), Ian Foster, Rinku Gupta, Mike Papka, Arvind Ramanathan, Fangfang Xia
- Data: Ian Foster, Robert Underwood
- Training: Venkat Vishwanath, Sam Foreman
- Evaluation: Franck Cappello, Sandeep Madireddy, Bo Li
- Post: Eliu Huerta, Azton Wells
- Inference: Rajeev Thakur
- Comms: Charlie Catlett, David Martin
- Distribution: Brad Ullrich
Teams
- Planning
- Data Prep
- Accumulate 20+ T tokens of high-quality scientific text and structured data
- Models / Training (co-led: Venkat Vishwanath, Sam Foreman)
- Train (entirely from scratch) a series of models on publicly available data
- Evaluation
- Skills, trustworthiness, safety, robustness, privacy, machine ethics
- Post-Training
- Fine-tuning, alignment
- Inference
- Model serving, API development / public-facing web services
- Distribution
- Licensing, generating and distributing artifacts for public consumption
- Communication
Challenges
This is incredibly difficult in practice, due in part to:
- Brand new {hardware, architecture, software}
- Lack of native support in existing frameworks (though getting better!)
- General system stability
- 10k+ nodes × 12 XPUs/node ⇒ 100k+ XPUs
- network performance
- file system stability (impacted by other users!)
- many unexpected difficulties occur at increasingly large scales
- Combinatorial explosion of possible configurations and experiments
- {hyperparameters, architectures, tokenizers, learning rates, …}
AuroraGPT: Training
- Training a fixed model on trillions of tokens requires:
- Aggregating data from multiple different corpora (e.g. arXiv, Reddit, StackExchange, GitHub, Wikipedia)
- Sampling each training batch according to a fixed distribution across corpora
- Building indices that map batches of tokens into these files (indexing)
- The original indexing implementation was designed to run serially on a single device
- This made it a major bottleneck when debugging the data pipeline at scale
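The sampling and indexing steps above can be sketched in a few lines. This is a minimal illustration, not the AuroraGPT implementation: the function name `build_blend_index`, the corpus names, and the mixing weights are all hypothetical. The sketch greedily assigns each global sample to whichever corpus is currently furthest below its target share, so the running mix tracks the fixed distribution.

```python
import numpy as np


def build_blend_index(weights, num_samples):
    """Map each global sample to (corpus_id, sample_id_within_corpus).

    Greedy blend: at every step, pick the corpus whose realized count
    lags its target share the most.
    """
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()  # normalize to a distribution
    corpus_ids = np.empty(num_samples, dtype=np.int64)
    local_ids = np.empty(num_samples, dtype=np.int64)
    counts = np.zeros(len(weights), dtype=np.int64)
    for i in range(num_samples):
        # Target count after (i + 1) samples minus what has been assigned.
        deficit = weights * (i + 1) - counts
        c = int(np.argmax(deficit))
        corpus_ids[i] = c
        local_ids[i] = counts[c]
        counts[c] += 1
    return corpus_ids, local_ids


# Hypothetical corpora and mixing weights, for illustration only.
corpora = ["arxiv", "github", "wikipedia"]
corpus_ids, local_ids = build_blend_index([0.5, 0.3, 0.2], num_samples=1000)
realized = np.bincount(corpus_ids, minlength=len(corpora)) / len(corpus_ids)
print(dict(zip(corpora, realized.round(3))))
```

Note that the loop is inherently serial, which is the kind of cost that becomes painful when the number of samples reaches the trillions-of-tokens regime.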
AuroraGPT: Blending Data, Efficiently
- Original implementation:
- Slow (serial, single device)
- ~1 hr per 2T tokens
- New implementation:
- Fast! (distributed, asynchronous)
- ~2 min per 2T tokens (30× faster!)
Figure: Time spent preparing 2T tokens
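One way to remove a serial indexing bottleneck is to build the index for each corpus concurrently. The sketch below uses threads from the Python standard library as a stand-in for distributed ranks; it is an assumption about the general approach, not the actual implementation. The helper `build_corpus_index`, the corpus names, the random document lengths, and the sequence length of 1024 are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def build_corpus_index(doc_lengths, seq_len):
    """Return (doc_id, offset) start positions for fixed-length samples.

    Treats the corpus as one token stream and records where each
    seq_len-token sample begins.
    """
    doc_lengths = np.asarray(doc_lengths, dtype=np.int64)
    doc_starts = np.concatenate(([0], np.cumsum(doc_lengths)))
    total_tokens = int(doc_starts[-1])
    sample_starts = np.arange(0, total_tokens - seq_len + 1, seq_len)
    # Locate the document containing each sample's first token.
    doc_ids = np.searchsorted(doc_starts, sample_starts, side="right") - 1
    offsets = sample_starts - doc_starts[doc_ids]
    return np.stack([doc_ids, offsets], axis=1)


# Fake document lengths for three hypothetical corpora.
rng = np.random.default_rng(0)
corpora = {
    name: rng.integers(100, 2000, size=500)
    for name in ["arxiv", "github", "wikipedia"]
}

# Submit every corpus at once, then gather results as they finish.
with ThreadPoolExecutor(max_workers=len(corpora)) as pool:
    futures = {
        name: pool.submit(build_corpus_index, lengths, 1024)
        for name, lengths in corpora.items()
    }
    indices = {name: fut.result() for name, fut in futures.items()}

for name, index in indices.items():
    print(name, index.shape)
```

Because each corpus index is independent of the others, the work partitions cleanly; the same decomposition would apply across processes or nodes.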
Training AuroraGPT-7B on 2T Tokens
Train (grey) and validation (blue) loss vs number of consumed training tokens for AuroraGPT-7B on 64 nodes of Aurora.
Training AuroraGPT-2B on 7T Tokens

(new) Loss vs number of consumed training tokens for AuroraGPT-2B on 256 (blue) and 520 nodes (grey) of Aurora. Both runs show stability through 7T tokens.
Features
argonne-lcf/Megatron-DeepSpeed
- Parallelism:
- {data, tensor, pipeline, sequence, …}
- Checkpoint Converters:
- Megatron ⇄ 🤗 HF ⇄ ZeRO ⇄ Universal
- DeepSpeed Integration:
- ZeRO Offloading
- Activation checkpointing
- AutoTP (WIP)
- ability to leverage features from DeepSpeed community
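DeepSpeed features such as ZeRO offloading and activation checkpointing are typically switched on through a JSON configuration. The sketch below builds such a configuration as a Python dictionary. The keys are standard DeepSpeed configuration fields, but every value is illustrative and should not be read as the settings used for AuroraGPT.

```python
import json

# Illustrative values only; not the settings used for AuroraGPT.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # ZeRO-Offload: keep optimizer state in host memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": False,
    },
}

# Serialize to the JSON text that would be passed to the launcher.
config_text = json.dumps(ds_config, indent=2)
print(config_text)
```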
Features (even more!)
- 🧗 Optimizers (implemented by Marieme Ngom):
- Support for many different optimizers:
- Distributed Shampoo, Muon, Adopt, Sophia, Lamb, GaLORE, ScheduleFree, …
- See full list
- Large batch training
- Experiment Tracking:
- Automatic experiment and metric tracking with Weights & Biases
MProt-DPO
- Finalist: SC’24 ACM Gordon Bell Prize
- One of the first protein design toolkits that integrates:
- Text, (protein/gene) sequence, structure/conformational sampling modalities to build aligned representations for protein sequence-function mapping
Scaling Results (2024)
Figure: Scaling results for 3.5B model across ~38,400 GPUs
- ~ 4 EFLOPS @ Aurora
- 38,400 XPUs = 3200 [node] × 12 [XPU / node]
- Gordon Bell Finalist:
- MProt-DPO: Breaking the ExaFLOPS Barrier… (Dharuman et al. 2024)
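The device count above, and a rough per-device throughput, follow from simple arithmetic. The helper names below are hypothetical, and because the "~4 EFLOPS" figure is approximate, the per-XPU number is only a ballpark estimate.

```python
XPUS_PER_NODE = 12  # "XPU / node" figure stated on the slide


def total_xpus(nodes, xpus_per_node=XPUS_PER_NODE):
    """Number of XPUs in a job spanning `nodes` nodes."""
    return nodes * xpus_per_node


def per_xpu_tflops(total_eflops, xpus):
    """Average throughput per device; 1 EFLOPS = 1e6 TFLOPS."""
    return total_eflops * 1e6 / xpus


xpus = total_xpus(3200)
print(xpus)                               # 38400
print(round(per_xpu_tflops(4.0, xpus), 1))  # 104.2
```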
MProt-DPO: Scaling Results
3.5B model
7B model
Loooooooooong Sequence Lengths
- Working with Microsoft/DeepSpeed team to enable longer sequence lengths (context windows) for LLMs
- See my blog post for additional details
25B
33B
Maximum (achievable) SEQ_LEN for both 25B and 33B models (See: Song et al. 2023)
AERIS (2025)
Pixel-level Swin diffusion transformer in sizes from [1–80]B
High-Level Overview of AERIS

Rollout of AERIS model, specific humidity at 700m.
| Property | Description |
|---|---|
| Domain | Global |
| Resolution | 0.25° & 1.4° |
| Training Data | ERA5 (1979–2018) |
| Model Architecture | Swin Transformer |
| Speedup | O(10k–100k) |
Table: AERIS model + training setup. Speedup relative to PDE-based models (e.g. GFS).
Contributions
- Operates at the pixel level (1 × 1 patch size), guided by physical priors
- Medium-range forecast skill:
- Surpasses IFS ENS, competitive with GenCast (Price et al. 2024)
- Uniquely stable on seasonal scales to 90 days
- Enables scalable small-batch training on large supercomputers
- 10.21 ExaFLOPS
- @ 121,000 Intel XPUs (Aurora)
Demonstrated on up to 120,960 GPUs on Aurora and 8,064 GPUs on LUMI.
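A 1 × 1 patch size means every grid point becomes its own token, which makes the sequence very long at high resolution. The sketch below estimates the token count per field, assuming a regular latitude-longitude grid that includes both poles (the native ERA5 0.25° grid is 721 × 1440). Whether AERIS keeps or crops the pole row is not stated here, and the helper names are hypothetical.

```python
def grid_shape(resolution_deg, include_poles=True):
    """(n_lat, n_lon) for a regular lat-lon grid at the given spacing."""
    n_lon = round(360 / resolution_deg)
    n_lat = round(180 / resolution_deg) + (1 if include_poles else 0)
    return n_lat, n_lon


def num_tokens(n_lat, n_lon, patch=1):
    """Tokens per field when the grid is cut into patch x patch tiles."""
    return (n_lat // patch) * (n_lon // patch)


n_lat, n_lon = grid_shape(0.25)
print(n_lat, n_lon)                      # 721 1440
print(num_tokens(n_lat, n_lon, patch=1))  # 1038240
print(num_tokens(n_lat, n_lon, patch=4))  # 64800
```

At roughly a million tokens per field, a single sample no longer fits comfortably on one device, which motivates splitting the sequence across devices.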
Issues with the Deterministic Approach
- ✕ Transformers:
- Deterministic
- Single input → single forecast
- ✓ Diffusion:
- Probabilistic
- Single input → ensemble of forecasts
- Captures uncertainty and variability in weather predictions
- Enables ensemble forecasting for better risk assessment
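The contrast above can be made concrete with a toy example: a stochastic sampler, called repeatedly on the same input, yields an ensemble whose mean and spread summarize the forecast and its uncertainty. The sampler below is a deliberately simple stand-in and is not the AERIS model; the "dynamics" (`0.9 * condition`), step sizes, and noise schedule are invented for illustration.

```python
import numpy as np


def toy_sample_forecast(condition, rng, num_steps=64):
    """Toy stand-in for a conditional diffusion sampler (not AERIS).

    Starts from Gaussian noise and takes `num_steps` noisy refinement
    steps toward a made-up "next state" derived from the condition.
    """
    target = 0.9 * condition  # hypothetical dynamics, illustration only
    x = rng.standard_normal(condition.shape)
    for step in range(num_steps):
        # Injected noise shrinks to zero by the final step.
        noise_scale = 0.1 * (1.0 - (step + 1) / num_steps)
        x = x + 0.2 * (target - x) + noise_scale * rng.standard_normal(x.shape)
    return x


def ensemble_forecast(condition, num_members, seed=0):
    """Draw several samples for the same input; return mean and spread."""
    rng = np.random.default_rng(seed)
    members = np.stack(
        [toy_sample_forecast(condition, rng) for _ in range(num_members)]
    )
    return members, members.mean(axis=0), members.std(axis=0)


condition = np.linspace(-1.0, 1.0, 8)  # a tiny fake "current state"
members, mean, spread = ensemble_forecast(condition, num_members=16)
print(members.shape)  # (16, 8)
```

A deterministic model would return the same output on every call, so its ensemble spread would be zero by construction.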
Transitioning to a Probabilistic Model
Reverse diffusion with the input condition, individual sampling steps t0→t64, the next time step estimate and the target output.

Reverse Diffusion Process (N→π)

Forward Diffusion Process (π→N)
Sequence-Window-Pipeline Parallelism (SWiPe)
- SWiPe is a novel parallelism strategy for Swin-based Transformers
- Hybrid 3D parallelism strategy, combining:
- Sequence parallelism (SP)
- Window parallelism (WP)
- Pipeline parallelism (PP)
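To make the three axes concrete, the sketch below shows one generic way to lay out ranks on a 3D (PP, WP, SP) grid and to split a field into Swin-style windows that are then divided among window-parallel ranks. It is an illustration under stated assumptions, not the SWiPe implementation: the rank ordering, the contiguous window sharding, and all function names are hypothetical. See Hatanpää et al. 2025 for the actual design.

```python
import numpy as np


def rank_to_coords(rank, sp, wp, pp):
    """Decompose a flat rank into (pp_idx, wp_idx, sp_idx), SP fastest."""
    assert 0 <= rank < sp * wp * pp
    pp_idx, rest = divmod(rank, wp * sp)
    wp_idx, sp_idx = divmod(rest, sp)
    return pp_idx, wp_idx, sp_idx


def window_partition(x, window):
    """Split an (H, W, C) field into (window, window, C) tiles."""
    h, w, c = x.shape
    assert h % window == 0 and w % window == 0
    tiles = x.reshape(h // window, window, w // window, window, c)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, c)


def shard_windows(windows, wp):
    """Give each window-parallel rank a contiguous block of windows."""
    return np.array_split(windows, wp, axis=0)


field = np.arange(8 * 16 * 2).reshape(8, 16, 2)  # tiny fake (H, W, C) field
windows = window_partition(field, window=4)
shards = shard_windows(windows, wp=4)

print(windows.shape)                         # (8, 4, 4, 2)
print([s.shape[0] for s in shards])          # [2, 2, 2, 2]
print(rank_to_coords(5, sp=2, wp=2, pp=2))   # (1, 0, 1)
```

Since window attention only mixes tokens inside a window, windows can be processed on different ranks without exchanging data within the attention itself; communication is needed when windows shift between layers.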
SWiPe Communication Patterns
AERIS: Scaling Results
- 10 EFLOPs (sustained) @ 120,960 GPUs
- See (Hatanpää et al. 2025) for additional details
- arXiv:2509.13523
Hurricane Laura

Hurricane Laura tracks (top) and intensity (bottom). Initialized 7 (a), 5 (b), and 3 (c) days prior to 2020-08-28T00z.
References
- Dharuman, G., Hippe, K., Brace, A., Foreman, S., et al. (2024). MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization. SC '24. doi:10.1109/SC41406.2024.00013
- Hatanpää, V., Ku, E., Stock, J., Emani, M., Foreman, S., et al. (2025). AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions. arXiv:2509.13523
- Price, I., Sanchez-Gonzalez, A., Alet, F., et al. (2024). GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather. arXiv:2312.15796
- Song, S.L., Kruft, B., Zhang, M., et al. (2023). DeepSpeed4Science Initiative. arXiv:2310.04610
Acknowledgements
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
