AuroraGPT: Training Foundation Models on Supercomputers

Sam Foreman — Argonne National Laboratory 2025-12-16

samforeman.me/talks/2025/12/16/slides

AuroraGPT: Toolbox

Datasets and data pipelines (how do we deal with scientific data?)
Software infrastructure and workflows (scalable, robust, extensible)
Evaluation of state-of-the-art LLMs (how do they perform on scientific tasks?)

 ezpz

 saforem2/ezpz Write once, run anywhere

 Training

 argonne-lcf/Megatron-DeepSpeed For the largest of large language models

Running

 argonne-lcf/inference-endpoints Inference endpoints for LLMs, hosted @ ALCF

Team Leads

Planning

Rick Stevens (lead)

Ian Foster

Rinku Gupta

Mike Papka

Arvind Ramanathan

Fangfang Xia

Data

Ian Foster

Robert Underwood

Training

Venkat Vishwanath

Sam Foreman

Evaluation

Franck Cappello

Sandeep Madireddy

Bo Li

Post

Eliu Huerta

Azton Wells

Inference

Rajeev Thakur

Comms

Charlie Catlett

David Martin

Distribution

Brad Ullrich

Teams

Planning
Data Prep
- Accumulate 20T+ tokens of high-quality scientific text and structured data
Models / Training (co-led: Venkat Vishwanath, Sam Foreman)
- Train (entirely from scratch) a series of models on publicly available data
Evaluation
- Skills, trustworthiness, safety, robustness, privacy, machine ethics

Post-Training
- Fine-tuning, alignment
Inference
- Model serving, API development / public-facing web services
Distribution
- Licensing, generating and distributing artifacts for public consumption
Communication

Challenges

Training at this scale is incredibly difficult in practice, due in part to:

Brand new {hardware, architecture, software}
Lack of native support in existing frameworks (though getting better!)
General system stability: 10k+ nodes $\left(\times \frac{12\,\mathrm{XPU}}{1\,\mathrm{Node}}\right) \Rightarrow$ 100k+ XPUs
- network performance
- file system stability (impacted by other users!)
- many unexpected difficulties occur at increasingly large scales
Combinatorial explosion of possible configurations and experiments
- {hyperparameters, architectures, tokenizers, learning rates, …}

AuroraGPT: Training

Training a fixed model on trillions of tokens requires:
1. Aggregating data from multiple different corpora (e.g. ArXiv, Reddit, StackExchange, GitHub, Wikipedia, etc.)
2. Sampling each training batch according to a fixed distribution across corpora
3. Building indices that map batches of tokens into these files (indexing)
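The three steps above can be sketched in a few lines. This is a minimal illustration, not the Megatron-DeepSpeed implementation: the corpus names, sizes, weights, and the `build_blend_index` helper are all hypothetical.

```python
import random


def build_blend_index(corpus_sizes, weights, num_samples, seed=0):
    """Return a list of (corpus_name, sample_idx) pairs.

    Each entry picks a corpus according to `weights` (the fixed
    distribution across corpora), then takes the next unread sample
    from that corpus, wrapping around when the corpus is exhausted.
    """
    rng = random.Random(seed)
    names = list(corpus_sizes)
    probs = [weights[name] for name in names]
    cursors = {name: 0 for name in names}
    index = []
    for _ in range(num_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        index.append((name, cursors[name] % corpus_sizes[name]))
        cursors[name] += 1
    return index


# Hypothetical corpora: number of samples per corpus and blend weights.
corpus_sizes = {"arxiv": 1000, "github": 500, "wikipedia": 250}
weights = {"arxiv": 0.5, "github": 0.3, "wikipedia": 0.2}

index = build_blend_index(corpus_sizes, weights, num_samples=10_000)

# Group the flat index into fixed-size training batches.
batch_size = 8
batches = [index[i:i + batch_size] for i in range(0, len(index), batch_size)]
```

Because the index is built once from a seed, every rank can reconstruct the same batch order without communicating.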

The original implementation was slow:

Designed to run serially on a single device
Major bottleneck when debugging data pipeline at scale

AuroraGPT: Blending Data, Efficiently

Original implementation:
- Slow (serial, single device)
- ~ 1 hr/2T tokens
New implementation:
- Fast! (distributed, asynchronous)
- ~ 2 min/2T tokens (30× faster !!)
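One way to picture the move from serial to concurrent index building is to submit each corpus as an independent task. The sketch below is an assumption about the general shape of such a change, not the actual implementation: `build_corpus_index` and the toy document lengths are hypothetical, and a real run would spread the work across processes or ranks rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def build_corpus_index(doc_lengths):
    """Token offset at which each document starts (exclusive prefix sum)."""
    return list(accumulate([0] + list(doc_lengths[:-1])))


def build_all_serial(corpora):
    """Original pattern: one corpus after another."""
    return {name: build_corpus_index(lens) for name, lens in corpora.items()}


def build_all_concurrent(corpora, max_workers=4):
    """New pattern: submit every corpus at once, then collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(build_corpus_index, lens)
            for name, lens in corpora.items()
        }
        return {name: fut.result() for name, fut in futures.items()}


# Hypothetical per-document token counts.
corpora = {
    "arxiv": [120, 80, 300],
    "github": [50, 50],
    "wikipedia": [10, 20, 30, 40],
}
serial = build_all_serial(corpora)
concurrent = build_all_concurrent(corpora)

# Reported timings: ~1 hr vs ~2 min per 2T tokens.
speedup = 60 / 2
```

Both paths produce identical indices; only the scheduling differs, which is what makes the swap safe to validate at small scale.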

Figure: Time spent preparing 2T tokens

Training AuroraGPT-7B on 2T Tokens

Train (grey) and validation (blue) loss vs number of consumed training tokens for AuroraGPT-7B on 64 nodes of Aurora.

Training AuroraGPT-2B on 7T Tokens

(new) Loss vs number of consumed training tokens for AuroraGPT-2B on 256 (blue) and 520 nodes (grey) of Aurora. Both runs show stability through 7T tokens.

Features

 argonne-lcf/Megatron-DeepSpeed

Parallelism:
- {data, tensor, pipeline, sequence, …}
 Checkpoint Converters:
- Megatron ⇄ 🤗 HF ⇄ ZeRO ⇄ Universal
 DeepSpeed Integration:
- ZeRO Offloading
- Activation checkpointing
- AutoTP (WIP)
- ability to leverage features from DeepSpeed community
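As a rough sketch of how these pieces fit together, the snippet below derives a data-parallel degree from the other parallel dimensions and builds a DeepSpeed-style config enabling ZeRO offloading and activation checkpointing. The specific values (node count, parallel degrees, ZeRO stage, batch size) are illustrative assumptions, not the settings used for AuroraGPT, and `data_parallel_size` is a hypothetical helper.

```python
import json


def data_parallel_size(world_size, tensor=1, pipeline=1, sequence=1):
    """Devices left over for data parallelism once the model is sharded."""
    model_parallel = tensor * pipeline * sequence
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tensor * pipeline * sequence")
    return world_size // model_parallel


# Illustrative job: 64 nodes x 12 XPUs per node.
world_size = 64 * 12
dp = data_parallel_size(world_size, tensor=2, pipeline=4)

# DeepSpeed JSON config expressed as a Python dict (values are assumptions).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        # ZeRO offloading: keep optimizer state in host memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": False,
    },
}

print(f"data-parallel replicas: {dp}")
print(json.dumps(ds_config, indent=2))
```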

Features (even more!)

🧗 Optimizers (implemented by Marieme Ngom):
- Support for many different optimizers:
  - Distributed Shampoo, Muon, ADOPT, Sophia, LAMB, GaLore, ScheduleFree, …
- See full list
- Large batch training
 Experiment Tracking:
- Automatic experiment and metric tracking with Weights & Biases

MProt-DPO

Finalist: SC’24 ACM Gordon Bell Prize
- MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization (Dharuman et al. 2024)
One of the first protein design toolkits that integrates:
- Text, (protein/gene) sequence, structure/conformational sampling modalities to build aligned representations for protein sequence-function mapping

Scaling Results (2024)

Figure: Scaling results for 3.5B model across ~38,400 GPUs

~ 4 EFLOPS @ Aurora
38,400 XPUs = 3200 [node] × 12 [XPU / node]
 Gordon Bell Finalist:
- MProt-DPO: Breaking the ExaFLOPS Barrier… (Dharuman et al. 2024)

MProt-DPO: Scaling Results

3.5B model

7B model

Loooooooooong Sequence Lengths

Working with the Microsoft/DeepSpeed team to enable longer sequence lengths (context windows) for LLMs
- See my blog post for additional details

25B

33B

Maximum (achievable) SEQ_LEN for both 25B and 33B models (See: Song et al. 2023)

AERIS (2025)

arXiv:2509.13523

Pixel-level Swin diffusion transformer, in sizes ranging from 1B to 80B parameters

High-Level Overview of AERIS

Rollout of AERIS model, specific humidity at 700 hPa.

Property	Description
Domain	Global
Resolution	0.25° & 1.4°
Training Data	ERA5 (1979–2018)
Model Architecture	Swin Transformer
Speedup	O(10k–100k)

Table: AERIS model + training setup. Speedup relative to PDE-based models (e.g. GFS).

Contributions

 AERIS

First billion-parameter diffusion model for weather + climate

Operates at the pixel level (1 × 1 patch size), guided by physical priors
Medium-range forecast skill:
- Surpasses IFS ENS, competitive with GenCast (Price et al. 2024)
- Uniquely stable on seasonal scales to 90 days
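To get a feel for what a 1 × 1 patch size implies, the sketch below counts tokens per global field at different patch sizes. It assumes the standard ERA5 0.25° latitude-longitude grid of 721 × 1440 points; how AERIS actually crops or pads the grid is not stated here, and `num_tokens` is a hypothetical helper.

```python
import math


def num_tokens(height, width, patch):
    """Tokens per field when the grid is cut into patch x patch tiles."""
    return math.ceil(height / patch) * math.ceil(width / patch)


# Assumed ERA5 0.25 degree grid: 721 latitudes x 1440 longitudes.
H, W = 721, 1440

for patch in (1, 2, 4, 8):
    print(f"patch {patch}x{patch}: {num_tokens(H, W, patch):,} tokens")
```

At patch size 1 the sequence is over a million tokens per field, which is why windowed attention and a dedicated parallelism strategy matter at this resolution.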

 SWiPe

A novel 3D (sequence-window-pipeline) parallelism strategy for training transformers across high-resolution inputs

Enables scalable small-batch training on large supercomputers
- 10.21 ExaFLOPS
- @ 120,960 Intel XPUs (Aurora)

Demonstrated on up to 120,960 GPUs on Aurora and 8,064 GPUs on LUMI.
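As a back-of-the-envelope check, the aggregate figure can be converted into a per-device rate. The sketch takes the 10.21 ExaFLOPS and 120,960-device numbers from above and assumes 12 XPUs per Aurora node; `per_device_tflops` is a hypothetical helper.

```python
def per_device_tflops(total_eflops, num_devices):
    """Average throughput per device, in TFLOPS."""
    return total_eflops * 1e18 / num_devices / 1e12


xpus = 120_960
nodes = xpus // 12  # assuming 12 XPUs per node
tflops = per_device_tflops(10.21, xpus)

print(f"{nodes:,} nodes, about {tflops:.1f} TFLOPS per XPU")
```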

Issues with the Deterministic Approach

✕ Transformers:
- Deterministic
- Single input → single forecast

✓ Diffusion:
- Probabilistic
- Single input → ensemble of forecasts
- Captures uncertainty and variability in weather predictions
- Enables ensemble forecasting for better risk assessment
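The "single input → ensemble of forecasts" idea can be illustrated with a toy stochastic sampler. Everything here is a stand-in: `toy_sampler` simply adds Gaussian noise and is not a weather model or the AERIS sampler; the point is only how an ensemble mean and spread fall out of repeated sampling from the same input.

```python
import random
import statistics


def toy_sampler(x, rng):
    """Stand-in for one stochastic rollout: same input, different noise."""
    return x + rng.gauss(0.0, 1.0)


def ensemble_forecast(x, members=32, seed=0):
    """Run the sampler once per member, each with its own random stream."""
    rngs = [random.Random(seed + m) for m in range(members)]
    return [toy_sampler(x, rng) for rng in rngs]


forecasts = ensemble_forecast(x=10.0)
mean = statistics.fmean(forecasts)     # best-guess forecast
spread = statistics.stdev(forecasts)   # uncertainty estimate

print(f"ensemble mean = {mean:.2f}, spread = {spread:.2f}")
```

A deterministic model would return the same value for every member, so its spread is zero and it carries no uncertainty information.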

Transitioning to a Probabilistic Model

Reverse diffusion with the input condition, individual sampling steps $t_{0} \rightarrow t_{64}$, the next time step estimate, and the target output.

Reverse Diffusion Process ( $\mathcal{N}\rightarrow \pi$ )

Forward Diffusion Process ( $\pi\rightarrow \mathcal{N}$ )
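For orientation, one common way to write the two processes is the variance-preserving (DDPM-style) form below. This is a generic textbook parameterization, assumed here for illustration only: the noise schedule $\bar{\alpha}_t$, the condition $c$, and the learned model $p_\theta$ are not taken from the slides, and AERIS may use a different formulation and step indexing.

```latex
% Forward process (\pi \rightarrow \mathcal{N}): progressively noise a sample x_0 \sim \pi
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I),
\qquad \bar{\alpha}_T \to 0 \;\Rightarrow\; x_T \to \epsilon

% Reverse process (\mathcal{N} \rightarrow \pi): denoise step by step, conditioned on the input c
x_T \sim \mathcal{N}(0, I),
\qquad x_{t-1} \sim p_\theta\left(x_{t-1} \mid x_t, c\right),
\qquad t = T, \dots, 1
```

Drawing several independent $x_T$ for the same condition $c$ is what produces an ensemble of forecasts from a single input.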

Sequence-Window-Pipeline Parallelism `SWiPe`

SWiPe is a novel parallelism strategy for Swin-based Transformers
Hybrid 3D Parallelism strategy, combining:
- Sequence parallelism (SP)
- Window parallelism (WP)
- Pipeline parallelism (PP)
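A 3D parallel layout gives every rank a coordinate along each of the three axes. The sketch below shows one possible mapping between a flat rank and (PP, WP, SP) indices; the axis ordering (SP varying fastest) and the group sizes are assumptions made for illustration, not the layout used by SWiPe.

```python
def rank_to_coords(rank, sp, wp, pp):
    """Map a flat rank to (pp_idx, wp_idx, sp_idx), with SP varying fastest."""
    if not 0 <= rank < sp * wp * pp:
        raise ValueError("rank out of range for the given group sizes")
    sp_idx = rank % sp
    wp_idx = (rank // sp) % wp
    pp_idx = rank // (sp * wp)
    return pp_idx, wp_idx, sp_idx


def coords_to_rank(pp_idx, wp_idx, sp_idx, sp, wp):
    """Inverse of rank_to_coords."""
    return (pp_idx * wp + wp_idx) * sp + sp_idx


# Illustrative group sizes: 2 x 4 x 3 = 24 ranks per model replica.
SP, WP, PP = 2, 4, 3
coords = [rank_to_coords(r, SP, WP, PP) for r in range(SP * WP * PP)]

# Ranks sharing (pp_idx, wp_idx) form one sequence-parallel group.
sp_group_of_rank_5 = [
    r for r, (p, w, _) in enumerate(coords) if (p, w) == coords[5][:2]
]
print(coords[5], sp_group_of_rank_5)
```

Placing the most communication-heavy axis on the fastest-varying index keeps those ranks adjacent, which is a common design choice when mapping groups onto nodes.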

SWiPe Communication Patterns

AERIS: Scaling Results

10 EFLOPs (sustained) @ 120,960 GPUs
See (Hatanpää et al. 2025) for additional details
arXiv:2509.13523

Hurricane Laura

Hurricane Laura tracks (top) and intensity (bottom). Initialized 7 (a), 5 (b), and 3 (c) days prior to 2020-08-28T00Z.

References

Dharuman, G., Hippe, K., Brace, A., Foreman, S., et al. (2024). MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization. SC'24. doi:10.1109/SC41406.2024.00013
Hatanpää, V., Ku, E., Stock, J., Emani, M., Foreman, S., et al. (2025). AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions. arXiv:2509.13523
Price, I., Sanchez-Gonzalez, A., Alet, F., et al. (2024). GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather. arXiv:2312.15796
Song, S.L., Kruft, B., Zhang, M., et al. (2023). DeepSpeed4Science Initiative. arXiv:2310.04610

Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
