Training Foundation Models on Supercomputers

Sam Foreman, Oct 24, 2025

 


🧑🏻‍💻 About Me
Argonne Leadership Computing Facility (ALCF)
- 🏗️ Aurora
- 🤖 ALCF AI Testbed
🌌 AuroraGPT (2024–)
🧬 MProt-DPO
🌎 AERIS (2025)
📓 References
❤️ Acknowledgements
Extras

🧑🏻‍💻 About Me

🏡 samforeman.me
UIUC (2015):
- Engineering Physics + Applied Mathematics
University of Iowa (2015–2019):
- PhD, Physics¹
ANL (2019–2022): Postdoctoral Researcher
ANL (2022–Present): Assistant Computational Scientist
- Member of the AI/ML Group at ALCF

Current Research:

AuroraGPT: Foundation Models for Science
AERIS: Argonne’s Earth System Model
- Finalist for the 2025 ACM Gordon Bell Prize in Climate Modeling
MProt-DPO: Multimodal Protein Design
- Finalist for the 2024 ACM Gordon Bell Prize
GenSLMs: Genome-Scale Language Models
- Winner of the ACM Gordon Bell Special Prize for HPC-Based COVID-19 Research

Argonne Leadership Computing Facility (ALCF)

The ALCF enables breakthroughs in science and engineering by providing supercomputing resources and expertise to the research community. –alcf.anl.gov

Forward Diffusion Process ($\pi \rightarrow \mathcal{N}$)

🌀 Sequence-Window-Pipeline Parallelism `SWiPe`

SWiPe is a novel parallelism strategy for Swin-based Transformers
Hybrid 3D Parallelism strategy, combining:
- Sequence parallelism (SP)
- Window parallelism (WP)
- Pipeline parallelism (PP)
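
To make the "hybrid 3D" idea concrete, here is a minimal sketch of how a set of ranks could be arranged on a three-axis grid, with one communication group per axis. It illustrates 3D rank decomposition in general and is not the AERIS implementation: the function names (`build_rank_grid`, `groups_along`), the axis ordering (SP varying fastest), and the example sizes are all assumptions made for illustration.

```python
from itertools import product


def build_rank_grid(sp: int, wp: int, pp: int) -> dict[int, tuple[int, int, int]]:
    """Map each global rank to hypothetical (pp, wp, sp) coordinates.

    Assumption: SP is the fastest-varying axis, then WP, then PP.
    """
    grid = {}
    for rank, coords in enumerate(product(range(pp), range(wp), range(sp))):
        grid[rank] = coords
    return grid


def groups_along(grid: dict[int, tuple[int, int, int]], axis: int) -> list[list[int]]:
    """Collect ranks that differ only along `axis` (0 = PP, 1 = WP, 2 = SP)."""
    groups: dict[tuple[int, ...], list[int]] = {}
    for rank, coords in grid.items():
        # Ranks sharing every coordinate except `axis` belong to one group.
        key = coords[:axis] + coords[axis + 1:]
        groups.setdefault(key, []).append(rank)
    return sorted(groups.values())


# Illustrative sizes only: 2-way SP x 2-way WP x 3-stage PP = 12 ranks.
grid = build_rank_grid(sp=2, wp=2, pp=3)
sp_groups = groups_along(grid, axis=2)
wp_groups = groups_along(grid, axis=1)
pp_groups = groups_along(grid, axis=0)

print("SP groups:", sp_groups)
print("WP groups:", wp_groups)
print("PP groups:", pp_groups)
```

In a real training job, each of these rank lists would typically back a separate communicator, so that collectives for one form of parallelism involve only the ranks on that axis.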

Figure 17

Figure 18: SWiPe Communication Patterns

🚀 AERIS: Scaling Results

Figure 19: AERIS: Scaling Results

10 EFLOPs (sustained) @ 120,960 GPUs
See Hatanpää et al. (2025) for additional details
arXiv:2509.13523
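
As a rough sanity check on the headline number, the sustained throughput can be divided by the GPU count to get an average per-GPU rate. This sketch assumes "10 EFLOPs" means 10 × 10^18 floating-point operations per second and takes the rounded figure at face value; the variable names are illustrative.

```python
# Figures quoted above; the rounded 10 EFLOPs value is taken at face value.
sustained_flops = 10e18  # assumed to mean 10 x 10^18 FLOP/s
num_gpus = 120_960

# Average sustained throughput per GPU, in TFLOP/s.
per_gpu_tflops = sustained_flops / num_gpus / 1e12
print(f"{per_gpu_tflops:.1f} TFLOP/s per GPU")  # prints "82.7 TFLOP/s per GPU"
```

This is only an average implied by the two quoted numbers; it says nothing about precision, peak rates, or how throughput was measured, for which the paper is the authority.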

🌪️ Hurricane Laura

Figure 20: Hurricane Laura tracks (top) and intensity (bottom). Initialized (a) 7, (b) 5, and (c) 3 days prior to 2020-08-28T00Z.

📓 References

Dharuman, Gautham, Kyle Hippe, Alexander Brace, et al. 2024. “MProt-DPO: Breaking the ExaFLOPS Barrier for Multimodal Protein Design Workflows with Direct Preference Optimization.” Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (Atlanta, GA, USA), SC ’24. https://doi.org/10.1109/SC41406.2024.00013.

Hatanpää, Väinö, Eugene Ku, Jason Stock, et al. 2025. AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions. https://arxiv.org/abs/2509.13523.

Price, Ilan, Alvaro Sanchez-Gonzalez, Ferran Alet, et al. 2024. GenCast: Diffusion-Based Ensemble Forecasting for Medium-Range Weather. https://arxiv.org/abs/2312.15796.

Song, Shuaiwen Leon, Bonnie Kruft, Minjia Zhang, et al. 2023. DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery Through Sophisticated AI System Technologies. https://arxiv.org/abs/2310.04610.

❤️ Acknowledgements

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.