 Command

Sam Foreman's personal site. Vim-style keybinds for navigation; theme + font pickers below.

Theme
 Font
Keybinds
Navigation
j / ↓ Next item k / ↑ Previous item g First item in region G Last item in region zz Center focused item h / l Move left/right region ] / [ Next/previous heading } / { Next/previous block ⌃D / ⌃U Half-page down/up
Layout
<zh> / <zl> Toggle left/right sidebar <zj> / <zk> Focus main/navbar <S-h/j/k/l> Focus left/main/navbar/right ⌃H / ⌃L Focus left/right sidebar ⌃J / ⌃K Focus main/navbar ⇧C / ⇧E Collapse / expand all sections
Dialogs
⌃P / : Command palette ⌃X Theme picker / Search ? Show keybinds Esc / ⌃C Close dialog
History
⌃N Next document ⌃B Previous document ⌃O History back ⌃I History forward
 Search
about: Sam Foreman docs/test: Docs Test ideas: 💡 Ideas about/more: 🪪 More now: Now more: ➕ More posts: 📬 Posts projects: 📚 Projects talks: 🎙️ Talks webtui: Style posts/2025: 📆 2025 posts/auroragpt: 🤖 AuroraGPT posts/ai-for-physics: ⚛️ AI for Physics posts/dope-slides: 💅 How to Make Dope Slides posts/ezpz-at-alcf: 🍋 ezpz @ ALCF posts/ezpz-v1: 📝 ezpz-v1 posts/jupyter: 📗 Jupyter posts/resume: 🧑🏻‍💻 Sam Foreman’s Résumé posts/svgbob: 🫥 svgbob posts/torchtune-aurora: 🪛 Torchtune on Aurora posts/torchtune-patch-aurora: 🚑 Torchtune Patch on Aurora talks/auroragpt-siam25: AuroraGPT talks/ai-for-science-2024: Parallel Training Methods talks/aurora-gpt-fm-for-electric-grid/auroragpt-fm-for-electric-grid: AuroraGPT: Foundation Models for Science talks/hpc-user-forum/auroragpt: AuroraGPT talks/alcf-hpc-workshop-2024/alcf-hpc-workshop-2024: Deep Learning and Foundation Models at Scale talks/demo-slides: AuroraGPT: Training Foundation Models on Supercomputers talks/incite-hackathon-2025: ALCF Incite Hackathon 2025 talks/llms-at-scale: Training LLMs at Scale talks/llms-on-polaris: Training LLMs on Polaris talks/openskai25: Open SkAI2025 webtui/components/accordion: Accordion webtui/components/badge: Badge webtui/components/button: Button webtui/components/checkbox: Checkbox webtui/components/dialog: Dialog webtui/components/input: Input webtui/components/popover: Popover webtui/components/pre: Pre webtui/components/progress: Progress webtui/components/radio: Radio webtui/components/range: Range webtui/components/separator: Separator webtui/components/spinner: Spinner webtui/components/switch: Switch webtui/components/table: Table webtui/components/textarea: Textarea webtui/components/tooltip: Popover webtui/components/typography: Typography webtui/components/view: View webtui/contributing/contributing: Contributing webtui/contributing/contributing: ## Local Development webtui/contributing/contributing: ## Issues webtui/contributing/contributing: ## Pull Requests webtui/contributing/style-guide: Style Guide webtui/contributing/style-guide: ## CSS Units webtui/contributing/style-guide: ## Selectors webtui/contributing/style-guide: ## Documentation webtui/installation/astro: Astro webtui/installation/astro: ## Scoping webtui/installation/astro: ### Frontmatter Imports webtui/installation/astro: ### <style> tag webtui/installation/astro: ### Full Library Import webtui/installation/nextjs: Next.js webtui/installation/vite: Vite webtui/start/ascii-boxes: ASCII Boxes webtui/start/changelog: Changelog webtui/start/installation: Installation webtui/start/installation: ## Installation webtui/start/installation: ## Using CSS webtui/start/installation: ## Using ESM webtui/start/installation: ## Using a CDN webtui/start/installation: ## Full Library Import webtui/start/installation: ### CSS webtui/start/installation: ### ESM webtui/start/installation: ### CDN webtui/start/intro: Introduction webtui/start/intro: ## Features webtui/start/plugins: Plugins webtui/start/plugins: ## Official Plugins webtui/start/plugins: ### Themes webtui/start/plugins: ## Community Plugins webtui/start/theming: Theming webtui/start/theming: ## CSS Variables webtui/start/theming: ### Font Styles webtui/start/theming: ### Colors webtui/start/theming: ### Light & Dark webtui/start/theming: ## Theme Plugins webtui/start/theming: ### Using Multiple Theme Accents webtui/start/tuis-vs-guis: TUIs vs GUIs webtui/start/tuis-vs-guis: ## Monospace Fonts webtui/start/tuis-vs-guis: ## Character Cells webtui/plugins/plugin-nf: Nerd Font Plugin webtui/plugins/plugin-dev: Developing Plugins webtui/plugins/plugin-dev: ### Style Layers webtui/plugins/theme-catppuccin: Catppuccin Theme webtui/plugins/theme-custom: Custom Theme webtui/plugins/theme-everforest: Everforest Theme webtui/plugins/theme-gruvbox: Gruvbox Theme webtui/plugins/theme-nord: Nord Theme webtui/plugins/theme-vitesse: Vitesse Theme posts/2025/06: 06 posts/auroragpt/aurora-gpt: 🏎️ Megatron-DeepSpeed on Intel XPU posts/auroragpt/determinstic-flash-attn/deterministic-flash-attn: 🎰 Deterministic `flash-attn` posts/auroragpt/flash-attn-sunspot: 📸 `flash-attn` on Sunspot posts/auroragpt/long-sequences: 🚂 Loooooooong Sequence Lengths posts/auroragpt/checkpoints: 💾 Converting Checkpoints posts/auroragpt/spike-skipper: 🏔️ Spike Skipper posts/auroragpt/mpi4py-reproducer: 🐛 `mpi4py` bug on Sunspot posts/auroragpt/startup-times: 🐢 Starting Up Distributed Training on Aurora posts/auroragpt/startup-times: ## Response posts/auroragpt/startup-times: ### Measuring / Calculating Startup Time posts/auroragpt/startup-times: ## Minimal Working Example posts/ai-for-physics/diffusion: 🎲 MCMC + Diffusion Sampling posts/ai-for-physics/l2hmc-qcd: 🎢 L2HMC for LQCD posts/jupyter/test: 🏁 `l2hmc` Example: 2D $U(1)$ talks/auroragpt/alcf-hpc-workshop-2024/auroragpt-alcf-hands-on-hpc-workshop-2024: AuroraGPT: ANL's General Purpose Scientific LLM posts/jupyter/l2hmc-4dsu3: 🔳 `l2hmc-qcd` Example: 4D SU(3) talks/incite-hackathon-2025/auroragpt: LLMs on Aurora: Overview talks/incite-hackathon-2025/ezpz: LLMs on Aurora: Hands-On talks/openskai25/ai4science: Scientific AI at Scale: AuroraGPT posts/2025/04/28: 🔥 Building PyTorch 2.6 from Source on Aurora talks/openskai25/training: Scientific AI at Scale: Distributed Training posts/2025/05/03: 🚧 Frameworks Issue with numpy \> 2 posts/2025/06/01: 📰 Nice Headings posts/2025/10/06: 🎨 Mixing Between Distributions While Training posts/2025/06/14: 🏗️ Building PyTorch 2.8 from Source on Aurora posts/2025/09/12: 🍹 BlendCorpus + TorchTitan @ ALCF posts/2025/11/12: 🧊 Cooling Down Checkpoints: Best Practices for Model Evaluation posts/2026/01/10: 🍋 ezpz: distributed PyTorch across any hardware posts/2025/06/02: 🧜‍♀️ Mermaid posts/2025/09/17: 📊 `pbs-tui`: TUI for PBS Job Scheduler Monitoring posts/2026/05/01: Running 50k Python Processes on Aurora with ezpz yeet posts/2026/05/01: ## What it does posts/2026/05/01: ## CLI surface posts/2026/05/01: ### Choosing a local copy method posts/2026/05/01: ### Tarball source posts/2026/05/01: ### Generic (non-venv) sources posts/2026/05/01: ## How it works posts/2026/05/01: ### Local copy + patch posts/2026/05/01: ### Greedy fan-out posts/2026/05/01: ## Scaling on Aurora: 8 → 4096 nodes posts/2026/05/01: ### Two regimes posts/2026/05/01: ### Why tarball broadcast scales so much better than per-file rsync posts/2026/05/01: ## Reproducing posts/2026/05/01: ## Complete workflow posts/2026/05/01: ## See also posts/2026/01/07: 🎉 Happy New Year! posts/2026/02/28: ⏱️ Comparing Launchers on Aurora posts/2026/02/28: ## torchrun posts/2026/02/28: ## ezpz posts/2026/04/27: Pre-Training AuroraGPT with TorchTitan posts/2026/04/27: ## Two-Week Summary (Apr 12–27, 2026) posts/2026/04/27: ## Detailed Breakdown posts/2026/04/27: ### Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes posts/2026/04/27: #### Benchmarking (Apr 12–15) posts/2026/04/27: #### LR Finder (Apr 12–14) posts/2026/04/27: #### Scaling Study (Apr 12) posts/2026/04/27: #### Upstream Syncs (Apr 12–18, syncs 6–14) posts/2026/04/27: #### XPU Bug Fixes (Apr 18) posts/2026/04/27: #### RL Experiment (Apr 18) posts/2026/04/27: ### Week 1.5: Apr 18–25 — Production Readiness posts/2026/04/27: #### Torch 2.12 Benchmarks (Apr 18) posts/2026/04/27: #### LR Finder Extensions (Apr 20–21) posts/2026/04/27: #### XPU Fixes (Apr 23) posts/2026/04/27: #### Torch 2.13 Environment (Apr 25) posts/2026/04/27: #### 2B Scaling Study on Torch 2.13 (Apr 25) posts/2026/04/27: #### Production Training (Apr 25) posts/2026/04/27: ### Week 2: Apr 26–27 — Optimizer Competition posts/2026/04/27: #### RL Multi-Task Refactor (Apr 26) posts/2026/04/27: #### Docs Reorganization (Apr 26) posts/2026/04/27: #### Generic HF Dataset Streaming (Apr 26) posts/2026/04/27: #### New Optimizers (Apr 26) posts/2026/04/27: #### Architecture Tweaks (Apr 26–27) posts/2026/04/27: ## Competition Results posts/2026/04/27: ### Round 1–3: Speedrun — 2N, GBS=48, 1000 steps posts/2026/04/27: ### 10B Full Training — 8N, GBS=384, ~3,178 steps posts/2026/04/27: ### Round 4: Reproducible Speedrun — 2N, GAS=8, GBS=384, 1000 steps posts/2026/04/27: ## Key Discoveries posts/2026/04/27: ## Infrastructure Built posts/2026/04/27: ## High-Level posts/2026/04/27: ## Detailed Breakdown posts/2026/04/27: ### Week 1: Apr 12–18 — Benchmarking, LR Finder, XPU Fixes posts/2026/04/27: #### Benchmarking (Apr 12–15) posts/2026/04/27: #### LR Finder (Apr 12–14) posts/2026/04/27: #### Scaling Study (Apr 12) posts/2026/04/27: #### Upstream Syncs (Apr 12–18, syncs 6–14) posts/2026/04/27: #### XPU Bug Fixes (Apr 18) posts/2026/04/27: #### RL Experiment (Apr 18) posts/2026/04/27: ### Week 1.5: Apr 18–25 — Production Readiness posts/2026/04/27: #### Torch 2.12 Benchmarks (Apr 18) posts/2026/04/27: #### LR Finder Extensions (Apr 20–21) posts/2026/04/27: #### XPU Fixes (Apr 23) posts/2026/04/27: #### Torch 2.13 Environment (Apr 25) posts/2026/04/27: #### 2B Scaling Study on Torch 2.13 (Apr 25) posts/2026/04/27: #### Production Training (Apr 25) posts/2026/04/27: ### Week 2: Apr 26–27 — Optimizer Competition posts/2026/04/27: #### RL Multi-Task Refactor (Apr 26) posts/2026/04/27: #### Docs Reorganization (Apr 26) posts/2026/04/27: #### Generic HF Dataset Streaming (Apr 26) posts/2026/04/27: #### New Optimizers (Apr 26) posts/2026/04/27: #### Architecture Tweaks (Apr 26–27) posts/2026/04/27: ## Competition Results posts/2026/04/27: ### Round 1–3: 1000-step speedruns, 2 nodes, GBS=48 (17 configs) posts/2026/04/27: ### Round 4 (10B full training, 8 nodes, GBS=384, 5 configs) posts/2026/04/27: ### Round 5 (2 nodes, GAS=8, GBS=384, local dataset, 8 configs — in progress) posts/2026/04/27: ## Key Discoveries posts/2026/04/27: ## Infrastructure Built posts/ai-for-physics/l2hmc-qcd/2du1: 🎢 l2hmc-qcd Example: 2D U(1) posts/jupyter/l2hmc/4dsu3: 🔳 l2hmc-qcd Example: 4D SU(3) talks/2025/10/08: AERIS: Argonne's Earth Systems Model posts/ai-for-physics/l2hmc-qcd/4dsu3nb/index-broken: 🕸️ l2hmc-qcd Example: 4D SU(3) talks/2025/10/15: Training Foundation Models on Supercomputers talks/2025/09/24: Training Foundation Models on Supercomputers talks/2025/10/24: Training Foundation Models on Supercomputers talks/2026/06/03: Production Pre-Training at Scale: The Good, the Bad, and the Restarts talks/2025/12/16: AuroraGPT: Training Foundation Models on Supercomputers posts/drafts/2025/09/22: 📝 2025 Annual Report
 Theme Current: Light j/k or ↑/↓ + Enter

📸 `flash-attn` on Sunspot

Debugging flash attention discrepancies on Sunspot and documenting framework comparison results with Intel.

Sam Foreman 2024-06-17

Update: 2024-06-16

After an interactive debug session with Intel, the root behavior of the apparent discrepancy was identified.

In particular, we found that the ALCF/Megatron-DeepSpeed repo was NOT explicitly setting the dropout values to 0.0 (and so, was using the default values of 0.1) for both --attention-dropout and --hidden-dropout.

After making this change, the losses were observed to agree, as can be seen below in

Figure 1: After correctly setting the dropout values, the loss curves were observed to agree.

🐛 Impact on Loss [Bug?]

In the q4-drop, it was observed that toggling flash-attn on / off seemed to produce different loss curves (with otherwise identical configs)

shared-config.yaml

TP: 1
PP: 1
GAS: 1
OPT: adamw
dtype: bf16
NLAYERS: 10
MICRO_BATCH: 2
WORLD_SIZE: 24

This can be seen clearly in the figure below:

This was identified, and to be addressed in upcoming release.

📦 LLM Framework Release

On 05/14/2024, Intel dropped their new LLM frameworks release:

🎁 frameworks_2024_5_v2 Announcement:

Hi Venkat,

We have shared the official Q2 release in two different forms :

Manual Setup: /gila/Aurora_deployment/anl_24_q2_release.tar.gz

and

Module:

module use -a /home/jmitche1/anl_release/2024/q2

module load frameworks_2024_5_v2

 Instructions on how to use modules with Q2 build are anl_24_q2_release/README

  • The release includes :
    • Megatron-DeepSpeed 0.14.2 (with patch)
    • Intel® Extension for PyTorch* v2.1.30+xpu
    • TorchCCL 2.1.300
    • ONEAPI 2024.1.0.596.PUBLIC_IDP_2024.1.0_723
    • Agama driver: 803.29
  • The release provides following key features:
    • Scaleup Performance improvement from the TorchCCl prototype feature enabled by TORCH_LLM_ALLREDUCE=1  details
    • Auto TP inference support for more workloads
    • Flash Attention V2 improvement for 256 head dimension support; MiCS support.
    • Latest Features and Optimizations from DeepSpeed 0.14.2 and Intel® Extension for PyTorch* 2.1.30.

Thanks, 
Jerome

📸 flash 🤝 📷 no-flash

With this new release, Intel observed that the loss curves agreed exactly for flash / no-flash, using the learning rate settings below:

lr: 0.00015
lr_warmup_frac: 0.01
lr_decay_iters: 320000

Testing with Jerome’s new release:

module use -a /home/jmitche1/anl_release/2024/q2
module load frameworks_2024_5_v2

I was able to independently confirm these results, shown in 📸 flash 🤝 📷 no-flash below.

🔗 wandb links:

📸 flash vs. 📷 no-flash

`flash` 📸 🤝 📷
no-`flash`

🚧 Broken MPI1

For whatever reason, things seemed to have spontaneously broken on the night of 2024-04-14 ??

When trying to run experiments the following day (05/15/2024) I was met with this2:

Abort(15): Fatal error in internal_Init_thread: Other MPI error

which was discussed further in this thread on slack.

It seems Subrata also encountered a similar issue [see: slack thread]

mpi4py fix

To resolve this

Abort(15): Fatal error in internal_Init_thread: Other MPI error

issue we can simply load the correct modules:

module use -a /home/jmitche1/anl_release/2024/q2
module load frameworks_2024_5_v2
module use /home/ftartagl/graphics-compute-runtime/modulefiles
module load graphics-compute-runtime/agama-ci-devel-803.29
module load spack-pe-gcc/0.6.1-23.275.2 gcc/12.2.0
module use /soft/preview-modulefiles/24.086.0
module load oneapi/release/2024.04.15.001

For full details see mpi4py-reproducer, and this [slack thread].

🕵🏻‍ Framework Comparison

As I was re-building MPI, and after talking to Jerome, I realized that most of the dependencies are already present in the provided frameworks/ modules on Sunspot.

As a simple test, I tried building a new environment built on the base conda environment3 provided by theframeworks/2023.12.15.001 module, which worked without modification and had ) most of what I needed already installed:

>>> import torch
>>> torch.__version__
'2.1.0a0+cxx11.abi'
>>> import intel_extension_for_pytorch as ipex
>>> ipex.__version__
'2.1.10+xpu'
>>> from mpi4py import MPI

The remaining dependencies were installed according to the instructions from the new release frameworks_2024_5_v2.

Details included below.

📦 pip Install Dependencies

Unfortunately, the frameworks/** don’t appear to provide DeepSpeed.

We can create a virtual environment on top of the base conda by

$ module use frameworks/2023.12.15.001
$ export PBS_O_WORKDIR=$(pwd) ; source ALCF/helpers.sh && setup_venv_from_conda

Once the venv has been created and activated, we can install the remaining dependencies:

To build / install DeepSpeed, along with its required dependencies:

  • intel-extension-for-deepspeed:

    python3 -m pip install intel_extension_for_pytorch_deepspeed\=\=2.1.30 -f "https://pytorch-extension.intel.com/release-whl/stable/xpu/us/intel-extension-for-pytorch-deepspeed/"
  • DeepSpeed:

    echo "build deepspeed"
    git clone https://github.com/microsoft/DeepSpeed.git
    cd DeepSpeed
    git remote add yizhou_ds https://github.com/YizhouZ/DeepSpeed.git
    git fetch yizhou_ds
    git checkout yizhou/kernel_path
    pip install -r requirements/requirements.txt
    python setup.py develop |& tee build.log
  • Extras:

    python3 -m pip install transformers datasets python-etcd tensorboardX packaging sentencepiece bitsandbytes tiktoken neural-speed einops intel-extension-for-transformers

Looking around the available modules a bit, I noticed a newer frameworks release (frameworks/2024.04.15.002) that had a newer version of both torch and ipex:

module use /soft/preview-modulefiles/24.086.0
module load frameworks/2024.04.15.002.lua
python3 -c 'from mpi4py import MPI; print(MPI.__file__)'
# /soft/datascience/aurora_nre_models_frameworks-2024.1_preview_u1/lib/python3.9/site-packages/mpi4py/MPI.cpython-39-x86_64-linux-gnu.so
>>> import torch
>>> torch.__version__
'2.1.0.post2+cxx11.abi'
>>> import intel_extension_for_pytorch as ipex
>>> ipex.__version__
'2.1.30+xpu'
>>> from mpi4py import MPI; print(MPI.__file__)
/soft/datascience/aurora_nre_models_frameworks-2024.1_preview_u1/lib/python3.9/site-packages/mpi4py/MPI.cpython-39-x86_64-linux-gnu.so

The remaining dependencies were installed identically to what was just done previously for the frameworks/2023.12.15.001 module.

NOTE: In the figures below, we denote these two environments as:

  • 2024.0:
    • module load frameworks/2023.12.15.001
  • 2024.1:
    • module use /soft/preview-modulefiles/24.086.0
    • module load frameworks/2024.04.15.002.lua
  • anl_24_q2_release:
    • eval "$(~/miniconda3/bin/conda shell.zsh hook)"
    • conda activate anl_24_q2_release

🥸 Fix in Disguise

Armed now with functional environment(s) for argonne-lcf/Megatron-DeepSpeed, I was able to resume my previous experiments.

From the discussion with Intel, it was hard to understand / reason about why the flash-attn fix would have any dependence on the learning rate schedule (warmup + decay).

If the flash-attn fix works for a particular learning rate schedule, you would reasonably expect that it should work for any learning rate schedule.

An additional source of confusion for me was that the discrepancy in the loss curves (seemingly) disappeared when using the learning rate settings provided by Intel4, but not when using the ALCF defaults5.

After thinking about it for a bit and trying to reason about possible causes, I wondered if it might not be a mix of multiple different factors:

  1. Small learning rate
  2. Very long decay
  3. [maybe ?] somehow dependent on the learning rate warmup fraction
    1. preliminary experiments seemed to suggest this was not the case

So, I was curious what would happen if I used the (larger) learning rate value from the ALCF defaults (lr=0.003) with the very long lr-decay-iters: 320000 from Intel.

These results are shown below.

In particular, for all three experiments the following learning rate settings were used:

lr: 0.0003
lr-warmup-frac: 0.05
lr-decay-iters: 320000
flash-attn-disguise-decay10000-1

Looking at this figure ^, it appears that up until the very very end, all three loss curves agree identically.

However, if we look closely at the very end, it looks like there might be a slight difference beginning to appear between the 2024.0 (brown line) and {anl_24_q2_release, 2024.1} ({dark, light} blue lines, respectively).

Thinking that I might be onto something, I then tried again with a smaller lr-decay-iters: 5000.

This result is shown below:

flash-attn-disguise-decay5000

In particular, we can now more clearly see the difference beginning to appear between the 2024.0 and 2024.1 loss curves.

Continuing on, we see this effect become increasingly dramatic with even smaller values of lr-decay-iters:

flash-attn-disguise-decay-2000 flash-attn-disguise-decay1500 flash-attn-disguise-decay1000-1

In each of these experiments, it appears that:

  • 2024.0:
    • Not impacted by this lr-decay-iters dependence
    • Continue to decrease for the duration of training
  • 2024.1:
    • Impacted by the lr-decay-iters dependence
    • Plateaus towards the end of training
Older Figsdisguised-fix-2disguised-fix-1

2024.0 Fix

Everything seems to work with

module load frameworks/2023.12.15.001

📊 lr-decay-iters Comparison

  • 2024.0:
  • 2024.1:

📈 lr-decay-iters dependence

🏎️ Performance Improvement in 2024.1

lr: 0.0003
lr-warmup-frac: 0.05
lr-decay-iters: null

Footnotes

  1. Gremlins, likely

  2. https://github.com/pmodels/mpich/pull/7001

  3. Explicitly, aurora_nre_models_frameworks-2024.0, abbreviated as 2024.0

  4. Intel used the following learning rate schedule in their experiments yml lr: 0.00015 lr-warmup-frac: 0.01 lr-decay-iters: 320000

  5. ALCF used the following learning rate schedule in their experiments

NORMAL  main  sam.onl/ /posts/AuroraGPT/flash-attn-sunspot/ · Top 1:1