
🚂 Loooooooong Sequence Lengths

Sam Foreman 2024-02-12

Figure 1: This work was done as part of the DeepSpeed4Science project, in collaboration with Microsoft.

The new Megatron-DeepSpeed release contains a variety of improvements and optimizations that enable pre-training Transformer-based architectures with significantly longer sequences than was previously possible.

📓 Note:

Additional details can be found in the 📁 DeepSpeed4Science folder.

DeepSpeed4Science (09/2023)

New Features

  • Enabled Megatron-LM’s sequence parallel.

  • Enabled rotary positional embedding.

  • Enabled FlashAttention v1 and v2.

  • Enabled new fused kernels from NVIDIA.

New optimizations

  • Enabled attention map memory optimization, where we first generate the attention mask in CPU memory and then move it into GPU memory, to avoid out-of-memory errors when training with very long sequences.

  • Enabled position embedding partitioning, where we split the position encoding weights across all GPUs when sequence parallelism is enabled, to further reduce the memory footprint.
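
A rough estimate helps show why the attention mask becomes a memory problem at long sequence lengths. The sketch below assumes a dense boolean mask of shape (seq_len, seq_len) stored at one byte per element; the actual layout used in Megatron-DeepSpeed may differ, and `attention_mask_gib` is a name made up for this illustration.

```python
def attention_mask_gib(seq_len: int, bytes_per_element: int = 1) -> float:
    """Size in GiB of a dense (seq_len x seq_len) attention mask.

    Assumption: one byte per element (e.g. a boolean mask), no batching.
    """
    return seq_len * seq_len * bytes_per_element / 2**30


for seq_len_k in (2, 32, 128, 256):
    seq_len = seq_len_k * 1000
    print(f"{seq_len_k:>4}k tokens -> {attention_mask_gib(seq_len):8.2f} GiB")
```

Under these assumptions the mask alone takes about 15 GiB at 128k tokens and about 61 GiB at 256k tokens, which is more than the 40 GB available on a single A100-40GB.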

Initial Results

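Table 1: Long sequence length support¹ from microsoft/Megatron-DeepSpeed

| Sequence Length | Old Megatron-DeepSpeed (TFLOPS) | New Megatron-DeepSpeed (TFLOPS) |
|-----------------|---------------------------------|---------------------------------|
| 2k              | 25                              | 68                              |
| 4k              | 28                              | 80                              |
| 8k              | OOM                             | 86                              |
| 16k             | OOM                             | 92                              |
| 32k             | OOM                             | 100                             |
| 64k             | OOM                             | 106                             |
| 128k            | OOM                             | 119                             |
| 256k            | OOM                             | 94                              |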
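
To make the comparison in Table 1 concrete, the following sketch computes the throughput ratio between the new and old implementations wherever the old one did not run out of memory. The numbers are copied from the table; the `speedup` helper is a name introduced here for illustration.

```python
# TFLOPS from Table 1 as (old, new); None marks an out-of-memory (OOM) run.
table1 = {
    "2k": (25, 68),
    "4k": (28, 80),
    "8k": (None, 86),
    "16k": (None, 92),
    "32k": (None, 100),
    "64k": (None, 106),
    "128k": (None, 119),
    "256k": (None, 94),
}


def speedup(old, new):
    """Return new / old, or None when the old implementation was OOM."""
    return None if old is None else new / old


for seq_len, (old, new) in table1.items():
    ratio = speedup(old, new)
    label = "n/a (old: OOM)" if ratio is None else f"{ratio:.2f}x"
    print(f"{seq_len:>5}: {label}")
```

Only the 2k and 4k rows have a measurable ratio (2.72x and about 2.86x); for every longer sequence the old implementation could not run at all.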
Data
import numpy as np

gpus = ('32', '64', '128')

colors = {
    'Old Megatron-DS': '#FF5252',
    'Megatron-LM': '#76b900',
    'New Megatron-DS':  '#1A8FFF',
}

data = {
    '25B': {
        'Old Megatron-DS': np.array([36, 42, 42]),
        'Megatron-LM': np.array([26, 48, 52]),
        'New Megatron-DS': np.array([192, 448, 512]),
    },
    '33B': {
        'Old Megatron-DS': np.array([28, 32, 32]),
        'Megatron-LM': np.array([14, 46, 52]),
        'New Megatron-DS': np.array([128, 384, 448]),
    },
}
Make the plots
import os
from pathlib import Path

import matplotlib.pyplot as plt

x = np.arange(len(gpus))  # one group of bars per GPU count
width = 0.25  # thickness of each horizontal bar
multiplier = 0

# figures are written to ./assets
outdir = Path(os.getcwd()).joinpath('assets')
outdir.mkdir(exist_ok=True, parents=True)

# one figure per model size
for model_size, d in data.items():
    multiplier = 0
    fig, ax = plt.subplots(figsize=(7.5, 4))
    for label, value in d.items():
        offset = width * multiplier
        rects = ax.barh(
          x + offset,
          value,
          width,
          label=label,
          color=colors[label],
          alpha=0.8
        )
        ax.bar_label(
          rects,
          padding=3,
          color=colors[label],
          family='monospace',
          weight='bold'
        )
        multiplier += 1
    ax.set_ylabel(
        'GPUs',
        fontsize=18,
        family='sans-serif',
        loc='center',
    )
    ax.set_yticks(x + width, gpus)
    plt.figtext(
        0.005, 0.93, f"{model_size}", fontsize=24, fontweight='bold', ha='left'
    )
    ax.set_xlabel(
        'Sequence Length (k)', fontsize=18, loc='center'
    )
    ax.legend(
        bbox_to_anchor=(0.005, 1.04, 0.99, .098),
        alignment='center',
        edgecolor="#83838320",
        frameon=True,
        ncols=3,
        fontsize=13,
        mode="expand",
        borderaxespad=0.01
    )
    # NOTE: `save_figure` is a helper that is not defined in this post
    save_figure(fname=f'{model_size}', outdir=outdir)
    _ = plt.show()

GPT-25B Model

GPT-33B Model

Figure 2: Pre-training with long sequence support across different model sizes and numbers of GPUs. In each case, the new (current) implementation supports significantly longer sequences than both NVIDIA/Megatron-LM and our previous implementation.
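
The size of the gap in Figure 2 can be quantified directly from the plotted data. The sketch below assumes that each value is the sequence length (in thousands of tokens) reached on 32, 64 and 128 GPUs, as the x-axis label suggests; the `improvement` helper is a name introduced for illustration, and plain lists are used in place of NumPy arrays to keep it dependency-free.

```python
gpus = (32, 64, 128)

# Values copied from the plotting data (sequence length in thousands of tokens).
data = {
    "25B": {
        "Old Megatron-DS": [36, 42, 42],
        "Megatron-LM": [26, 48, 52],
        "New Megatron-DS": [192, 448, 512],
    },
    "33B": {
        "Old Megatron-DS": [28, 32, 32],
        "Megatron-LM": [14, 46, 52],
        "New Megatron-DS": [128, 384, 448],
    },
}


def improvement(results, baseline, new="New Megatron-DS"):
    """Ratio of the new implementation to a baseline, per GPU count."""
    return [n / b for n, b in zip(results[new], results[baseline])]


for model_size, results in data.items():
    for baseline in ("Old Megatron-DS", "Megatron-LM"):
        ratios = improvement(results, baseline)
        summary = ", ".join(f"{g} GPUs: {r:.1f}x" for g, r in zip(gpus, ratios))
        print(f"{model_size} vs {baseline}: {summary}")
```

For example, on 64 GPUs the 33B model goes from 32k to 384k relative to the old implementation, a factor of 12.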

Installation

Using install.sh

Important
To install, simply:

git clone https://github.com/ramanathanlab/GenSLM/
cd GenSLM/examples/long-sequences/
./install.sh

Explicitly, ./install.sh will:

  1. Automatically create a virtual environment on top of the latest conda module
  2. Install (+ update²) / build all the required dependencies into this virtual environment

Step-by-Step

For completeness, we describe below the steps for installing and building each of the dependencies.

  1. Clone GitHub repo:

    git clone https://github.com/ramanathanlab/GenSLM
  2. Load conda module:

    • ThetaGPU:

      # ThetaGPU:
      if [[ "$(hostname)" == theta* ]]; then
          export MACHINE="ThetaGPU"
          export CONDA_DATE="2023-01-11"
          module load conda/2023-01-11
          conda activate base
      fi
    • Polaris:

      # Polaris:
      if [[ "$(hostname)" == x3* ]]; then
          export MACHINE="Polaris"
          export CONDA_DATE="2023-01-10"
          module load conda/2023-01-10-unstable
          conda activate base
      fi
  3. Set up the virtual environment³:

    cd ./GenSLM/examples/long-sequences
    # create a new virtual environment
    mkdir -p "venvs/${MACHINE}/${CONDA_DATE}"
    python3 -m venv "venvs/${MACHINE}/${CONDA_DATE}" --system-site-packages
    source "venvs/${MACHINE}/${CONDA_DATE}/bin/activate"
  4. Create a new folder (GenSLM/examples/long-sequences/deps/${MACHINE}) where we’ll install dependencies locally:

    mkdir -p "deps/${MACHINE}"
    cd "deps/${MACHINE}"
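
The machine detection and directory layout from steps 2 to 4 can be summarized in a few lines of Python. This is only an illustrative sketch: `detect_machine` and `venv_dir` are hypothetical helpers, the conda dates are taken from the `module load` lines above, and the hostnames in the usage example are made up.

```python
import fnmatch
from pathlib import Path

# (hostname pattern, machine name, conda module date), as in the shell snippets.
MACHINES = [
    ("theta*", "ThetaGPU", "2023-01-11"),
    ("x3*", "Polaris", "2023-01-10"),
]


def detect_machine(hostname: str) -> tuple[str, str]:
    """Map a hostname onto (MACHINE, CONDA_DATE) using glob-style patterns."""
    for pattern, machine, conda_date in MACHINES:
        if fnmatch.fnmatchcase(hostname, pattern):
            return machine, conda_date
    raise ValueError(f"Unrecognized hostname: {hostname!r}")


def venv_dir(hostname: str, root: Path = Path("venvs")) -> Path:
    """Virtual environment location: venvs/${MACHINE}/${CONDA_DATE}."""
    machine, conda_date = detect_machine(hostname)
    return root / machine / conda_date


# Made-up hostnames, for illustration only.
print(venv_dir("x3005c0s13b0n0"))
print(venv_dir("thetagpu06"))
```

Note that the shell test must compare the hostname against an unquoted pattern (`[[ "$(hostname)" == x3* ]]`); a single quoted string such as `"$(hostname)==x3*"` is always non-empty and therefore always true.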

Dependencies

We provide below the details needed to install each of the required dependencies.

  1. saforem2/ezpz:

    pip install -e "git+https://github.com/saforem2/ezpz.git#egg=ezpz"
  2. Microsoft/DeepSpeed:

    git clone https://github.com/microsoft/DeepSpeed.git
    cd DeepSpeed
    python3 -m pip install -e .
  3. Microsoft/Megatron-DeepSpeed:

    git clone https://github.com/microsoft/Megatron-DeepSpeed.git
  4. NVIDIA/apex:

    git clone https://github.com/NVIDIA/apex
    cd apex
    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" -e ./
  5. pybind/PyBind11:

    pip install pybind11
  6. Dao-AILab/flash-attention:

    Flash Attention
    • The new release supports three different implementations of FlashAttention: (v1.0.4, v2.x, triton)
    • FlashAttention v2.x may have numerical instability issues. For the best performance, we recommend using FlashAttention + Triton
    • v1.0.4:

      python3 -m pip install flash-attn==1.0.4
    • v2.x:

      git clone https://github.com/Dao-AILab/flash-attention
      cd flash-attention
      python3 setup.py install
    • openai/triton:

      git clone -b legacy-backend https://github.com/openai/triton
      cd triton/python
      python3 -m pip install cmake
      python3 -m pip install .
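
After installing, it can be useful to check which of the optional attention packages are actually importable in the active environment. The sketch below uses only the standard library; `available_attention_backends` and `package_version` are hypothetical helper names, and it assumes the packages install under the import names `flash_attn` and `triton`.

```python
from importlib import metadata
from importlib.util import find_spec


def available_attention_backends() -> dict[str, bool]:
    """Report which optional attention packages can be imported."""
    return {
        "flash_attn": find_spec("flash_attn") is not None,
        "triton": find_spec("triton") is not None,
    }


def package_version(dist_name: str):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


for name, found in available_attention_backends().items():
    status = "found" if found else "missing"
    print(f"{name:>10}: {status} (version: {package_version(name)})")
```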

Running

The ALCF/ directory contains shell scripts for setting up the environment and specifying the options to be used when launching.

Various options can be specified dynamically at runtime by setting them in your environment, e.g.:

MODEL_SIZE_KEY="GPT25B" SEQ_LEN=128000 USE_FLASH_ATTN=1 MICRO_BATCH=1 GAS=1 SP_TYPE="megatron" ZERO_STAGE=1 ./ALCF/train-gpt3.sh

Explicitly:

  • ALCF/train-gpt3.sh: Main entry point for training
    • This script will automatically source the rest of the required ALCF/*.sh scripts below
  • ALCF/models.sh: Contains some example model architectures for GPT3-style models
  • ALCF/args.sh: Logic for parsing / setting up runtime options for Megatron and DeepSpeed
  • ALCF/setup.sh: Locate and activate virtual environment to be used, ensure MPI variables are set properly
  • ALCF/launch.sh: Identify available resources and build the command to be executed
    • i.e. figure out how many: {nodes, GPUs per node, GPUs total}, to pass to mpi{run,exec}
    • then, use this to build mpiexec <mpiexec-args> python3 pretrain_gpt.py
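
The two ideas above, environment variables overriding defaults and a launch command built from the available resources, can be sketched as follows. This is not the logic of the ALCF/*.sh scripts themselves: the default values are placeholders, `read_options` and `build_launch_cmd` are hypothetical names, and the `-n` / `--ppn` flags are an assumption that depends on the MPI launcher in use.

```python
import os
import shlex

# Placeholder defaults for illustration; the real ones live in the ALCF/*.sh scripts.
DEFAULTS = {
    "MODEL_SIZE_KEY": "GPT25B",
    "SEQ_LEN": "2048",
    "USE_FLASH_ATTN": "1",
    "MICRO_BATCH": "1",
    "GAS": "1",
    "SP_TYPE": "megatron",
    "ZERO_STAGE": "1",
}


def read_options(env=None) -> dict[str, str]:
    """Take each option from the environment, falling back to its default."""
    env = os.environ if env is None else env
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


def build_launch_cmd(num_nodes: int, gpus_per_node: int,
                     script: str = "pretrain_gpt.py") -> list[str]:
    """Build `mpiexec <mpiexec-args> python3 <script>` with one rank per GPU."""
    total_gpus = num_nodes * gpus_per_node
    # Assumed flags: -n = total ranks, --ppn = ranks per node.
    return ["mpiexec", "-n", str(total_gpus), "--ppn", str(gpus_per_node),
            "python3", script]


options = read_options({"SEQ_LEN": "128000", "MODEL_SIZE_KEY": "GPT25B"})
print(options["SEQ_LEN"])
print(shlex.join(build_launch_cmd(num_nodes=4, gpus_per_node=8)))
```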

ZeRO Offloading

🚀 W&B Report: Looooooooong Sequences

These newly introduced optimizations, in combination with ZeRO-Offload, allow us to go even further.

By employing ZeRO-Offload, we free up additional GPU memory, which can be used for even longer sequences.

Though work is still ongoing, this is a promising direction that will allow us to consider significantly larger genomes than previously possible.
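
For readers unfamiliar with ZeRO-Offload, the fragment below shows roughly what enabling optimizer offloading looks like in a DeepSpeed configuration. It is a minimal sketch, not the configuration used for these experiments: the stage and `pin_memory` values are assumptions, a complete config needs further entries (batch sizes, precision, and so on), and `zero_offload_config` is a hypothetical helper.

```python
import json


def zero_offload_config(stage: int = 1) -> dict:
    """Minimal DeepSpeed config fragment that offloads optimizer state to CPU."""
    return {
        "zero_optimization": {
            "stage": stage,  # assumed value; matches ZERO_STAGE=1 in the example command
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,  # assumed; pinned memory speeds up host/device copies
            },
        }
    }


print(json.dumps(zero_offload_config(), indent=2))
```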

We use Weights & Biases to track these experiments, and have aggregated our initial results in the W&B Report below.

We can evaluate the performance of our model by looking at two different metrics for throughput: samples_per_sec and TFLOPS.

Explicitly, we see that we are able to scale up to significantly longer sequences (420k / 128k ≈ 3.3x) while retaining roughly 77% of the throughput (81 / 105 ≈ 77%)⁴.

Table 2: Impact on TFLOPS as a function of increasing sequence length. Table from: throughput/TFLOPS

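| Name   | Sequence Length (k) | (seq_len / min_seq_len) | TFLOPS   | TFLOPS (% of peak) |
|--------|---------------------|-------------------------|----------|--------------------|
| GPT25B | 420                 | 3.28125                 | 81.77225 | 77.867             |
| GPT25B | 400                 | 3.125                   | 90.62    | 86.297             |
| GPT25B | 360                 | 2.8125                  | 81.6325  | 77.7348            |
| GPT25B | 360                 | 2.8125                  | 82.6824  | 78.7346            |
| GPT25B | 192                 | 1.5                     | 115.8228 | 110.2927           |
| GPT25B | 128                 | 1                       | 106.672  | 101.5788           |
| GPT25B | 128                 | 1                       | 105.014  | 100.00             |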
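
The two derived columns in Table 2 follow from simple ratios against the 128k baseline run. The sketch below reproduces them for the longest run; it assumes that the "% of peak" column is measured relative to the 128k run at 105.014 TFLOPS, which is consistent with the tabulated values, and `relative_to_baseline` is a name introduced for illustration.

```python
BASELINE_SEQ_LEN_K = 128    # shortest sequence length in Table 2
BASELINE_TFLOPS = 105.014   # throughput of the run listed at 100%


def relative_to_baseline(seq_len_k: float, tflops: float) -> tuple[float, float]:
    """Return (seq_len / min_seq_len, TFLOPS as a percentage of the baseline)."""
    return seq_len_k / BASELINE_SEQ_LEN_K, 100 * tflops / BASELINE_TFLOPS


ratio, pct = relative_to_baseline(420, 81.77225)
print(f"{ratio:.2f}x longer sequences at {pct:.1f}% of baseline TFLOPS")
```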

<iframe src="https://wandb.ai/l2hmc-qcd/Megatron-DS-Benchmarking/reports/Looooooong-Sequences--Vmlldzo1MzI2NjA1" style="border:none;height:1024px;width:100%">

</iframe>

Figure 3: Weights & Biases Report

Footnotes

  1. The described experiments were performed on 4 NVIDIA DGX A100-40GB nodes, all using TPSIZE=32, connected through 8 HDR InfiniBand (200Gb/s per HDR).

  2. deepspeed-0.10.3, pytorch==2.0.0+cu118

  3. Where "${MACHINE}" ∈ {"ThetaGPU", "Polaris"} and "${CONDA_DATE}" ∈ {"2023-01-10", "2023-01-11"}

  4. throughput/TFLOPS