
⏱️ Comparing Launchers on Aurora

torchrun

  • Comparing torchrun
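
    For context on what the torchrun side of the comparison might look like, here is a minimal sketch of launching the same torchtitan module with `torchrun` instead of `ezpz launch`. This is an assumed invocation chosen to mirror the 2-rank ezpz run below, not a command taken from this experiment.

    ```bash
    # Hypothetical torchrun equivalent of the 2-rank `ezpz launch` run below
    # (assumed flags; single node, 2 ranks -- a multi-node run would need
    # --nnodes plus an --rdzv_endpoint instead of --standalone).
    torchrun \
      --standalone \
      --nproc_per_node=2 \
      -m torchtitan.experiments.ezpz.train \
      --module ezpz.agpt \
      --config ezpz_agpt_debugmodel
    ```

    Note that, unlike the `mpiexec` command that `ezpz launch` builds below, `torchrun` does not set per-rank CPU affinity (there is no `--cpu-bind` list), so any binding on Aurora would have to be handled separately (e.g. with `numactl` or a wrapper script).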

ezpz

  • Got through 190 steps in ~49s before I hit the kill switch (the ^C at the end of the log below)

    #[aurora_frameworks-2025.3.1](torchtitan-aurora_frameworks-2025.3.1)[38s]
    #[02/28/26,14:29:02][x4020c0s0b0n0][/f/A/f/p/s/torchtitan][ezpz][?]
    ;  ezpz launch -n 2 -np 2 -nh 1 -- python3 -m torchtitan.experiments.ezpz.train --module ezpz.agpt --config ezpz_agpt_debugmodel
    
    
    [2026-02-28 14:29:11,095723][I][ezpz/launch:421:launch] ----[🍋 ezpz.launch][started][2026-02-28-142911]----
    [2026-02-28 14:29:11,101835][I][ezpz/launch:66:_log_json_log_file] Logs available at: /lus/flare/projects/AuroraGPT/foremans/projects/saforem2/torchtitan/logs/torchtitan.experiments.ezpz.train/2026-02-28-142910-rank0.jsonl
    [2026-02-28 14:29:13,015798][I][ezpz/launch:442:launch] Job ID: 8359703
    [2026-02-28 14:29:13,016610][I][ezpz/launch:443:launch] nodelist: ['x4020c0s0b0n0', 'x4020c0s1b0n0']
    [2026-02-28 14:29:13,017029][I][ezpz/launch:444:launch] hostfile: /var/spool/pbs/aux/8359703.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    [2026-02-28 14:29:13,017705][I][ezpz/pbs:273:get_pbs_launch_cmd] ⚠️ Using [2/24] GPUs [1 hosts] x [2 GPU/host]
    [2026-02-28 14:29:13,018247][W][ezpz/pbs:279:get_pbs_launch_cmd] [🚧 WARNING] Using only [2/24] available GPUs!!
    [2026-02-28 14:29:13,019057][I][ezpz/launch:391:build_executable] Building command to execute by piecing together:
    [2026-02-28 14:29:13,019475][I][ezpz/launch:392:build_executable] (1.) launch_cmd: mpiexec --envall --np=2 --ppn=2 --hostfile=/var/spool/pbs/aux/8359703.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    [2026-02-28 14:29:13,020124][I][ezpz/launch:393:build_executable] (2.) cmd_to_launch: python3 -m torchtitan.experiments.ezpz.train --module ezpz.agpt --config ezpz_agpt_debugmodel
    [2026-02-28 14:29:13,020780][I][ezpz/launch:460:launch] Took: 2.54 seconds to build command.
    [2026-02-28 14:29:13,021143][I][ezpz/launch:463:launch] Executing:
    mpiexec
    --envall
    --np=2
    --ppn=2
    --hostfile=/var/spool/pbs/aux/8359703.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
    --no-vni
    --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96
    python3
    -m
    torchtitan.experiments.ezpz.train
    --module
    ezpz.agpt
    --config
    ezpz_agpt_debugmodel
    [2026-02-28 14:29:13,022468][I][ezpz/launch:242:get_aurora_filters] Filtering for Aurora-specific messages. To view list of filters, run with EZPZ_LOG_LEVEL=DEBUG
    [2026-02-28 14:29:13,022990][I][ezpz/launch:470:launch] Execution started @ 2026-02-28-142913...
    [2026-02-28 14:29:13,023408][I][ezpz/launch:160:run_command] Caught 24 filters
    [2026-02-28 14:29:13,023770][I][ezpz/launch:161:run_command] Running command:
    mpiexec --envall --np=2 --ppn=2 --hostfile=/var/spool/pbs/aux/8359703.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov --no-vni --cpu-bind=verbose,list:2-4:10-12:18-20:26-28:34-36:42-44:54-56:62-64:70-72:78-80:86-88:94-96 python3 -m torchtitan.experiments.ezpz.train --module ezpz.agpt --config ezpz_agpt_debugmodel
    cpubind:list x4020c0s0b0n0 pid 46486 rank 0 0: mask 0x1c
    cpubind:list x4020c0s0b0n0 pid 46487 rank 1 1: mask 0x1c00
    Using [2 / 24] available "xpu" devices !!
    [num_nodes_from_hostfile=2]vs.[1] (wsa=2 // gpus_per_node=12)
    [titan] 2026-02-28 14:29:17,872 - root - INFO - torchtitan version: 0.0.0+unknown (0.0.0 means __version__ is not defined correctly).
    [2026-02-28 14:29:20,883348][W][distributed/utils:298:init_distributed] torch.distributed is already initialized. Skipping init_distributed. The provided comm_config and other settings will not take effect.
    [2026-02-28 14:29:20,885358][I][distributed/parallel_dims:131:build_mesh] Building device mesh with parallelism: pp=1, dp_replicate=1, dp_shard=2, cp=1, tp=1, ep=1, etp=1
    [2026-02-28 14:29:20,891663][I][distributed/parallel_dims:184:build_mesh] Successfully created meshes with active dimensions: ['batch', 'loss', 'fsdp', 'efsdp']
    [2026-02-28 14:29:20,892713][I][tools/utils:73:collect] [GC] Initial GC collection took 0.00 seconds
    2026:02:28-14:29:20:46486 |CCL_WARN| value of CCL_LOG_LEVEL changed to be error (default:warn)
    [2026-02-28 14:29:21,155062][I][components/tokenizer:117:_load_tokenizer_from_path] Loading tokenizer from tokenizer.json
    [2026-02-28 14:29:21,163560][I][hf_datasets/text_datasets:66:_validate_dataset] Preparing c4_test dataset from tests/assets/c4_test
    [2026-02-28 14:29:21,687267][I][ft/trainer:116:__init__] Building ezpz.agpt debugmodel with {
    "dim": 256,
    "n_layers": 6,
    "vocab_size": 2048,
    "norm_eps": 1e-05,
    "rope": {
        "dim": 16,
        "max_seq_len": 2048,
        "theta": 500000,
        "backend": "complex",
        "scaling": "llama",
        "scaling_factor": 8.0,
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_max_position_embeddings": 8192,
        "rope_factor": 1.0,
        "beta_fast": 32.0,
        "beta_slow": 1.0,
        "original_seq_len": 4096,
        "mscale": 0.0
    },
    "layer": {
        "norm_eps": 1e-05,
        "attention": {
        "n_heads": 16,
        "attn_backend": "sdpa",
        "attn_mask_type": "causal",
        "n_kv_heads": null,
        "head_dim": null,
        "qk_norm": false,
        "norm_eps": 1e-05,
        "bias": false,
        "use_rope": true,
        "rope_backend": "complex"
        },
        "feed_forward": {
        "hidden_dim": 768,
        "bias": false
        },
        "moe": null,
        "depth_init": true
    }
    }
    [2026-02-28 14:29:21,701601][I][components/metrics:98:build_device_memory_monitor] XPU capacity: Intel(R) Data Center GPU Max 1550 with 63.98GiB memory
    [2026-02-28 14:29:21,746679][I][ft/trainer:153:__init__] Model ezpz.agpt debugmodel size: 6,163,712 total parameters
    [2026-02-28 14:29:21,747548][I][distributed/activation_checkpoint:249:apply_ac] Applied selective activation checkpointing to the model
    [2026-02-28 14:29:21,763901][I][agpt/parallelize:126:parallelize_llama] Applied FSDP to the model
    [2026-02-28 14:29:22,109145][I][ft/trainer:268:__init__] Peak FLOPS used for computing MFU: 2.982e+14
    [2026-02-28 14:29:22,109976][I][ft/trainer:270:__init__] XPU memory usage for model: 0.02GiB(0.02%)
    [2026-02-28 14:29:22,110889][W][protocols/state_dict_adapter:96:__init__] model.safetensors.index.json not found at hf_assets_path: ./tests/assets/tokenizer/model.safetensors.index.json.                     Defaulting to saving a single safetensors file if checkpoint is saved in HF format
    [2026-02-28 14:29:22,113986][I][distributed/utils:248:maybe_enable_amp] Mixed precision training is handled by fully_shard
    [2026-02-28 14:29:22,114507][I][ft/trainer:364:__init__] Trainer is initialized with local batch size 8, global batch size 16, gradient accumulation steps 1, sequence length 2048, total steps 10000 (warmup 200)
    [2026-02-28 14:29:22,115081][I][ft/trainer:523:train] Training starts at step 1
    [2026-02-28 14:29:23,662786][I][components/metrics:515:log] step:  1  loss:  8.13834  grad_norm:  1.3149  memory:  8.29GiB(12.96%)  tps: 8,550  tflops: 0.61  mfu: 0.21%
    [2026-02-28 14:29:23,663958][I][distributed/utils:379:set_pg_timeouts] Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
    /lus/flare/projects/AuroraGPT/foremans/projects/saforem2/torchtitan/torchtitan/distributed/utils.py:395: UserWarning: Set timeout is now only supported for either nccl or gloo.
    torch.distributed.distributed_c10d._set_pg_timeout(timeout, group)
    /lus/flare/projects/AuroraGPT/foremans/projects/saforem2/torchtitan/torchtitan/distributed/utils.py:395: UserWarning: Set timeout is now only supported for either nccl or gloo.
    torch.distributed.distributed_c10d._set_pg_timeout(timeout, group)
    [2026-02-28 14:29:25,055428][I][components/metrics:515:log] step: 10  loss:  8.02783  grad_norm:  1.3883  memory:  8.30GiB(12.97%)  tps: 105,974  tflops: 7.59  mfu: 2.54%
    [2026-02-28 14:29:26,584187][I][components/metrics:515:log] step: 20  loss:  7.58133  grad_norm:  1.5400  memory:  8.30GiB(12.97%)  tps: 107,242  tflops: 7.68  mfu: 2.57%
    [2026-02-28 14:29:28,116840][I][components/metrics:515:log] step: 30  loss:  6.33049  grad_norm:  2.3625  memory:  8.30GiB(12.97%)  tps: 106,966  tflops: 7.66  mfu: 2.57%
    [2026-02-28 14:29:29,637456][I][components/metrics:515:log] step: 40  loss:  4.70528  grad_norm:  2.4628  memory:  8.30GiB(12.97%)  tps: 107,811  tflops: 7.72  mfu: 2.59%
    [2026-02-28 14:29:31,026418][I][tools/utils:73:collect] [GC] Performing periodic GC collection took 0.01 seconds
    [2026-02-28 14:29:31,177200][I][components/metrics:515:log] step: 50  loss:  3.81425  grad_norm:  1.8086  memory:  8.30GiB(12.97%)  tps: 106,472  tflops: 7.62  mfu: 2.56%
    [2026-02-28 14:29:32,701515][I][components/metrics:515:log] step: 60  loss:  3.30735  grad_norm:  0.9320  memory:  8.30GiB(12.97%)  tps: 107,548  tflops: 7.70  mfu: 2.58%
    [2026-02-28 14:29:34,235587][I][components/metrics:515:log] step: 70  loss:  3.01942  grad_norm:  0.5519  memory:  8.30GiB(12.97%)  tps: 106,864  tflops: 7.65  mfu: 2.57%
    [2026-02-28 14:29:35,795588][I][components/metrics:515:log] step: 80  loss:  2.96857  grad_norm:  0.3655  memory:  8.30GiB(12.97%)  tps: 105,089  tflops: 7.52  mfu: 2.52%
    [2026-02-28 14:29:37,355094][I][components/metrics:515:log] step: 90  loss:  2.83759  grad_norm:  0.4565  memory:  8.30GiB(12.97%)  tps: 105,120  tflops: 7.53  mfu: 2.52%
    [2026-02-28 14:29:38,736377][I][tools/utils:73:collect] [GC] Performing periodic GC collection took 0.01 seconds
    [2026-02-28 14:29:38,890440][I][components/metrics:515:log] step: 100  loss:  2.83428  grad_norm:  0.4375  memory:  8.30GiB(12.97%)  tps: 106,776  tflops: 7.64  mfu: 2.56%
    [2026-02-28 14:29:40,417778][I][components/metrics:515:log] step: 110  loss:  2.79788  grad_norm:  0.3403  memory:  8.30GiB(12.97%)  tps: 107,334  tflops: 7.68  mfu: 2.58%
    [2026-02-28 14:29:40,424763][W][hf_datasets/text_datasets:138:__iter__] Dataset c4_test is being re-looped
    [2026-02-28 14:29:41,939798][I][components/metrics:515:log] step: 120  loss:  2.93548  grad_norm:  0.4606  memory:  8.30GiB(12.97%)  tps: 107,710  tflops: 7.71  mfu: 2.59%
    [2026-02-28 14:29:43,461210][I][components/metrics:515:log] step: 130  loss:  2.82229  grad_norm:  0.3851  memory:  8.30GiB(12.97%)  tps: 107,753  tflops: 7.71  mfu: 2.59%
    [2026-02-28 14:29:44,983914][I][components/metrics:515:log] step: 140  loss:  2.71684  grad_norm:  0.3783  memory:  8.30GiB(12.97%)  tps: 107,665  tflops: 7.71  mfu: 2.58%
    [2026-02-28 14:29:46,352718][I][tools/utils:73:collect] [GC] Performing periodic GC collection took 0.01 seconds
    [2026-02-28 14:29:46,504471][I][components/metrics:515:log] step: 150  loss:  2.84511  grad_norm:  0.4313  memory:  8.30GiB(12.97%)  tps: 107,815  tflops: 7.72  mfu: 2.59%
    [2026-02-28 14:29:48,029670][I][components/metrics:515:log] step: 160  loss:  2.79779  grad_norm:  0.4913  memory:  8.30GiB(12.97%)  tps: 107,485  tflops: 7.69  mfu: 2.58%
    [2026-02-28 14:29:49,554232][I][components/metrics:515:log] step: 170  loss:  2.78644  grad_norm:  0.6984  memory:  8.30GiB(12.97%)  tps: 107,531  tflops: 7.70  mfu: 2.58%
    [2026-02-28 14:29:51,071728][I][components/metrics:515:log] step: 180  loss:  2.73290  grad_norm:  0.4409  memory:  8.30GiB(12.97%)  tps: 108,033  tflops: 7.73  mfu: 2.59%
    [2026-02-28 14:29:52,595778][I][components/metrics:515:log] step: 190  loss:  2.71296  grad_norm:  0.3844  memory:  8.30GiB(12.97%)  tps: 107,576  tflops: 7.70  mfu: 2.58%
    ^C
    Aborted!
    [1]    46313 exit 1     ezpz launch -n 2 -np 2 -nh 1 -- python3 -m torchtitan.experiments.ezpz.train
    took: 49s
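
As a rough sanity check on the numbers above, a small sketch that just re-derives values already visible in the log (local batch size 8, sequence length 2048, metric lines printed every 10 steps roughly 1.53s apart):

```bash
# Back-of-the-envelope check of the logged throughput; all inputs are read
# off the log above rather than measured independently.
awk 'BEGIN {
  tokens_per_step_per_rank = 8 * 2048   # local batch size x sequence length
  steps = 10; seconds = 1.53            # spacing of the metric lines
  printf "per-rank tps ~ %.0f\n", tokens_per_step_per_rank * steps / seconds
}'
# -> per-rank tps ~ 107085, in line with the logged tps column (~105-108k)
```

Reading the timestamps the same way: the `ezpz.launch` banner appears at 14:29:11 and step 1 is logged at 14:29:23, so roughly 12s goes to building and spawning the `mpiexec` command plus framework initialization, while the 190 training steps themselves take about 29s (14:29:23 to 14:29:52). That startup-versus-steady-state split seems like one natural axis for comparing the two launchers.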