
🐢 Starting Up Distributed Training on Aurora

Sam Foreman 2024-03-21

Tip
  • Request:

    Hi Sam and Corey,

    Thanks for your comments on measuring the application start up time last week.

    Typically, we report the throughput performance after the start-up and warm-up during the “steady” state of the training.

    We have a few follow-up questions so that we establish a methodology to address the issue brought up by Argonne.

    1. We can set a few timestamps in the model scripts and the job scripts used for queue submission:

      Job script:

      Timestamp A:  
      <actual python command using mpiexec>
      
      Inside the model script:  
      main()  
      Timestamp B:  
      [...]
      Timestamp C:  
      First training step and onwards.  

      By startup time, do you mean measuring the time difference between A and C, or between B and C?

    2. Will the measurement methodology be the same for distributed training?
      For example, can we measure the start-up time for rank 0?
    3. If we need to report the startup time for the DL applications, do we need to collect measurements using the actual Aurora NRE workloads or some small benchmarking test cases?

      For example, we can try to recreate the typical start-up scenarios, like library imports, and measure those separately as shown below.

      Job script:

      Timestamp A:
      <actual python command using mpiexec>
      
      Timestamp B:
      import torch
      Timestamp C:
      import IPEX
      Timestamp D:
      etc.

      If you have any other scenarios, please feel free to suggest.
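The A/B/C instrumentation proposed above can be sketched in a few lines. In this hypothetical sketch, `main()` is a stand-in for the real model script; timestamp A would be recorded in the job script itself, before `mpiexec` launches Python:

```python
import time

# Hypothetical stand-in for the model script: record timestamp B at the top of
# main() and timestamp C just before the first training step, then report B-to-C.
def main() -> float:
    t_b = time.time()  # Timestamp B: top of main()
    # [...] library imports / distributed initialization would happen here
    t_c = time.time()  # Timestamp C: just before the first training step
    return t_c - t_b

print(f"B to C: {main():.2f}s")
```

Measuring A-to-C instead would additionally capture the interpreter launch and launcher overhead between the job script's timestamp A and timestamp B.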

Response

  1. In Measuring / Calculating Startup Time, I provide a summary of how the startup time is identified and calculated.

  2. I’m not sure I understand exactly:

    Will the measurement methodology be the same for distributed training? For example, can we measure the start-up time for rank 0?

    The startup time is being measured for distributed training (logs are only created on RANK = 0).

  3. In Minimal Working Example, I walk through a minimal example that can be used to measure the startup times.

    • This uses ezpz, a library I’ve been working on that is designed to simplify the process of setting up / initializing distributed training across many GPUs.

Measuring / Calculating Startup Time

  • The startup timing was identified by parsing the logfiles from existing runs and calculating the difference $\delta t = t_{1} - t_{0}$, where:

    • $t_{0}$ is the timestamp at the very beginning of the shell script (defined here), which then launches

      mpiexec ${mpi-args} python3 [...]
      • $t_{0}$ appears in the logfile as:

        Job started at: 2023-11-02-183323 on x3004c0s13b0n0
    • $t_{1}$ is identified as the timestamp associated with the completion of the first training step

      • $t_{1}$ appears in the logfile as:

        [2023-11-02 18:34:13,122] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
  • Below is an example of the bash script used to parse the logfiles and identify these timestamps:

      $ for f in $(tail -5 logfiles) ; do echo $f; cat $f | grep -E "Job started|step=0\," | uniq ; echo "\n" ; done
      /lus/grand/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_actCkpt_GPT1T_4L_z1_seqlen2048_mp8_pp2_sp1_nl4_hs25600_gb16_mb1/logs/foremans-x3004c0s13b0n0-nhosts4-ngpu16-2023-11-02-183323.log
      Job started at: 2023-11-02-183323 on x3004c0s13b0n0
      [2023-11-02 18:34:13,122] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
    
      /lus/grand/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT125M_z0_seqlen2048_mp16_pp1_sp1_nl12_hs768_gb1_mb1/logs/foremans-x3015c0s37b0n0-nhosts4-ngpu16-2023-11-02-184240.log
      Job started at: 2023-11-02-184240 on x3015c0s37b0n0
    
      /lus/grand/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT125M_z0_seqlen2048_mp16_pp1_sp1_nl12_hs768_gb1_mb1/logs/foremans-x3015c0s37b0n0-nhosts4-ngpu16-2023-11-02-184259.log
      Job started at: 2023-11-02-184259 on x3015c0s37b0n0
      [2023-11-02 18:43:23,385] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
    
      /lus/grand/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_SP_actCkpt_GPT125M_z0_seqlen2048_mp16_pp1_sp1_nl12_hs768_gb1_mb1/logs/foremans-x3004c0s13b0n0-nhosts4-ngpu16-2023-11-02-184407.log
      Job started at: 2023-11-02-184407 on x3004c0s13b0n0
      [2023-11-02 18:44:32,804] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0, 0.0], mom=[(0.9, 0.999), (0.9, 0.999)]
    
      /lus/grand/projects/datascience/foremans/locations/polaris/projects/argonne-lcf/Megatron-DeepSpeed/outputs/gpt_actCkpt_GPT1T_4L_z1_seqlen2048_mp8_pp2_sp1_nl4_hs25600_gb16_mb2/logs/foremans-x3108c0s25b1n0-nhosts2-ngpu8-2023-11-02-192739.log
      Job started at: 2023-11-02-192739 on x3108c0s25b1n0
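The grep pipeline above can also be written in Python. This is a hypothetical helper (not part of the actual tooling) that assumes only the two log-line formats shown above:

```python
import re
from datetime import datetime

# Pull t0 from the "Job started at:" line and t1 from the first "step=0" line,
# then return delta_t = t1 - t0 in seconds.
def startup_time(logtext: str) -> float:
    t0_match = re.search(r"Job started at: (\d{4}-\d{2}-\d{2}-\d{6})", logtext)
    t1_match = re.search(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\].*step=0,", logtext)
    assert t0_match and t1_match, "log is missing one of the two markers"
    t0 = datetime.strptime(t0_match.group(1), "%Y-%m-%d-%H%M%S")
    t1 = datetime.strptime(t1_match.group(1), "%Y-%m-%d %H:%M:%S")
    return (t1 - t0).total_seconds()

log = """Job started at: 2023-11-02-183323 on x3004c0s13b0n0
[2023-11-02 18:34:13,122] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0"""
print(startup_time(log))  # 50.0
```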
Tip

Table 1: Startup times on Perlmutter

| logfile | model_size | world_size | start | stop | t0 | t1 | dt |
|---|---|---|---|---|---|---|---|
| foremans-nid008217-nhosts2-ngpu8-2023-10-05-191101.log | GPT1T_1L | 8 | 2023-10-05-191101 | 2023-10-05-191215 | 191101 | 191215 | 114 |
| foremans-nid008217-nhosts2-ngpu8-2023-10-05-191400.log | GPT1T_1L | 8 | 2023-10-05-191400 | 2023-10-05-191511 | 191400 | 191511 | 111 |
| foremans-nid008217-nhosts2-ngpu8-2023-10-05-191707.log | GPT1T_1L | 8 | 2023-10-05-191707 | 2023-10-05-191817 | 191707 | 191817 | 110 |
| foremans-nid008553-nhosts2-ngpu8-2023-10-15-114506.log | GPT1T_2L | 8 | 2023-10-15-114506 | 2023-10-15-114616 | 114506 | 114616 | 110 |
| foremans-nid008572-nhosts2-ngpu8-2023-10-15-133531.log | GPT2_7B | 8 | 2023-10-15-133531 | 2023-10-15-133745 | 133531 | 133745 | 214 |
| foremans-nid008572-nhosts2-ngpu8-2023-10-15-135041.log | GPT2_7B | 8 | 2023-10-15-135041 | 2023-10-15-135255 | 135041 | 135255 | 214 |
| foremans-nid008572-nhosts2-ngpu8-2023-10-15-140806.log | GPT2_7B | 8 | 2023-10-15-140806 | 2023-10-15-141236 | 140806 | 141236 | 430 |
| foremans-nid008572-nhosts2-ngpu8-2023-10-15-143120.log | GPT2_7B | 8 | 2023-10-15-143120 | 2023-10-15-143655 | 143120 | 143655 | 535 |
| foremans-nid008268-nhosts2-ngpu8-2023-10-15-154337.log | GPT2_7B | 8 | 2023-10-15-154337 | 2023-10-15-154446 | 154337 | 154446 | 109 |
| foremans-nid008268-nhosts2-ngpu8-2023-10-15-154943.log | GPT1T_1L | 8 | 2023-10-15-154943 | 2023-10-15-155317 | 154943 | 155317 | 374 |
| foremans-nid008268-nhosts2-ngpu8-2023-10-15-162315.log | GPT1T_1L | 8 | 2023-10-15-162315 | 2023-10-15-162441 | 162315 | 162441 | 126 |
| foremans-login12-nhosts2-ngpu8-2023-10-15-180714.log | GPT2_7B | 8 | 2023-10-15-180714 | 2023-10-15-180805 | 180714 | 180805 | 91 |
| foremans-login12-nhosts2-ngpu8-2023-10-15-181733.log | GPT2_7B | 8 | 2023-10-15-181733 | 2023-10-15-181834 | 181733 | 181834 | 101 |
| foremans-login12-nhosts2-ngpu8-2023-10-15-182228.log | GPT1T_1L | 8 | 2023-10-15-182228 | 2023-10-15-183031 | 182228 | 183031 | 803 |
| foremans-login12-nhosts2-ngpu8-2023-10-15-183345.log | GPT1T_2L | 8 | 2023-10-15-183345 | 2023-10-15-183750 | 183345 | 183750 | 405 |
| foremans-login12-nhosts2-ngpu8-2023-10-15-184442.log | GPT1T_2L | 8 | 2023-10-15-184442 | 2023-10-15-184727 | 184442 | 184727 | 285 |
| foremans-login12-nhosts2-ngpu8-2023-10-15-185952.log | GPT1T_1L | 8 | 2023-10-15-185952 | 2023-10-15-190046 | 185952 | 190046 | 4094 |
| foremans-nid008344-nhosts2-ngpu8-2023-10-15-191508.log | GPT2_7B | 8 | 2023-10-15-191508 | 2023-10-15-191608 | 191508 | 191608 | 100 |
| foremans-nid008344-nhosts2-ngpu8-2023-10-15-192404.log | GPT2_7B | 8 | 2023-10-15-192404 | 2023-10-15-192504 | 192404 | 192504 | 100 |
| foremans-nid008344-nhosts2-ngpu8-2023-10-15-193041.log | GPT2_7B | 8 | 2023-10-15-193041 | 2023-10-15-193137 | 193041 | 193137 | 96 |
| foremans-nid008344-nhosts2-ngpu8-2023-10-15-193448.log | GPT2_7B | 8 | 2023-10-15-193448 | 2023-10-15-193540 | 193448 | 193540 | 92 |
| foremans-login12-nhosts4-ngpu16-2023-10-15-195802.log | GPT1T_1L | 16 | 2023-10-15-195802 | 2023-10-15-195904 | 195802 | 195904 | 102 |
| foremans-login12-nhosts4-ngpu16-2023-10-15-200019.log | GPT2_7B | 16 | 2023-10-15-200019 | 2023-10-15-200258 | 200019 | 200258 | 239 |
| foremans-login12-nhosts4-ngpu16-2023-10-15-200902.log | GPT2_7B | 16 | 2023-10-15-200902 | 2023-10-15-201239 | 200902 | 201239 | 337 |
| foremans-login12-nhosts4-ngpu16-2023-10-15-201524.log | GPT2_7B | 16 | 2023-10-15-201524 | 2023-10-15-201612 | 201524 | 201612 | 88 |
| foremans-login12-nhosts4-ngpu16-2023-10-15-201834.log | GPT2_7B | 16 | 2023-10-15-201834 | 2023-10-15-201923 | 201834 | 201923 | 89 |
| foremans-login12-nhosts4-ngpu16-2023-10-15-202402.log | GPT2_7B | 16 | 2023-10-15-202402 | 2023-10-15-202501 | 202402 | 202501 | 99 |
| foremans-login12-nhosts4-ngpu16-2023-10-15-202606.log | GPT2_7B | 16 | 2023-10-15-202606 | 2023-10-15-202713 | 202606 | 202713 | 107 |
| foremans-nid008344-nhosts2-ngpu8-2023-10-16-084033.log | GPT1T_1L | 8 | 2023-10-16-084033 | 2023-10-16-084212 | 84033 | 84212 | 179 |
| foremans-nid008344-nhosts2-ngpu8-2023-10-16-084628.log | GPT1T_1L | 8 | 2023-10-16-084628 | 2023-10-16-084728 | 84628 | 84728 | 100 |
| foremans-nid008344-nhosts2-ngpu8-2023-10-16-085401.log | GPT1T_1L | 8 | 2023-10-16-085401 | 2023-10-16-085505 | 85401 | 85505 | 104 |
| foremans-nid008344-nhosts2-ngpu8-2023-10-16-090142.log | GPT1T_1L | 8 | 2023-10-16-090142 | 2023-10-16-090305 | 90142 | 90305 | 163 |
| foremans-nid008344-nhosts2-ngpu8-2023-10-16-093404.log | actCkpt_GPT13B | 8 | 2023-10-16-093404 | 2023-10-16-093504 | 93404 | 93504 | 100 |
| foremans-nid008572-nhosts4-ngpu16-2023-10-16-101437.log | GPT1T_1L | 16 | 2023-10-16-101437 | 2023-10-16-101549 | 101437 | 101549 | 112 |
| foremans-nid008396-nhosts4-ngpu16-2023-10-16-101512.log | GPT1T_1L | 16 | 2023-10-16-101512 | 2023-10-16-101615 | 101512 | 101615 | 103 |
| foremans-nid008396-nhosts4-ngpu16-2023-10-16-102217.log | actCkpt_GPT25B | 16 | 2023-10-16-102217 | 2023-10-16-102452 | 102217 | 102452 | 235 |
| foremans-nid008396-nhosts4-ngpu16-2023-10-16-102750.log | actCkpt_GPT25B | 16 | 2023-10-16-102750 | 2023-10-16-103243 | 102750 | 103243 | 493 |
| foremans-nid008572-nhosts4-ngpu16-2023-10-16-103113.log | actCkpt_GPT25B | 16 | 2023-10-16-103113 | 2023-10-16-103237 | 103113 | 103237 | 124 |
| foremans-nid008396-nhosts4-ngpu16-2023-10-16-104037.log | actCkpt_GPT25B | 16 | 2023-10-16-104037 | 2023-10-16-104148 | 104037 | 104148 | 111 |
| foremans-nid008396-nhosts4-ngpu16-2023-10-16-104819.log | actCkpt_GPT25B | 16 | 2023-10-16-104819 | 2023-10-16-110002 | 104819 | 110002 | 5183 |
| foremans-nid008396-nhosts4-ngpu16-2023-10-16-110119.log | actCkpt_GPT25B | 16 | 2023-10-16-110119 | 2023-10-16-110225 | 110119 | 110225 | 106 |
| foremans-nid008701-nhosts4-ngpu16-2023-10-16-113715.log | actCkpt_GPT25B | 16 | 2023-10-16-113715 | 2023-10-16-113824 | 113715 | 113824 | 109 |
| foremans-nid008701-nhosts4-ngpu16-2023-10-16-114236.log | GPT1T_1L | 16 | 2023-10-16-114236 | 2023-10-16-114338 | 114236 | 114338 | 102 |
| foremans-nid008701-nhosts4-ngpu16-2023-10-16-114610.log | GPT1T_1L | 16 | 2023-10-16-114610 | 2023-10-16-114711 | 114610 | 114711 | 101 |
| foremans-nid008701-nhosts4-ngpu16-2023-10-16-114819.log | GPT1T_2L | 16 | 2023-10-16-114819 | 2023-10-16-114953 | 114819 | 114953 | 134 |
| foremans-nid008701-nhosts4-ngpu16-2023-10-16-131058.log | GPT1T_2L | 16 | 2023-10-16-131058 | 2023-10-16-131203 | 131058 | 131203 | 145 |
| foremans-nid008576-nhosts1-ngpu4-2023-10-16-151427.log | GPT1T_1L | 4 | 2023-10-16-151427 | 2023-10-16-151600 | 151427 | 151600 | 173 |
| foremans-nid008576-nhosts1-ngpu4-2023-10-16-152528.log | GPT1T_1L | 4 | 2023-10-16-152528 | 2023-10-16-152640 | 152528 | 152640 | 112 |
| foremans-nid008224-nhosts1-ngpu4-2023-10-16-175717.log | GPT1T_1L | 4 | 2023-10-16-175717 | 2023-10-16-175829 | 175717 | 175829 | 112 |
| foremans-nid008224-nhosts1-ngpu4-2023-10-16-180457.log | GPT1T_1L | 4 | 2023-10-16-180457 | 2023-10-16-180605 | 180457 | 180605 | 148 |
| foremans-nid008224-nhosts1-ngpu4-2023-10-16-183116.log | GPT1T_1L | 4 | 2023-10-16-183116 | 2023-10-16-183216 | 183116 | 183216 | 100 |
| foremans-nid008224-nhosts1-ngpu4-2023-10-16-183921.log | GPT1T_1L | 4 | 2023-10-16-183921 | 2023-10-16-184033 | 183921 | 184033 | 112 |
| foremans-nid008237-nhosts1-ngpu4-2023-10-16-215614.log | GPT1T_1L | 4 | 2023-10-16-215614 | 2023-10-16-215815 | 215614 | 215815 | 201 |
| foremans-nid008385-nhosts1-ngpu4-2023-10-17-052944.log | GPT1T_1L | 4 | 2023-10-17-052944 | 2023-10-17-053139 | 52944 | 53139 | 195 |
| foremans-nid008385-nhosts1-ngpu4-2023-10-17-053529.log | GPT1T_1L | 4 | 2023-10-17-053529 | 2023-10-17-053650 | 53529 | 53650 | 121 |
| foremans-nid008385-nhosts1-ngpu4-2023-10-17-053910.log | GPT1T_1L | 4 | 2023-10-17-053910 | 2023-10-17-054120 | 53910 | 54120 | 210 |
| foremans-nid008385-nhosts1-ngpu4-2023-10-17-054238.log | GPT2_7B | 4 | 2023-10-17-054238 | 2023-10-17-054346 | 54238 | 54346 | 108 |
| foremans-nid008385-nhosts1-ngpu4-2023-10-17-060418.log | GPT1T_1L | 4 | 2023-10-17-060418 | 2023-10-17-060600 | 60418 | 60600 | 182 |
| foremans-nid008385-nhosts1-ngpu4-2023-10-17-061514.log | GPT1T_1L | 4 | 2023-10-17-061514 | 2023-10-17-061653 | 61514 | 61653 | 139 |
| foremans-nid008385-nhosts1-ngpu4-2023-10-17-062102.log | GPT1T_1L | 4 | 2023-10-17-062102 | 2023-10-17-062252 | 62102 | 62252 | 150 |
| foremans-nid008385-nhosts1-ngpu4-2023-10-17-062445.log | GPT1T_1L | 4 | 2023-10-17-062445 | 2023-10-17-062720 | 62445 | 62720 | 275 |
| foremans-nid008333-nhosts2-ngpu8-2023-10-17-064643.log | GPT1T_1L | 8 | 2023-10-17-064643 | 2023-10-17-064848 | 64643 | 64848 | 205 |
| foremans-nid008333-nhosts2-ngpu8-2023-10-17-065806.log | GPT1T_2L | 8 | 2023-10-17-065806 | 2023-10-17-070003 | 65806 | 70003 | 4197 |
| foremans-nid008333-nhosts2-ngpu8-2023-10-17-075152.log | GPT1T_2L | 8 | 2023-10-17-075152 | 2023-10-17-075502 | 75152 | 75502 | 350 |
| foremans-nid008333-nhosts2-ngpu8-2023-10-17-080059.log | GPT1T_2L | 8 | 2023-10-17-080059 | 2023-10-17-080434 | 80059 | 80434 | 375 |
| foremans-nid008333-nhosts2-ngpu8-2023-10-17-081404.log | GPT1T_2L | 8 | 2023-10-17-081404 | 2023-10-17-081920 | 81404 | 81920 | 516 |
| foremans-nid008228-nhosts1-ngpu4-2023-10-17-090344.log | GPT1T_1L | 4 | 2023-10-17-090344 | 2023-10-17-090714 | 90344 | 90714 | 370 |
| foremans-nid008228-nhosts1-ngpu4-2023-10-17-100759.log | GPT1T_1L | 4 | 2023-10-17-100759 | 2023-10-17-100957 | 100759 | 100957 | 198 |
| foremans-nid008404-nhosts4-ngpu16-2023-10-17-182501.log | GPT1T_1L | 16 | 2023-10-17-182501 | 2023-10-17-184001 | 182501 | 184001 | 1500 |
| foremans-nid008404-nhosts4-ngpu16-2023-10-17-193736.log | GPT1T_1L | 16 | 2023-10-17-193736 | 2023-10-17-193856 | 193736 | 193856 | 120 |
| foremans-nid008404-nhosts4-ngpu16-2023-10-17-195432.log | GPT1T_1L | 16 | 2023-10-17-195432 | 2023-10-17-195536 | 195432 | 195536 | 104 |
| foremans-nid008404-nhosts4-ngpu16-2023-10-17-201659.log | GPT1T_2L | 16 | 2023-10-17-201659 | 2023-10-17-201823 | 201659 | 201823 | 164 |
| foremans-nid008404-nhosts4-ngpu16-2023-10-17-202949.log | GPT1T_2L | 16 | 2023-10-17-202949 | 2023-10-17-203054 | 202949 | 203054 | 105 |
| foremans-nid008404-nhosts4-ngpu16-2023-10-17-205848.log | GPT1T_1L | 16 | 2023-10-17-205848 | 2023-10-17-205952 | 205848 | 205952 | 104 |
| foremans-nid008577-nhosts8-ngpu32-2023-10-17-213244.log | GPT1T_1L | 32 | 2023-10-17-213244 | 2023-10-17-213406 | 213244 | 213406 | 162 |
| foremans-nid008577-nhosts8-ngpu32-2023-10-17-213558.log | GPT1T_1L | 32 | 2023-10-17-213558 | 2023-10-17-213720 | 213558 | 213720 | 162 |
| foremans-nid008577-nhosts8-ngpu32-2023-10-17-214900.log | GPT1T_2L | 32 | 2023-10-17-214900 | 2023-10-17-214959 | 214900 | 214959 | 59 |
| foremans-nid008577-nhosts8-ngpu32-2023-10-17-215201.log | GPT1T_2L | 32 | 2023-10-17-215201 | 2023-10-17-215309 | 215201 | 215309 | 108 |
| foremans-nid008577-nhosts8-ngpu32-2023-10-17-215612.log | GPT1T_2L | 32 | 2023-10-17-215612 | 2023-10-17-215726 | 215612 | 215726 | 114 |
| foremans-nid008577-nhosts8-ngpu32-2023-10-17-215938.log | GPT1T_2L | 32 | 2023-10-17-215938 | 2023-10-17-220044 | 215938 | 220044 | 4106 |
| foremans-nid008529-nhosts8-ngpu32-2023-10-18-110001.log | GPT1T_4L | 32 | 2023-10-18-110001 | 2023-10-18-110143 | 110001 | 110143 | 142 |
| foremans-nid008529-nhosts8-ngpu32-2023-10-18-110424.log | GPT1T_8L | 32 | 2023-10-18-110424 | 2023-10-18-110550 | 110424 | 110550 | 126 |
| foremans-nid008244-nhosts4-ngpu16-2023-10-18-110821.log | GPT1T_8L | 16 | 2023-10-18-110821 | 2023-10-18-110952 | 110821 | 110952 | 131 |
| foremans-nid008529-nhosts8-ngpu32-2023-10-18-111345.log | GPT1T_8L | 32 | 2023-10-18-111345 | 2023-10-18-111458 | 111345 | 111458 | 113 |
| foremans-nid008197-nhosts16-ngpu64-2023-10-18-112531.log | GPT1T_16L | 64 | 2023-10-18-112531 | 2023-10-18-112728 | 112531 | 112728 | 197 |
| foremans-nid008456-nhosts16-ngpu64-2023-10-18-113119.log | GPT1T_16L | 64 | 2023-10-18-113119 | 2023-10-18-113343 | 113119 | 113343 | 224 |
| foremans-nid008244-nhosts4-ngpu16-2023-10-18-113131.log | GPT1T_4L | 16 | 2023-10-18-113131 | 2023-10-18-113257 | 113131 | 113257 | 126 |
| foremans-nid008244-nhosts4-ngpu16-2023-10-18-113920.log | GPT1T_4L | 16 | 2023-10-18-113920 | 2023-10-18-114157 | 113920 | 114157 | 237 |
| foremans-nid008197-nhosts16-ngpu64-2023-10-18-114549.log | GPT1T_16L | 64 | 2023-10-18-114549 | 2023-10-18-114721 | 114549 | 114721 | 172 |
| foremans-nid008456-nhosts16-ngpu64-2023-10-18-114636.log | GPT1T_16L | 64 | 2023-10-18-114636 | 2023-10-18-114805 | 114636 | 114805 | 169 |
| foremans-nid008244-nhosts4-ngpu16-2023-10-18-115808.log | GPT1T_4L | 16 | 2023-10-18-115808 | 2023-10-18-120146 | 115808 | 120146 | 4338 |
| foremans-nid008456-nhosts16-ngpu64-2023-10-18-123039.log | GPT1T_16L | 64 | 2023-10-18-123039 | 2023-10-18-123221 | 123039 | 123221 | 182 |
| foremans-nid008389-nhosts2-ngpu8-2023-10-18-123135.log | GPT1T_4L | 8 | 2023-10-18-123135 | 2023-10-18-123300 | 123135 | 123300 | 165 |
| foremans-nid008244-nhosts4-ngpu16-2023-10-18-123206.log | GPT1T_4L | 16 | 2023-10-18-123206 | 2023-10-18-123352 | 123206 | 123352 | 146 |
| foremans-nid008456-nhosts16-ngpu64-2023-10-18-125022.log | GPT1T_16L | 64 | 2023-10-18-125022 | 2023-10-18-125146 | 125022 | 125146 | 124 |
| foremans-nid008256-nhosts8-ngpu32-2023-10-22-122736.log | GPT1T_8L | 32 | 2023-10-22-122736 | 2023-10-22-122844 | 122736 | 122844 | 108 |
| foremans-nid008256-nhosts8-ngpu32-2023-10-22-123824.log | GPT1T_8L | 32 | 2023-10-22-123824 | 2023-10-22-123945 | 123824 | 123945 | 121 |
| foremans-nid008256-nhosts8-ngpu32-2023-10-22-130148.log | GPT1T_8L | 32 | 2023-10-22-130148 | 2023-10-22-130256 | 130148 | 130256 | 108 |
| foremans-nid008256-nhosts8-ngpu32-2023-10-22-131746.log | GPT1T_8L | 32 | 2023-10-22-131746 | 2023-10-22-131909 | 131746 | 131909 | 163 |
| foremans-nid008256-nhosts8-ngpu32-2023-10-22-132700.log | GPT1T_8L | 32 | 2023-10-22-132700 | 2023-10-22-132817 | 132700 | 132817 | 117 |
| foremans-nid008256-nhosts8-ngpu32-2023-10-22-133459.log | GPT1T_8L | 32 | 2023-10-22-133459 | 2023-10-22-133708 | 133459 | 133708 | 249 |
| foremans-nid008380-nhosts4-ngpu16-2023-10-22-175049.log | actCkpt_GPT25B | 16 | 2023-10-22-175049 | 2023-10-22-175230 | 175049 | 175230 | 181 |
| foremans-nid008649-nhosts4-ngpu16-2023-10-22-192352.log | GPT1T_4L | 16 | 2023-10-22-192352 | 2023-10-22-192530 | 192352 | 192530 | 178 |
| foremans-nid008212-nhosts16-ngpu64-2023-10-23-081527.log | GPT1T_8L | 64 | 2023-10-23-081527 | 2023-10-23-081702 | 81527 | 81702 | 175 |
| foremans-nid008344-nhosts2-ngpu8-2023-10-23-091436.log | GPT1T_2L | 8 | 2023-10-23-091436 | 2023-10-23-091610 | 91436 | 91610 | 174 |
| foremans-nid008197-nhosts32-ngpu128-2023-10-24-102617.log | GPT1T_32L | 128 | 2023-10-24-102617 | 2023-10-24-102826 | 102617 | 102826 | 209 |
| foremans-nid008192-nhosts64-ngpu256-2023-10-24-191748.log | GPT1T_64L | 256 | 2023-10-24-191748 | 2023-10-24-192021 | 191748 | 192021 | 273 |
| foremans-nid008192-nhosts128-ngpu512-2023-10-24-201243.log | GPT1T_128L | 512 | 2023-10-24-201243 | 2023-10-24-201629 | 201243 | 201629 | 386 |
| foremans-nid008192-nhosts128-ngpu512-2023-10-26-005401.log | GPT1T_128L | 512 | 2023-10-26-005401 | 2023-10-26-005811 | 5401 | 5811 | 410 |
| foremans-nid008192-nhosts32-ngpu128-2023-10-26-082710.log | GPT1T_32L | 128 | 2023-10-26-082710 | 2023-10-26-083049 | 82710 | 83049 | 339 |
| foremans-nid008585-nhosts2-ngpu8-2023-10-31-044203.log | GPT1T_2L | 8 | 2023-10-31-044203 | 2023-10-31-044533 | 44203 | 44533 | 330 |
| foremans-nid008272-nhosts4-ngpu16-2023-10-31-072717.log | GPT1T_4L | 16 | 2023-10-31-072717 | 2023-10-31-073131 | 72717 | 73131 | 414 |
| foremans-nid008221-nhosts8-ngpu32-2023-10-31-083055.log | GPT1T_8L | 32 | 2023-10-31-083055 | 2023-10-31-083545 | 83055 | 83545 | 490 |
| foremans-nid008196-nhosts16-ngpu64-2023-10-31-100336.log | GPT1T_16L | 64 | 2023-10-31-100336 | 2023-10-31-100848 | 100336 | 100848 | 512 |
| foremans-nid008285-nhosts2-ngpu8-2023-11-01-200430.log | GPT1T_2L | 8 | 2023-11-01-200430 | 2023-11-01-200829 | 200430 | 200829 | 399 |
| foremans-nid008193-nhosts8-ngpu32-2023-11-01-201702.log | GPT1T_8L | 32 | 2023-11-01-201702 | 2023-11-01-202131 | 201702 | 202131 | 429 |
| foremans-nid008240-nhosts16-ngpu64-2023-11-01-210454.log | GPT1T_16L | 64 | 2023-11-01-210454 | 2023-11-01-211007 | 210454 | 211007 | 553 |
| foremans-nid008321-nhosts2-ngpu8-2023-11-02-154438.log | GPT1T_2L | 8 | 2023-11-02-154438 | 2023-11-02-154949 | 154438 | 154949 | 511 |
| foremans-nid008192-nhosts128-ngpu512-2023-11-04-001717.log | GPT1T_128L | 512 | 2023-11-04-001717 | 2023-11-04-002124 | 1717 | 2124 | 407 |
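Once the table is in structured form, summary statistics are straightforward. A sketch using a handful of (model_size, dt) pairs copied from Table 1 (note that the dt column appears to be the raw numeric difference of the HHMMSS stamps rather than elapsed seconds):

```python
from statistics import median

# A few (model_size, dt) pairs taken from the first rows of Table 1 above.
rows = [
    ("GPT1T_1L", 114), ("GPT1T_1L", 111), ("GPT1T_1L", 110),
    ("GPT2_7B", 214), ("GPT2_7B", 214), ("GPT2_7B", 430),
]

# Group dt by model size and report the per-model median.
by_model: dict[str, list[int]] = {}
for model, dt in rows:
    by_model.setdefault(model, []).append(dt)
for model, dts in sorted(by_model.items()):
    print(model, median(dts))
```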

Minimal Working Example

  • As for question 3:

    If we need to report the startup time for the DL applications, do we need to collect measurements using the actual Aurora NRE workloads or some small benchmarking test cases? For example, we can try to recreate the typical start-up scenarios, like library imports, and measure those separately as shown below.

    • I’ve been working on a library to help simplify this:

      ezpz
      Minimal library that handles the initialization of distributed training

  • Example session on Aurora:

    • Setup / Install:

      # launch job
      $ qsub -q EarlyAppAccess -A Aurora_Deployment -l walltime=2:00:00 -l select=4 -I
      
      # load frameworks
      $ module use -a /soft/modulefiles ; module --ignore_cache load frameworks
      $ module load frameworks/.2023.12.15.001
      
      # install `ezpz`
      $ git clone https://github.com/saforem2/ezpz
      $ cd ezpz
      $ mkdir -p venvs/aurora/2023.12.15.001
      $ python3 -m venv venvs/aurora/2023.12.15.001 --system-site-packages
      $ source venvs/aurora/2023.12.15.001/bin/activate
      $ python3 -m pip install -e .
      
      # print job info and define `launch` alias
      $ source ezpz/src/ezpz/bin/savejobenv
      ┌──────────────────────────────────────────────────────────────────
       [Hosts]:
       x4415c6s5b0n0.hostmgmt2415.cm.aurora.alcf.anl.gov
      x4415c6s6b0n0.hostmgmt2415.cm.aurora.alcf.anl.gov
      x4415c6s7b0n0.hostmgmt2415.cm.aurora.alcf.anl.gov
      x4415c7s0b0n0.hostmgmt2415.cm.aurora.alcf.anl.gov
      └──────────────────────────────────────────────────────────────────
      ┌──────────────────────────────────────────────────────────────────
       [DIST INFO]:
       Loading job env from: /home/foremans/.pbsenv
       HOSTFILE: /var/spool/pbs/aux/297306.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
       NHOSTS: 4
       NGPU_PER_HOST: 12
       NGPUS (NHOSTS x NGPU_PER_HOST): 48
       DIST_LAUNCH: mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/297306.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
       Defining alias: launch: aliased to mpiexec --verbose --envall -n 48 -ppn 12 --hostfile /var/spool/pbs/aux/297306.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      └──────────────────────────────────────────────────────────────────
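The values printed by savejobenv combine as NGPUS = NHOSTS × NGPU_PER_HOST. Below is a sketch (not the actual savejobenv implementation) of how the launch command could be assembled from a PBS hostfile, assuming 12 GPUs per host as on Aurora:

```python
# Hypothetical re-derivation of the DIST_LAUNCH command: count hosts in the
# hostfile, multiply by GPUs per host, and format the mpiexec invocation.
def launch_command(hostfile: str, ngpu_per_host: int = 12) -> str:
    with open(hostfile) as f:
        nhosts = len([line for line in f if line.strip()])
    ngpus = nhosts * ngpu_per_host
    return (f"mpiexec --verbose --envall -n {ngpus} "
            f"-ppn {ngpu_per_host} --hostfile {hostfile}")

# Demo with a stand-in hostfile (a real run would use $PBS_NODEFILE):
with open("hostfile.txt", "w") as f:
    f.write("host1\nhost2\nhost3\nhost4\n")
print(launch_command("hostfile.txt"))
```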
    • Launch with framework=pytorch, backend=DDP:

      # ----------------------------------------------------------
      # launch + startup on all workers with
      # • `framework` ∈ {`pytorch`, `tensorflow`}
      # • `backend` ∈ {`horovod`, `deepspeed`, `DDP`}
      # where `deepspeed` and `DDP` are only available with `pytorch`
      # ----------------------------------------------------------
      $ launch python3 -m ezpz framework=pytorch backend=DDP
      [2023-12-19 13:33:24][INFO][dist.py:292] - Using device='xpu'
      [2023-12-19 13:33:26][INFO][dist.py:243] - Using DDP for distributed training
      [2023-12-19 13:33:26][WARNING][dist.py:104] - Using backend='ccl'
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 1 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 2 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 3 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 4 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 0 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 5 / 47
      [2023-12-19 13:33:35][INFO][__main__.py:49] - {
          "_target_": "ezpz.configs.TrainConfig",
          "framework": "pytorch",
          "backend": "DDP",
          "ds_config_path": null,
          "port": null,
          "seed": null,
          "use_wandb": true,
          "wandb_project_name": null,
          "precision": null,
          "ngpus": null
      }
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 9 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 10 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 11 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 7 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 8 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 6 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 12 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 13 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 14 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 15 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 18 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 19 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 20 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 21 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 22 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 23 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 24 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 25 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 26 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 27 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 30 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 16 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 17 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 28 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 32 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 33 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 36 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 37 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 38 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 39 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 43 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 46 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 29 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 47 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 31 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 34 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 35 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 42 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 41 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 44 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 45 / 47
      [2023-12-19 13:33:35][INFO][dist.py:307] - RANK: 40 / 47
      [2023-12-19 13:33:47][INFO][dist.py:415] - Setting up wandb from rank: 0
      [2023-12-19 13:33:47][INFO][dist.py:416] - Using: WB PROJECT: ezpz
      [2023-12-19 13:33:58][INFO][dist.py:448] - W&B RUN: [flowing-wood-8](https://wandb.ai/l2hmc-qcd/ezpz/runs/uya29gm5)
      [2023-12-19 13:33:58][INFO][dist.py:490] - Running on x4415c6s5b0n0.hostmgmt2415.cm.aurora.alcf.anl.gov
      [2023-12-19 13:33:58][INFO][dist.py:506] - Reading hosts from /var/spool/pbs/aux/297306.aurora-pbs-0001.hostmgmt.cm.aurora.alcf.anl.gov
      [2023-12-19 13:33:58][INFO][__main__.py:57] - Output dir: /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2023-12-19/13-33-17
      [2023-12-19 13:33:58][CRITICAL][dist.py:519] - 🚀 flowing-wood-8
      [2023-12-19 13:33:58][CRITICAL][dist.py:520] - 🔗 https://wandb.ai/l2hmc-qcd/ezpz/runs/uya29gm5
      [2023-12-19 13:33:58][CRITICAL][dist.py:521] - 📂/: /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2023-12-19/13-33-17/wandb/run-20231219_133354-uya29gm5/files
      [2023-12-19 13:33:58][INFO][dist.py:563] - Adding /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/ezpz-pt-DDP-xpu.log to W&B artifact...
      [2023-12-19 13:33:58][INFO][dist.py:563] - Adding /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2023-12-19/13-33-17/__main__.log to W&B artifact...
      [2023-12-19 13:33:58][INFO][dist.py:563] - Adding /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2023-12-19/13-33-17/main_debug.log to W&B artifact...
      [2023-12-19 13:33:58][INFO][dist.py:563] - Adding /lus/gecko/projects/Aurora_deployment/foremans/projects/saforem2/ezpz/src/ezpz/outputs/runs/pytorch/DDP/2023-12-19/13-33-16/__main__.log to W&B artifact...
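As a rough check, the elapsed wall time of this startup can be read straight off the log prefixes. A hypothetical helper (assuming only the `[YYYY-MM-DD HH:MM:SS]` prefix format shown above):

```python
from datetime import datetime

# Measure elapsed wall time between the earliest and latest log lines,
# using the fixed-width [YYYY-MM-DD HH:MM:SS] prefix.
def elapsed_seconds(lines: list[str]) -> float:
    fmt = "%Y-%m-%d %H:%M:%S"
    stamps = [datetime.strptime(line[1:20], fmt)
              for line in lines if line.startswith("[")]
    return (max(stamps) - min(stamps)).total_seconds()

log = [
    "[2023-12-19 13:33:24][INFO][dist.py:292] - Using device='xpu'",
    "[2023-12-19 13:33:58][INFO][dist.py:563] - Adding log to W&B artifact...",
]
print(elapsed_seconds(log))  # 34.0
```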