 Command
Site Info

sam.onl is a terminal-style knowledge base and notes hub, built with Astro + WebTUI. Use the keybinds below to move between navbar, content, and sidebars, then customize the look with the theme picker.

Theme
Keybinds
Navigation
j / ↓ Next item k / ↑ Previous item g First item in region G Last item in region zz Center focused item h / l Move left/right region ] / [ Next/previous heading } / { Next/previous block ⌃D / ⌃U Half-page down/up
Layout
<zh> / <zl> Toggle left/right sidebar <zj> / <zk> Focus main/navbar <S-h/j/k/l> Focus left/main/navbar/right ⌃H / ⌃L Focus left/right sidebar ⌃J / ⌃K Focus main/navbar
Dialogs
⌃P / : Command palette ⌃X Theme picker / Search ? Show keybinds Esc / ⌃C Close dialog
History
⌃N Next document ⌃B Previous document ⌃O History back ⌃I History forward
 Search
landing: Sam Foreman about: 🪪 About docs/test: Docs Test ideas: 💡 Ideas now: Now more: ➕ More projects: 📚 Projects posts: 📬 Posts talks: 🎙️ Talks posts/2025: 📆 2025 posts/auroragpt: 🤖 AuroraGPT posts/ai-for-physics: ⚛️ AI for Physics posts/dope-slides: 💅 How to Make Dope Slides posts/ezpz-at-alcf: 🍋 ezpz @ ALCF posts/jupyter: 📗 Jupyter posts/resume: 🧑🏻‍💻 Sam Foreman’s Résumé posts/ezpz-v1: 📝 ezpz-v1 posts/torchtune-aurora: 🪛 Torchtune on Aurora posts/torchtune-patch-aurora: 🚑 Torchtune Patch on Aurora posts/svgbob: 🫥 svgbob talks/auroragpt-siam25: AuroraGPT talks/ai-for-science-2024: Parallel Training Methods talks/alcf-hpc-workshop-2024/alcf-hpc-workshop-2024: Deep Learning and Foundation Models at Scale talks/aurora-gpt-fm-for-electric-grid/auroragpt-fm-for-electric-grid: AuroraGPT: Foundation Models for Science talks/hpc-user-forum/auroragpt: AuroraGPT talks/incite-hackathon-2025: ALCF Incite Hackathon 2025 talks/llms-at-scale: Training LLMs at Scale talks/llms-on-polaris: Training LLMs on Polaris talks/openskai25: Open SkAI2025 webtui/components/accordion: Accordion webtui/components/badge: Badge webtui/components/button: Button webtui/components/checkbox: Checkbox webtui/components/dialog: Dialog webtui/components/input: Input webtui/components/popover: Popover webtui/components/pre: Pre webtui/components/progress: Progress webtui/components/radio: Radio webtui/components/range: Range webtui/components/separator: Separator webtui/components/spinner: Spinner webtui/components/switch: Switch webtui/components/table: Table webtui/components/textarea: Textarea webtui/components/tooltip: Popover webtui/components/typography: Typography webtui/components/view: View webtui/contributing/contributing: Contributing webtui/contributing/contributing: ## Local Development webtui/contributing/contributing: ## Issues webtui/contributing/contributing: ## Pull Requests webtui/contributing/style-guide: Style Guide webtui/contributing/style-guide: ## CSS Units webtui/contributing/style-guide: ## Selectors webtui/contributing/style-guide: ## Documentation webtui/installation/astro: Astro webtui/installation/astro: ## Scoping webtui/installation/astro: ### Frontmatter Imports webtui/installation/astro: ### <style> tag webtui/installation/astro: ### Full Library Import webtui/installation/nextjs: Next.js webtui/plugins/plugin-dev: Developing Plugins webtui/plugins/plugin-dev: ### Style Layers webtui/installation/vite: Vite webtui/plugins/theme-catppuccin: Catppuccin Theme webtui/plugins/plugin-nf: Nerd Font Plugin webtui/plugins/theme-custom: Custom Theme webtui/plugins/theme-everforest: Everforest Theme webtui/plugins/theme-gruvbox: Gruvbox Theme webtui/plugins/theme-nord: Nord Theme webtui/plugins/theme-vitesse: Vitesse Theme webtui/start/ascii-boxes: ASCII Boxes webtui/start/changelog: Changelog webtui/start/installation: Installation webtui/start/installation: ## Installation webtui/start/installation: ## Using CSS webtui/start/installation: ## Using ESM webtui/start/installation: ## Using a CDN webtui/start/installation: ## Full Library Import webtui/start/installation: ### CSS webtui/start/installation: ### ESM webtui/start/installation: ### CDN webtui/start/intro: Introduction webtui/start/intro: ## Features webtui/start/plugins: Plugins webtui/start/plugins: ## Official Plugins webtui/start/plugins: ### Themes webtui/start/plugins: ## Community Plugins webtui/start/theming: Theming webtui/start/theming: ## CSS Variables webtui/start/theming: ### Font Styles webtui/start/theming: ### Colors webtui/start/theming: ### Light & Dark webtui/start/theming: ## Theme Plugins webtui/start/theming: ### Using Multiple Theme Accents webtui/start/tuis-vs-guis: TUIs vs GUIs webtui/start/tuis-vs-guis: ## Monospace Fonts webtui/start/tuis-vs-guis: ## Character Cells posts/2025/06: 06 posts/auroragpt/aurora-gpt: 🏎️ Megatron-DeepSpeed on Intel XPU posts/auroragpt/checkpoints: 💾 Converting Checkpoints posts/auroragpt/flash-attn-sunspot: 📸 `flash-attn` on Sunspot posts/auroragpt/determinstic-flash-attn/deterministic-flash-attn: 🎰 Deterministic `flash-attn` posts/auroragpt/long-sequences: 🚂 Loooooooong Sequence Lengths posts/auroragpt/mpi4py-reproducer: 🐛 `mpi4py` bug on Sunspot posts/auroragpt/spike-skipper: 🏔️ Spike Skipper posts/auroragpt/startup-times: 🐢 Starting Up Distributed Training on Aurora posts/auroragpt/startup-times: ## Response posts/auroragpt/startup-times: ### Measuring / Calculating Startup Time posts/auroragpt/startup-times: ## Minimal Working Example posts/ai-for-physics/diffusion: 🎲 MCMC + Diffusion Sampling posts/ai-for-physics/l2hmc-qcd: 🎢 L2HMC for LQCD posts/jupyter/l2hmc-4dsu3: 🔳 `l2hmc-qcd` Example: 4D SU(3) posts/jupyter/test: 🏁 `l2hmc` Example: 2D $U(1)$ talks/auroragpt/alcf-hpc-workshop-2024/auroragpt-alcf-hands-on-hpc-workshop-2024: AuroraGPT: ANL's General Purpose Scientific LLM talks/incite-hackathon-2025/auroragpt: LLMs on Aurora: Overview talks/incite-hackathon-2025/ezpz: LLMs on Aurora: Hands-On talks/openskai25/ai4science: Scientific AI at Scale: AuroraGPT talks/openskai25/training: Scientific AI at Scale: Distributed Training posts/2025/04/28: 🔥 Building PyTorch 2.6 from Source on Aurora posts/2025/05/03: 🚧 Frameworks Issue with numpy \> 2 posts/2025/06/01: 📰 Nice Headings posts/2025/06/02: 🧜‍♀️ Mermaid posts/2025/06/14: 🏗️ Building PyTorch 2.8 from Source on Aurora posts/2025/10/06: 🎨 Mixing Between Distributions While Training posts/2025/09/12: 🍹 BlendCorpus + TorchTitan @ ALCF posts/2025/09/17: 📊 `pbs-tui`: TUI for PBS Job Scheduler Monitoring posts/2025/11/12: 🧊 Cooling Down Checkpoints: Best Practices for Model Evaluation posts/2026/01/07: 🎉 Happy New Year! posts/2026/01/10: 🍋 ezpz posts/2026/02/28: ⏱️ Comparing Launchers on Aurora posts/2026/02/28: ## torchrun posts/2026/02/28: ## ezpz posts/ai-for-physics/l2hmc-qcd/2du1: 🎢 l2hmc-qcd Example: 2D U(1) posts/jupyter/l2hmc/4dsu3: 🔳 l2hmc-qcd Example: 4D SU(3) talks/2025/09/24: Training Foundation Models on Supercomputers talks/2025/10/08: AERIS: Argonne's Earth Systems Model posts/ai-for-physics/l2hmc-qcd/4dsu3nb/index-broken: 🕸️ l2hmc-qcd Example: 4D SU(3) talks/2025/10/15: Training Foundation Models on Supercomputers talks/2025/12/16: AuroraGPT: Training Foundation Models on Supercomputers talks/2025/10/24: Training Foundation Models on Supercomputers posts/drafts/2025/09/22: 📝 2025 Annual Report
 Theme Current: Light j/k or ↑/↓ + Enter

🏔️ Spike Skipper

Sam Foreman 2024-09-17

Details

We describe below our implementation for skipping individual steps during training.

📝 Example

Suppose we observe a large spike in our loss curve, as shown below:

spike-skipper

Seemingly, this spike is being caused by a batch of “bad data”. In order to prevent this “bad data” sample from corrupting our training, we would like to “skip” that particular training step.

This can be accomplished by passing the keyword argument --train-range-to-skip and specifying the endpoints of the ranges to be skipped.

e.g., if you would like to skip all steps from [10, 20] and from [25, 30], we would specify:

PBS_O_WORKDIR=$(pwd) bash train_aGPT_7B.sh \
    --train-range-to-skip 10 20 25 30

🧪 Implementation

We discuss below the details of the implementation, and provide some simple results to confirm things are behaving how we expect.

  1. Check if args.train_range_to_skip is not None [here]

    • Assert len(args.train_range_to_skip) % 2 == 0 [here]

      Must be even since we’re specifying the endpoints of intervals to skip

    • Zip these up into pairs [here]:

      ranges_to_skip = list(
          zip(
              args.train_range_to_skip[::2],
              args.train_range_to_skip[1::2]
          )
      )
  2. If current iteration is in any of these pairs [here]

✅ Sanity Check

In order to confirm things are behaving as expected, we can explicitly look at the tokens drawn for each step, and ensure that they are the same regardless of whether or not that iteration was skipped.

  • In particular, we see that:

    • test 1:

      # [2024-09-16 23:09:09.059118][INFO][training:1083] - iteration=2 [0/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor(
          [[ 1858,  3851, 29889,  ...,   500,    13,    13],
           [  349,  6156,  1650,  ...,  5806, 28557,  3519],
           [16554,   304,  1653,  ...,   322,  6934, 14722],
           [ 4955,   310, 10465,  ...,  1438,  3841, 29892]]
      )
      # [2024-09-16 23:09:09.061999][INFO][training:1083] - iteration=2 [1/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor(
          [[  363,  1302, 16453,  ...,  7967, 29891,   484],
           [  367,   766,  4752,  ...,     1, 29871, 30143],
           [29899,   855,  1503,  ...,  3786, 29892,  5100],
           [  465,  1974,   289,  ..., 21588,   533,   304]]
      )
    • test 2:

      # [2024-09-16 22:59:27.752277][INFO][pretrain_gpt_alcf:198] - args.iteration=2:
      data['text'][:10]=tensor(
          [[ 1858,  3851, 29889,  ...,   500,    13,    13],
           [  349,  6156,  1650,  ...,  5806, 28557,  3519],
           [16554,   304,  1653,  ...,   322,  6934, 14722],
           [ 4955,   310, 10465,  ...,  1438,  3841, 29892]]
      )
      # [2024-09-16 22:59:27,755] [INFO] [profiler.py:81:start_profile] Flops profiler started
      # [2024-09-16 22:59:28.568805][INFO][pretrain_gpt_alcf:198] - args.iteration=2:
      data['text'][:10]=tensor(
          [[363,  1302, 16453,  ...,  7967, 29891,   484],
           [  367,   766,  4752,  ...,     1, 29871, 30143],
           [29899,   855,  1503,  ...,  3786, 29892,  5100],
           [  465,  1974,   289,  ..., 21588,   533,   304]]
      )

    as expected.

🔍 Details

  • First 4 steps:

    tokens:
    • Iteration 0:

      [2024-09-16 22:58:50.168667][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  304,  7344,  5146,  ...,  9776, 29914, 26419],
              [29889,    13,  4706,  ...,  9280, 30004,    13],
              [29943, 20774, 29908,  ...,   304, 27391,   322],
              [ 2645,   445, 29871,  ..., 16888,  4656, 10070]])
      [2024-09-16 22:58:58.866409][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[ 2768,   596,  1788,  ..., 27274,   393, 30010],
              [  278,  5613,  4192,  ...,   362,   310,  1950],
              [28038, 29892,  2022,  ...,  3160,   278,  2087],
              [ 4149,   907, 29888,  ..., 29896, 29892, 29896]])
      [2024-09-16 22:59:02.043059][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  424,   322, 16232,  ...,   366,   748,   467],
              [   13,   462,  1678,  ...,  2084, 29892,  3497],
              [ 7562,   310, 19320,  ...,  8973, 22684,   358],
              [ 2089,  3633,   292,  ..., 13774,   269,  2375]])
      [2024-09-16 22:59:03.456919][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[21411,   322,  3896,  ...,  2610, 29889,   319],
              [ 8003, 29898, 29900,  ...,    12,  6658,   529],
              [  278,  4148,   310,  ...,   263, 12212,   282],
              [ 5977, 29871, 29906,  ..., 15332,   310,  1749]])
      [2024-09-16 22:59:04.596630][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  278,  1473, 24987,  ...,   263,  2217,  3804],
              [ 2973,   263, 18778,  ...,   263,  4642,  6673],
              [  309,   323,   804,  ...,  1063, 15296,   327],
              [  278,  5864,   322,  ...,  9409, 29889,  2178]])
      [2024-09-16 22:59:05.486913][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[29892, 13731,  6617,  ..., 29871, 29896, 29946],
              [ 2892,  1012,  1266,  ...,  4036,  7512,  2068],
              [ 1473,  1556,  3619,  ...,  3762,   338,   263],
              [23353, 29918,  2177,  ...,   501,   567,   814]])
      [2024-09-16 22:59:06.361333][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[ 5400, 14378,  4768,  ...,  2107, 18677, 29889],
              [ 9200, 29887, 29914,  ...,   293, 24235,   322],
              [30143,  4746,  2184,  ..., 11891, 29974, 25760],
              [19263, 29914,   303,  ...,   358, 29889,    13]])
      [2024-09-16 22:59:07.230671][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  309,  1306,   681,  ...,   310, 23186, 21809],
              [29896, 29929,    13,  ..., 29871, 29900,    13],
              [ 9558,   964,   263,  ...,   322,   282,   682],
              [  278, 23904, 21767,  ...,   313, 29929, 29889]])
    • Iteration 1:

      [2024-09-16 22:59:19.287338][INFO][training_log:661] -  iteration=       1/  635782 | consumed_samples=         768 | consumed_tokens=     3145728 | elapsed_time_per_iteration_ms=29570.0 | learning_rate=9.4372e-09 | global_batch_size=768 | lm loss=11.167250 | loss_scale=1.0 | grad_norm=6.363 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=25.972 | tokens_per_gpu_per_second_tgs=4432.597 | [LM]TFLOPs=20.30 | [DS]TFLOPs=26.18 |
      [2024-09-16 22:59:19.289582][INFO][utils:207] - [Rank 0] (after 1 iterations) memory (MB) | allocated: 1894.57666015625 | max allocated: 9752.35498046875 | reserved: 11342.0 | max reserved: 11342.0
      (min, max) time across ranks (ms):
        forward-backward ...............................: (26094.39, 26095.09)
        optimizer ......................................: (3407.56, 3409.92)
      [2024-09-16 22:59:19.297183][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[ 1472, 29892,   408,  ..., 29892,  1584,   363],
            [  967, 19475,  6593,  ...,  8093, 29899, 11249],
            [ 1006,  2218, 13326,  ...,  2355,  1304,   304],
            [29900, 29916, 29947,  ...,   353,  1870, 29936]])
      [2024-09-16 22:59:20.104352][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[ 2354,   274,  1041,  ..., 29892, 13049,  9098],
            [ 8798,  9547, 10353,  ...,   303,  3143, 29889],
            [ 1373,  4056,  7236,  ...,  3186,   297,  5837],
            [ 1738, 29920,  7355,  ...,    13, 29871,  3776]])
      [2024-09-16 22:59:20.977036][INFO][utils:326] -  >> building dataset for /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/c4-0000_text_document
      [2024-09-16 22:59:20.977877][INFO][utils:326] -  > building dataset index ...
      [2024-09-16 22:59:20.977147][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[ 2020,   306,  1016,  ...,   322,   920,   372],
            [ 5921,  1749,  7306,  ..., 19252,   297,  5664],
            [  970,   770, 28547,  ...,   970,   894,  2577],
            [ 1907,   363, 14188,  ...,   756,  3646,   287]])
      [2024-09-16 22:59:21.851620][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[  715, 25392,  3104,  ...,   289,  5761,   616],
            [  426,    13,  9651,  ...,  9651,  1815, 22603],
            [ 7714,  1213,    13,  ...,    13, 29876,   457],
            [29889, 28663,  1230,  ...,  1546,   278,  6586]])
      [2024-09-16 22:59:22.720945][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[29929,    13,    13,  ..., 10739,  4770, 11277],
            [ 4528,   304,  2367,  ...,  2501,   385,  4203],
            [  869,   319,   794,  ...,  3158, 29889,  3115],
            [  592,   260,  4125,  ...,   284,  1135, 18655]])
      [2024-09-16 22:59:23.590149][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[14338, 25323,  3321,  ...,  5607,  1806,  1164],
            [  322,   278, 15352,  ...,  6462,   313,  1552],
            [25738,   714, 29889,  ..., 29915, 29879, 24842],
            [ 5122,   399, 29889,  ..., 29947,  7284,  2305]])
      [2024-09-16 22:59:24.457646][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[  367, 19310,  1891,  ...,  2408,   292,   263],
            [  470,  3307,  5713,  ...,   568,  2594, 19385],
            [29953, 29905,  1631,  ...,  1118,   343, 29897],
            [10261,   373,  5490,  ...,   511,   297,  1760]])
      [2024-09-16 22:59:25.326699][INFO][pretrain_gpt_alcf:198] - args.iteration=1: data['text'][:10]=tensor([[ 1006,   326, 29901,  ..., 14834,  6694,  9595],
            [12058,  5446, 29892,  ..., 29889,  8246,  3310],
            [ 7483,   310,   278,  ...,   402,  9851,  4423],
            [ 8041,   813,   322,  ...,  3303,  3900,   393]])
    • Iteration 2:

      [2024-09-16 22:59:27.744603][INFO][training_log:661] -  iteration=       2/  635782 | consumed_samples=        1536 | consumed_tokens=     6291456 | elapsed_time_per_iteration_ms=8457.2 | learning_rate=1.88744e-08 | global_batch_size=768 | lm loss=11.164009 | loss_scale=1.0 | grad_norm=6.271 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=90.810 | tokens_per_gpu_per_second_tgs=15498.234 | [LM]TFLOPs=70.98| [DS]TFLOPs=91.53 |
      (min, max) time across ranks (ms):
        forward-backward ...............................: (8384.83, 8385.57)
        optimizer ......................................: (55.03, 55.61)
      [2024-09-16 22:59:27.752277][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[ 1858,  3851, 29889,  ...,   500,    13,    13],
            [  349,  6156,  1650,  ...,  5806, 28557,  3519],
            [16554,   304,  1653,  ...,   322,  6934, 14722],
            [ 4955,   310, 10465,  ...,  1438,  3841, 29892]])
      [2024-09-16 22:59:27,755] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:28.568805][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[  363,  1302, 16453,  ...,  7967, 29891,   484],
            [  367,   766,  4752,  ...,     1, 29871, 30143],
            [29899,   855,  1503,  ...,  3786, 29892,  5100],
            [  465,  1974,   289,  ..., 21588,   533,   304]])
      [2024-09-16 22:59:28,571] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:29.440843][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[29889,    13,  4806,  ...,  3086, 26040,  9220],
            [  293,  7207,   355,  ..., 18131,   520,  1247],
            [ 8619, 29889, 29871,  ...,   304, 10029,   266],
            [  363, 15202, 29892,  ...,   482, 17162, 19104]])
      [2024-09-16 22:59:29,443] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:30.313403][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[25561,   411,   278,  ...,   297,  2898, 26163],
            [22574,  2607, 18134,  ...,    13,  4706,   500],
            [20190, 24820,  1623,  ...,   310,   901, 29892],
            [29892,  1951,  4486,  ...,   869,   887, 30010]])
      [2024-09-16 22:59:30,316] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:31.185339][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[ 5371, 22417, 29892,  ...,    13,  6716,   901],
            [  353,  1565, 29936,  ..., 29878,  3567,  7196],
            [17296,   338,  1985,  ...,  3741,  9089,   422],
            [  694, 13331,   310,  ..., 21180, 29892,   607]])
      [2024-09-16 22:59:31,188] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:32.057207][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[  292,  8818,   267,  ..., 29892, 11275,  7407],
            [ 1870, 29897,    13,  ...,  2697, 29901,    13],
            [29913,   338,   263,  ..., 29892,   591,  3394],
            [ 2253,   472,  1554,  ...,   982,   304,   376]])
      [2024-09-16 22:59:32,060] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:32.930293][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[  391,  2598, 29883,  ..., 22629,   346,   440],
            [29871, 29896, 29906,  ...,   407,   583,  2833],
            [ 4262,  1836,    13,  ...,   310,   263, 10608],
            [ 1199,   411, 24770,  ...,   272,  2153, 29889]])
      [2024-09-16 22:59:32,932] [INFO] [profiler.py:81:start_profile] Flops profiler started
      [2024-09-16 22:59:33.803567][INFO][pretrain_gpt_alcf:198] - args.iteration=2: data['text'][:10]=tensor([[  620, 20503,   428,  ...,   297,  1009,  9443],
            [  950, 25078,   892,  ...,   408, 10636,   284],
            [ 1012,  2003,   364,  ...,  7313, 29912, 19303],
            [29906, 29892, 29945,  ...,   967, 26414,   472]])
    • Iteration 3:

      [2024-09-16 22:59:34.881265][INFO][training_log:661] -  iteration=       3/  635782 | consumed_samples=        2304 | consumed_tokens=     9437184 | elapsed_time_per_iteration_ms=7136.5 | learning_rate=2.83116e-08 | global_batch_size=768 | lm loss=11.164038 | loss_scale=1.0 | grad_norm=6.279 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=107.615 | tokens_per_gpu_per_second_tgs=18366.372 | [LM]TFLOPs=84.12 | [DS]TFLOPs=108.46 |
      (min, max) time across ranks (ms):
        forward-backward ...............................: (7078.48, 7079.28)
        optimizer ......................................: (38.62, 43.08)
      [2024-09-16 22:59:34.888870][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[  496,   313, 29941,  ...,  1316,   408,  4857],
            [29899,  3204, 29889,  ...,  1074,   330,  2547],
            [29916, 29900, 29946,  ..., 18455, 29889,  4002],
            [26406,   338,  1641,  ...,   670,  1914,  6900]])
      [2024-09-16 22:59:35.719630][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[29945, 29900,   867,  ...,  7601, 12091,   310],
            [  975, 29871, 29896,  ...,  3573,   825,   306],
            [29906, 29900,  4638,  ..., 29227, 23145, 29892],
            [  278, 14368,   322,  ..., 14909, 29936, 25913]])
      [2024-09-16 22:59:36.591343][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[  988,   306,  1033,  ...,   437,   408,  1532],
            [  450, 10317,   310,  ...,   322,   752, 13036],
            [11405,  8020, 29889,  ...,   471, 18096,   287],
            [  288,  3594, 19284,  ...,   910,   338,   385]])
      [2024-09-16 22:59:37.463941][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[  322, 15151, 29946,  ..., 11648,  1497, 29889],
            [24233,   362,   467,  ...,  4513,  1353,   322],
            [ 3311, 13605, 29912,  ...,   945, 29899,  4181],
            [ 1951,   366,   508,  ...,  6589,   491,   777]])
      [2024-09-16 22:59:38.343307][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[29889, 29900,    13,  ...,  6017,   424,  1711],
            [  297,  5500,  1489,  ...,   310,  3802,  7875],
            [ 8078,  5314,   515,  ...,   373,   278,  6991],
            [13763,  6204,  6359,  ...,  4706,  2024,  1347]])
      [2024-09-16 22:59:39.214871][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[   13,  4806,  3512,  ...,   278,  7824,  6438],
            [ 2294,   938,   903,  ...,  4537,  3047,   449],
            [ 1230,  4123,   767,  ...,   310,   963, 21003],
            [ 1152,  2319, 10365,  ...,   367, 14040,   363]])
      [2024-09-16 22:59:40.085368][INFO][utils:326] -  >> building dataset for /flare/Aurora_deployment/AuroraGPT/datasets/dolma/data_v1.7_Llama2Tokenizer/tulu_flan-0000_text_document
      [2024-09-16 22:59:40.086224][INFO][utils:326] -  > building dataset index ...
      [2024-09-16 22:59:40.085475][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[27297, 29924,   801,  ..., 28947, 29892,   470],
            [12542,  5568,   703,  ...,   426,    13,  4706],
            [ 6907,   800,   322,  ..., 29892,  1661, 30304],
            [29900, 13630,   293,  ..., 26552,   363,   975]])
      [2024-09-16 22:59:40.958071][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[   13, 29946, 29953,  ..., 29953, 29945, 29871],
            [ 1283, 16578,  1156,  ...,   408,  2215,   408],
            [29906,  4229,  7671,  ...,    13,  1576,  1014],
            [  526,  2898,   304,  ...,   471,  4802, 29991]])
    • Iteration 4:

      [2024-09-16 22:59:42.028460][INFO][training_log:661] -  iteration=       4/  635782 | consumed_samples=        3072 | consumed_tokens=    12582912 | elapsed_time_per_iteration_ms=7147.0 | learning_rate=3.77488e-08 | global_batch_size=768 | lm loss=11.171233 | loss_scale=1.0 | grad_norm=6.272 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=107.458 | tokens_per_gpu_per_second_tgs=18339.524 | [LM]TFLOPs=84.00 | [DS]TFLOPs=108.31 |
      (min, max) time across ranks (ms):
        forward-backward ...............................: (7091.77, 7092.56)
        optimizer ......................................: (39.21, 40.11)
      [2024-09-16 22:59:42.035716][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[  443,   666, 10170,  ...,   278,   619,  7323],
            [   13, 11008,   338,  ...,  2472,   363, 22049],
            [29871,    13, 29938,  ..., 29962,  8521, 29896],
            [ 1165,  2280,   304,  ...,   306,   471,  2086]])
      [2024-09-16 22:59:42.860756][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[  304,   679,   304,  ...,  1475, 29889,  1570],
            [ 2184, 29936,    13,  ...,  4706,   970,  1780],
            [29872,   352, 29901,  ..., 29905,  4915, 29912],
            [16809,   304,  1438,  ..., 13457, 29889,    13]])
      [2024-09-16 22:59:43.731208][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[12015, 29901, 20549,  ...,   322, 10752, 17906],
            [  372, 30010, 29879,  ..., 29892, 14595,   653],
            [18280, 29958,    13,  ..., 18884,   736,  6251],
            [29889,    13,    13,  ...,   599,   373, 17097]])
      [2024-09-16 22:59:44.605047][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[ 3367,   567,   964,  ...,  3353, 24870,  2181],
            [ 1262,  2609,   367,  ..., 29974, 29896,  7570],
            [29871, 29941, 29900,  ...,   341,   555,   265],
            [ 4225,   526,  6041,  ...,  1925,  1623,  2748]])
      [2024-09-16 22:59:45.479433][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[ 2283, 10162,  1496,  ..., 30656, 30317, 30605],
            [29879,  9228,   292,  ...,  7968, 29899,  7052],
            [  884,   599,   367,  ..., 29892,   278,  6054],
            [29879,   411,   278,  ...,   367,  5019,  1183]])
      [2024-09-16 22:59:46.351707][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[ 3867,   281,   761,  ..., 11949,   338,  4922],
            [  297,  1432,  2586,  ...,  5414,   278, 29811],
            [29892,   278, 15562,  ..., 10296,   310,   394],
            [ 1451,  2960,  3505,  ...,   657, 14346,  8003]])
      [2024-09-16 22:59:47.222016][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[ 6601,  2874,   414,  ...,   302,   317,  7390],
            [16415,   297,  5146,  ...,   763,   372,   471],
            [29941, 29906,  1118,  ..., 29900, 29889, 29953],
            [ 4893,   304,  4808,  ...,  2284,  2164, 18690]])
      [2024-09-16 22:59:48.094752][INFO][pretrain_gpt_alcf:198] - args.iteration=4: data['text'][:10]=tensor([[  901,   310,  2994,  ..., 29873,  1641,   766],
            [  304,  1716,  2562,  ...,  3489,   304,   367],
            [ 1949,  6736, 29871,  ..., 29965, 29909,   353],
            [   13,    13, 29930,  ..., 16497,   316,   474]])
  • Skipping steps [2, 3]:

    tokens:
    • Iteration 0:

      [2024-09-16 23:08:47.749839][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  304,  7344,  5146,  ...,  9776, 29914, 26419],
        [29889,    13,  4706,  ...,  9280, 30004,    13],
        [29943, 20774, 29908,  ...,   304, 27391,   322],
        [ 2645,   445, 29871,  ..., 16888,  4656, 10070]])
      [2024-09-16 23:08:51.451183][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[ 2768,   596,  1788,  ..., 27274,   393, 30010],
        [  278,  5613,  4192,  ...,   362,   310,  1950],
        [28038, 29892,  2022,  ...,  3160,   278,  2087],
        [ 4149,   907, 29888,  ..., 29896, 29892, 29896]])
      [2024-09-16 23:08:54.073597][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  424,   322, 16232,  ...,   366,   748,   467],
        [   13,   462,  1678,  ...,  2084, 29892,  3497],
        [ 7562,   310, 19320,  ...,  8973, 22684,   358],
        [ 2089,  3633,   292,  ..., 13774,   269,  2375]])
      [2024-09-16 23:08:56.212476][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[21411,   322,  3896,  ...,  2610, 29889,   319],
        [ 8003, 29898, 29900,  ...,    12,  6658,   529],
        [  278,  4148,   310,  ...,   263, 12212,   282],
        [ 5977, 29871, 29906,  ..., 15332,   310,  1749]])
      [2024-09-16 23:08:57.207940][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  278,  1473, 24987,  ...,   263,  2217,  3804],
        [ 2973,   263, 18778,  ...,   263,  4642,  6673],
        [  309,   323,   804,  ...,  1063, 15296,   327],
        [  278,  5864,   322,  ...,  9409, 29889,  2178]])
      [2024-09-16 23:08:58.083935][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[29892, 13731,  6617,  ..., 29871, 29896, 29946],
        [ 2892,  1012,  1266,  ...,  4036,  7512,  2068],
        [ 1473,  1556,  3619,  ...,  3762,   338,   263],
        [23353, 29918,  2177,  ...,   501,   567,   814]])
      [2024-09-16 23:08:58.951793][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[ 5400, 14378,  4768,  ...,  2107, 18677, 29889],
        [ 9200, 29887, 29914,  ...,   293, 24235,   322],
        [30143,  4746,  2184,  ..., 11891, 29974, 25760],
        [19263, 29914,   303,  ...,   358, 29889,    13]])
      [2024-09-16 23:08:59.820234][INFO][pretrain_gpt_alcf:198] - args.iteration=0: data['text'][:10]=tensor([[  309,  1306,   681,  ...,   310, 23186, 21809],
        [29896, 29929,    13,  ..., 29871, 29900,    13],
        [ 9558,   964,   263,  ...,   322,   282,   682],
        [  278, 23904, 21767,  ...,   313, 29929, 29889]])
    • Iteration 1:

      [2024-09-16 23:09:08.943867][INFO][training_log:661] -  iteration=       1/  635782 | consumed_samples=         768 | consumed_tokens=     3145728 | elapsed_time_per_iteration_ms=21224.4 | learning_rate=9.4372e-09 | global_batch_size=768 | lm loss=11.167250 | loss_scale=1.0 | grad_norm=6.363 | actual_seqlen= 4096 | number_of_skipped_iterations=  0 | number_of_nan_iterations=  0 | samples_per_second=36.185 | tokens_per_gpu_per_second_tgs=6175.523 | [LM]TFLOPs=28.29 | [DS]TFLOPs=36.47 |
      [2024-09-16 23:09:08.953432][INFO][training:1083] - iteration=1 [0/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 1472, 29892,   408,  ..., 29892,  1584,   363],
        [  967, 19475,  6593,  ...,  8093, 29899, 11249],
        [ 1006,  2218, 13326,  ...,  2355,  1304,   304],
        [29900, 29916, 29947,  ...,   353,  1870, 29936]])
      [2024-09-16 23:09:08.957524][INFO][training:1083] - iteration=1 [1/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 2354,   274,  1041,  ..., 29892, 13049,  9098],
        [ 8798,  9547, 10353,  ...,   303,  3143, 29889],
        [ 1373,  4056,  7236,  ...,  3186,   297,  5837],
        [ 1738, 29920,  7355,  ...,    13, 29871,  3776]])
      [2024-09-16 23:09:08.966648][INFO][training:1083] - iteration=1 [2/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 2020,   306,  1016,  ...,   322,   920,   372],
        [ 5921,  1749,  7306,  ..., 19252,   297,  5664],
        [  970,   770, 28547,  ...,   970,   894,  2577],
        [ 1907,   363, 14188,  ...,   756,  3646,   287]])
      [2024-09-16 23:09:08.969989][INFO][training:1083] - iteration=1 [3/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  715, 25392,  3104,  ...,   289,  5761,   616],
        [  426,    13,  9651,  ...,  9651,  1815, 22603],
        [ 7714,  1213,    13,  ...,    13, 29876,   457],
        [29889, 28663,  1230,  ...,  1546,   278,  6586]])
      [2024-09-16 23:09:08.990736][INFO][training:1083] - iteration=1 [4/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[29929,    13,    13,  ..., 10739,  4770, 11277],
        [ 4528,   304,  2367,  ...,  2501,   385,  4203],
        [  869,   319,   794,  ...,  3158, 29889,  3115],
        [  592,   260,  4125,  ...,   284,  1135, 18655]])
      [2024-09-16 23:09:08.993101][INFO][training:1083] - iteration=1 [5/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[14338, 25323,  3321,  ...,  5607,  1806,  1164],
        [  322,   278, 15352,  ...,  6462,   313,  1552],
        [25738,   714, 29889,  ..., 29915, 29879, 24842],
        [ 5122,   399, 29889,  ..., 29947,  7284,  2305]])
      [2024-09-16 23:09:09.036896][INFO][training:1083] - iteration=1 [6/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  367, 19310,  1891,  ...,  2408,   292,   263],
        [  470,  3307,  5713,  ...,   568,  2594, 19385],
        [29953, 29905,  1631,  ...,  1118,   343, 29897],
        [10261,   373,  5490,  ...,   511,   297,  1760]])
      [2024-09-16 23:09:09.039401][INFO][training:1083] - iteration=1 [7/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 1006,   326, 29901,  ..., 14834,  6694,  9595],
        [12058,  5446, 29892,  ..., 29889,  8246,  3310],
        [ 7483,   310,   278,  ...,   402,  9851,  4423],
        [ 8041,   813,   322,  ...,  3303,  3900,   393]])
    • Iteration 2:

      [2024-09-16 23:09:09.050766][INFO][training_log:661] -  iteration=       2/  635782 | consumed_samples=        1536 | consumed_tokens=     6291456 | elapsed_time_per_iteration_ms=106.8 | learning_rate=1.88744e-08 | global_batch_size=  768 | loss_scale=1.0 | grad_norm=6.363 | actual_seqlen= 4096 | number_of_skipped_iterations=  1 | number_of_nan_iterations=  0 | samples_per_second=7190.781 | tokens_per_gpu_per_second_tgs=1227226.651 | [LM]TFLOPs=5620.92 | [DS]TFLOPs=7247.49 |
      [2024-09-16 23:09:09.055864][INFO][training:1069] - Caught 3 in 'ranges_to_skip', skipping!
      [2024-09-16 23:09:09.057929][INFO][training:1082] - torch.Size([4, 4097]), len(train_data_iterator)=490723200
      [2024-09-16 23:09:09.059118][INFO][training:1083] - iteration=2 [0/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 1858,  3851, 29889,  ...,   500,    13,    13],
      [  349,  6156,  1650,  ...,  5806, 28557,  3519],
      [16554,   304,  1653,  ...,   322,  6934, 14722],
      [ 4955,   310, 10465,  ...,  1438,  3841, 29892]])
      [2024-09-16 23:09:09.061999][INFO][training:1083] - iteration=2 [1/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  363,  1302, 16453,  ...,  7967, 29891,   484],
      [  367,   766,  4752,  ...,     1, 29871, 30143],
      [29899,   855,  1503,  ...,  3786, 29892,  5100],
      [  465,  1974,   289,  ..., 21588,   533,   304]])
      [2024-09-16 23:09:09.065494][INFO][training:1083] - iteration=2 [2/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[29889,    13,  4806,  ...,  3086, 26040,  9220],
      [  293,  7207,   355,  ..., 18131,   520,  1247],
      [ 8619, 29889, 29871,  ...,   304, 10029,   266],
      [  363, 15202, 29892,  ...,   482, 17162, 19104]])
      [2024-09-16 23:09:09.069035][INFO][training:1083] - iteration=2 [3/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[25561,   411,   278,  ...,   297,  2898, 26163],
      [22574,  2607, 18134,  ...,    13,  4706,   500],
      [20190, 24820,  1623,  ...,   310,   901, 29892],
      [29892,  1951,  4486,  ...,   869,   887, 30010]])
      [2024-09-16 23:09:09.072577][INFO][training:1083] - iteration=2 [4/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[ 5371, 22417, 29892,  ...,    13,  6716,   901],
      [  353,  1565, 29936,  ..., 29878,  3567,  7196],
      [17296,   338,  1985,  ...,  3741,  9089,   422],
      [  694, 13331,   310,  ..., 21180, 29892,   607]])
      [2024-09-16 23:09:09.075789][INFO][training:1083] - iteration=2 [5/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  292,  8818,   267,  ..., 29892, 11275,  7407],
      [ 1870, 29897,    13,  ...,  2697, 29901,    13],
      [29913,   338,   263,  ..., 29892,   591,  3394],
      [ 2253,   472,  1554,  ...,   982,   304,   376]])
      [2024-09-16 23:09:09.079052][INFO][training:1083] - iteration=2 [6/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  391,  2598, 29883,  ..., 22629,   346,   440],
      [29871, 29896, 29906,  ...,   407,   583,  2833],
      [ 4262,  1836,    13,  ...,   310,   263, 10608],
      [ 1199,   411, 24770,  ...,   272,  2153, 29889]])
      [2024-09-16 23:09:09.082739][INFO][training:1083] - iteration=2 [7/8]: (torch.Size([4, 4097]))
      _tokens[:10]=tensor([[  620, 20503,   428,  ...,   297,  1009,  9443],
      [  950, 25078,   892,  ...,   408, 10636,   284],
      [ 1012,  2003,   364,  ...,  7313, 29912, 19303],
      [29906, 29892, 29945,  ...,   967, 26414,   472]])
    • Iteration 3:

      [2024-09-16 23:09:09.135651][INFO][training_log:661] - iteration= 3/ 635782 | consumed_samples= 2304 | consumed_tokens= 9437184 | elapsed_time_per_iteration_ms=84.7 | learning_rate=2.83116e-08 | global_batch_size= 768 | loss_scale=1.0 | grad_norm=6.363 | actual_seqlen= 4096 | number_of_skipped_iterations= 1 | number_of_nan_iterations= 0 | samples_per_second=9070.783 | tokens_per_gpu_per_second_tgs=1548080.271 | [LM]TFLOPs=7090.49 | [DS]TFLOPs=9142.31 |
      [2024-09-16 23:09:09.143511][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[496,   313, 29941,  ...,  1316,   408,  4857],
        [29899,  3204, 29889,  ...,  1074,   330,  2547],
        [29916, 29900, 29946,  ..., 18455, 29889,  4002],
        [26406,   338,  1641,  ...,   670,  1914,  6900]])
      [2024-09-16 23:09:09.971988][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[29945, 29900,   867,  ...,  7601, 12091,   310],
        [  975, 29871, 29896,  ...,  3573,   825,   306],
        [29906, 29900,  4638,  ..., 29227, 23145, 29892],
        [  278, 14368,   322,  ..., 14909, 29936, 25913]])
      [2024-09-16 23:09:10.843966][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[988,   306,  1033,  ...,   437,   408,  1532],
        [  450, 10317,   310,  ...,   322,   752, 13036],
        [11405,  8020, 29889,  ...,   471, 18096,   287],
        [  288,  3594, 19284,  ...,   910,   338,   385]])
      [2024-09-16 23:09:11.715513][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[322, 15151, 29946,  ..., 11648,  1497, 29889],
        [24233,   362,   467,  ...,  4513,  1353,   322],
        [ 3311, 13605, 29912,  ...,   945, 29899,  4181],
        [ 1951,   366,   508,  ...,  6589,   491,   777]])
      [2024-09-16 23:09:12.584136][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[29889, 29900,    13,  ...,  6017,   424,  1711],
        [  297,  5500,  1489,  ...,   310,  3802,  7875],
        [ 8078,  5314,   515,  ...,   373,   278,  6991],
        [13763,  6204,  6359,  ...,  4706,  2024,  1347]])
      [2024-09-16 23:09:13.450767][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[13,  4806,  3512,  ...,   278,  7824,  6438],
        [ 2294,   938,   903,  ...,  4537,  3047,   449],
        [ 1230,  4123,   767,  ...,   310,   963, 21003],
        [ 1152,  2319, 10365,  ...,   367, 14040,   363]])
      [2024-09-16 23:09:14.317517][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[27297, 29924,   801,  ..., 28947, 29892,   470],
        [12542,  5568,   703,  ...,   426,    13,  4706],
        [ 6907,   800,   322,  ..., 29892,  1661, 30304],
        [29900, 13630,   293,  ..., 26552,   363,   975]])
      [2024-09-16 23:09:15.187191][INFO][pretrain_gpt_alcf:198] - args.iteration=3: data['text'][:10]=tensor([[13, 29946, 29953,  ..., 29953, 29945, 29871],
        [ 1283, 16578,  1156,  ...,   408,  2215,   408],
        [29906,  4229,  7671,  ...,    13,  1576,  1014],
        [  526,  2898,   304,  ...,   471,  4802, 29991]])