From NVIDIA Megatron-LM for visibility #18

RaymondLi0 · 2023-01-24T20:01:13Z

No description provided.

RaymondLi0 changed the base branch from multi-query-attention to before-merge June 20, 2023 20:12

RaymondLi0 changed the base branch from before-merge to multi-query-attention June 20, 2023 20:12

deepakn94 and others added 28 commits July 15, 2025 12:26

Merge branch 'ananthsub-remove-extra-barrier' into 'main'

26869fe

Remove extra barrier in checkpoint flow See merge request ADLR/megatron-lm!3626

ADLR/megatron-lm!3625 - Fix error when TE is not installed

76203e7

Merge branch 'fix_te_not_installed' into 'main'

e6c510f

Fix error when TE is not installed See merge request ADLR/megatron-lm!3625

ADLR/megatron-lm!3500 - Adding support for Spike No More embedding in…

479a42e

…itializations and associated weight decay skipping.

Merge branch 'jstjohn/spike-no-more' into 'main'

ee74aa6

Adding support for Spike No More embedding initializations and associated weight decay skipping. See merge request ADLR/megatron-lm!3500

ADLR/megatron-lm!3543 - MiMo video VLM train example

ead789f

Merge branch 'yash/mimo_video_llava_mr' into 'main'

786f562

MiMo video VLM train example See merge request ADLR/megatron-lm!3543

ADLR/megatron-lm!3632 - ci: Retry on free(): invalid pointer

3b173e5

Merge branch 'ko3n1g/ci/retry-on-invalid-pointer' into 'main'

ec35b41

ci: Retry on `free(): invalid pointer` See merge request ADLR/megatron-lm!3632

tests(hotfix): mimo_vlm_pretrain_convergence_tp1_pp1_cp1_dp8

f8d3e2e

Signed-off-by: oliver könig <okoenig@nvidia.com>

ADLR/megatron-lm!3475 - Add Dynamic Backend Inference Tests

3c20092

Co-authored-by: Keshav Santhanam <ksanthanam@nvidia.com> Co-authored-by: William Dykas <wdykas@cs-oci-ord-login-02.cm.cluster>

Merge branch 'dynamic-inference-tests' into 'main'

a84756a

Add Dynamic Backend Inference Tests See merge request ADLR/megatron-lm!3475

ADLR/megatron-lm!3394 - fix(distckpt, moe): Fix distckpt optimizer st…

9a45638

…ate loading with PP>1 to ensure bit-wise match after saving and loading.

Merge branch 'fix_dist_ckpt_optim_load_bug' into 'main'

eb5ff48

fix(distckpt, moe): Fix distckpt optimizer state loading with PP>1 to ensure bit-wise match after saving and loading. See merge request ADLR/megatron-lm!3394

ADLR/megatron-lm!3605 - tests: Fix segfaults (maybe?)

46b35f5

Merge branch 'ko3n1g/tests/segfaults' into 'main'

f6492fe

tests: Fix segfaults (maybe?) See merge request ADLR/megatron-lm!3605

ADLR/megatron-lm!3603 - Fix mrope with context parallel

28778d1

Co-authored-by: liu-zichen <liuzichen@dbis.nankai.edu.cn>

Merge branch 'lit/fix_mope_with_cp' into 'main'

015542d

Fix mrope with context parallel See merge request ADLR/megatron-lm!3603

ADLR/megatron-lm!3645 - Bugfix: last layer N in backward should set p…

f9e6b94

…rev_bwd_hidden_state_inputgrad for layer N-1

Merge branch 'helenn-cudagraphs-backward-offby1' into 'main'

ee61cf4

Bugfix: last layer N in backward should set prev_bwd_hidden_state_inputgrad for layer N-1 See merge request ADLR/megatron-lm!3645

ADLR/megatron-lm!3467 - fix(mla): Fix an issue with seq-packing + mla

8e44748

Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com>

Merge branch 'yuya/mla_seq_packing_fix' into 'main'

53ef953

fix(mla): Fix an issue with seq-packing + mla See merge request ADLR/megatron-lm!3467

ADLR/megatron-lm!3391 - ModelCommProcessGroup integration into model …

9faeb2f

…interface Co-authored-by: Mcore Bot <mcore-bot@nvidia.com> Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: root <root@eos0502.eos.clusters.nvidia.com>

Merge branch 'pmannan/model_interface_pg' into 'main'

26adc2d

ModelCommProcessGroup integration into model interface See merge request ADLR/megatron-lm!3391

ADLR/megatron-lm!2800 - fix(mla): use mscale_all_dim for softmax_fact…

62f0a97

…or calculation

Merge branch 'fix_softmax_factor_cal' into 'main'

e96a358

fix(mla): use mscale_all_dim for softmax_factor calculation See merge request ADLR/megatron-lm!2800

ADLR/megatron-lm!3627 - Use ruff linter

565d9ad

Merge branch 'maanug/use-ruff-lint' into 'main'

f36e170

Use ruff linter See merge request ADLR/megatron-lm!3627

jaredcasper and others added 30 commits August 23, 2025 13:18

Merge branch 'helenn-delete-calibration' into 'main'

bebc0e4

Remove FP8 calibration script See merge request ADLR/megatron-lm!3861

ADLR/megatron-lm!3828 - Fix NEMO unittest where the weight is provide…

ed1eaa9

…d outside of the linear layer Co-authored-by: Chenjie Luo <chenjiel@nvidia.com>

Merge branch 'chenjiel/fix_nemo' into 'main'

312f300

Fix NEMO unittest where the weight is provided outside of the linear layer See merge request ADLR/megatron-lm!3828

ADLR/megatron-lm!3864 - add wandb_entity

c40a446

Merge branch 'entity' into 'main'

f6a675a

add wandb_entity See merge request ADLR/megatron-lm!3864

ADLR/megatron-lm!3532 - Implement new optimizer checkpoint formats fo…

3d19693

…r DistOpt Co-authored-by: Slawek Kierat <skierat@nvidia.com> Co-authored-by: oliver könig <okoenig@nvidia.com> Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>

Merge branch 'mblaz/dp_zero_model_space' into 'main'

3d784cb

Implement new optimizer checkpoint formats for DistOpt See merge request ADLR/megatron-lm!3532

ADLR/megatron-lm!3876 - build: Bump packaging

1c29678

Merge branch 'ko3n1g/ci/packaging' into 'main'

f364164

build: Bump packaging See merge request ADLR/megatron-lm!3876

ci(hotfix): Increase n_nondeterminism_attemps

c7fd91a

Signed-off-by: oliver könig <okoenig@nvidia.com>

chore: Version bump

4840669

ci(hotfix): Restart on zmq error

f8f6e9b

Signed-off-by: oliver könig <okoenig@nvidia.com>

ci(hotfix): Increase non-determinism attempts

188435a

Signed-off-by: oliver könig <okoenig@nvidia.com>

ADLR/megatron-lm!3887 - chore: Upgrade dependencies (2025-08-25)

66c12ce

Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>

Merge branch 'ci-bot/build/upgrade-dependencies-2025-08-25' into 'main'

77eaa9a

chore: Upgrade dependencies (2025-08-25) See merge request ADLR/megatron-lm!3887

ADLR/megatron-lm!3875 - ci: Auto-publish megatron-fsdp

875ad2a

Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>

Merge branch 'ko3n1g/ci/push-megatron-fsdp' into 'main'

dcf7d36

ci: Auto-publish megatron-fsdp See merge request ADLR/megatron-lm!3875

ADLR/megatron-lm!3646 - [1/4] Merge Megatron-RL into LM

a6c6250

Co-authored-by: Mcore Bot <mcore-bot@nvidia.com> Co-authored-by: Jon Barker <jbarker@cw-pdx-cs-001-vscode-02.cm.cluster> Co-authored-by: Robert Kirby <rkirby@cw-dfw-cs-001-vscode-01.cm.cluster>

Merge branch 'tdene/push-to-upstream-mr1' into 'main'

16c0d28

[1/4] Merge Megatron-RL into LM See merge request ADLR/megatron-lm!3646

ADLR/megatron-lm!3884 - Fix unsetting NCCL_COLLNET_ENABLE in initiali…

d6526b1

…ze_model_parallel

Merge branch 'chtruong/fix-unset-nccl-collnet' into 'main'

8db4323

Fix unsetting NCCL_COLLNET_ENABLE in initialize_model_parallel See merge request ADLR/megatron-lm!3884

ADLR/megatron-lm!3074 - feat(MoE): Support Expert Parallel A2A Overla…

d7bf5aa

…pping - (03) Support EP A2A overlap for interleaved PP and MTP Co-authored-by: Pingtian Li <pingtianl@nvidia.com>

Merge branch 'hongbinl/1f1b_overlap_new' into 'main'

4b30ec5

feat(MoE): Support Expert Parallel A2A Overlapping - (03) Support EP A2A overlap for interleaved PP and MTP See merge request ADLR/megatron-lm!3074

ADLR/megatron-lm!3782 - Move cuda graph capture to core

799cee0

Merge branch 'robinz/cudagraph_core' into 'main'

b7a6f90

Move cuda graph capture to core See merge request ADLR/megatron-lm!3782

ADLR/megatron-lm!3764 - tests: Auto-validate weekly tests

37ee3d1

Co-authored-by: Mcore Bot <mcore-bot@nvidia.com>

Merge branch 'ko3n1g/tests/thresholds-weekly' into 'main'

d6301fb

tests: Auto-validate weekly tests See merge request ADLR/megatron-lm!3764

ci: No integration tests on merge-trains

5b2cb28

Signed-off-by: oliver könig <okoenig@nvidia.com>

ci: Allow interrupt on main

7b8bbf2

Signed-off-by: oliver könig <okoenig@nvidia.com>

ci(hotfix): Non-determinism only on EXIT_CODE=0

8efa2a0

Signed-off-by: oliver könig <okoenig@nvidia.com>
