Jiuqiang Tang issues

Results: 3 issues of Jiuqiang Tang

Using MegatronCommOverlapCallback(tp_comm_overlap=True) causes segfault.

**Describe the bug** A segmentation fault occurs when MegatronCommOverlapCallback is initialized with tp_comm_overlap=True. This specific configuration is adopted from https://github.com/NVIDIA/NeMo/blob/19fadb67b09ba94c55094d34df119d6f9c565068/nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py#L85. **Steps/Code to reproduce bug** Please list *minimal* steps or code...

bug

stale

Megatron FSDP doesn't work with HF checkpoint

**Describe the bug** When using Megatron's FSDP with a Hugging Face checkpoint for continual pretraining, the program crashes because of missing global shape metadata for an N-D flattened tensor. **Steps/Code...

bug

Adjusting "global_batch_size" and "micro_batch_size" has no impact on how long each training step takes when using HFAutoModel.

**Describe the bug** The training step time remains constant for Gemma3 HFAutoModel and MockDataModule, regardless of the "global_batch_size" and "micro_batch_size" values set in MockDataModule. **Steps/Code to reproduce bug** The "gemma3_automodel_test.py":...

bug
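
For context on why a constant step time is surprising here: in Megatron-style training, the global batch is conventionally split into micro-batches, so the number of forward/backward passes per optimizer step grows with "global_batch_size" and shrinks with "micro_batch_size". The sketch below shows only that arithmetic. The helper name `num_microbatches` and the `data_parallel_size` parameter are hypothetical, introduced for illustration, and whether the HFAutoModel path follows this convention is an assumption that the report does not confirm.

```python
def num_microbatches(global_batch_size: int,
                     micro_batch_size: int,
                     data_parallel_size: int = 1) -> int:
    """Return how many micro-batches each data-parallel rank processes per step.

    Hypothetical helper for illustration; it is not part of NeMo or Megatron.
    Assumes the common convention:
        global_batch_size = micro_batch_size * data_parallel_size * num_microbatches
    """
    samples_per_pass = micro_batch_size * data_parallel_size
    if samples_per_pass <= 0:
        raise ValueError("micro_batch_size and data_parallel_size must be positive")
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be divisible by "
            "micro_batch_size * data_parallel_size"
        )
    return global_batch_size // samples_per_pass


if __name__ == "__main__":
    # Under this convention, doubling the global batch doubles the number of
    # forward/backward passes per step, so step time would be expected to change.
    for gbs in (8, 16, 32):
        print(gbs, num_microbatches(gbs, micro_batch_size=2, data_parallel_size=2))
```

If the measured step time stays flat while this count changes, one plausible reading is that the batch-size settings are not reaching the data loader or training loop. The report would need to confirm that.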