Jiuqiang Tang
**Describe the bug** A segmentation fault occurs when `MegatronCommOverlapCallback` is initialized with `tp_comm_overlap=True`. This specific configuration is adopted from https://github.com/NVIDIA/NeMo/blob/19fadb67b09ba94c55094d34df119d6f9c565068/nemo/lightning/pytorch/callbacks/megatron_comm_overlap.py#L85. **Steps/Code to reproduce bug** Please list *minimal* steps or code...
**Describe the bug** When using Megatron's FSDP with a Hugging Face checkpoint for continual pretraining, the program crashes because the global shape metadata for an N-D flattened tensor is missing. **Steps/Code...
**Describe the bug** The training step time remains constant for Gemma3 HFAutoModel and MockDataModule, regardless of the `global_batch_size` and `micro_batch_size` values set in MockDataModule. **Steps/Code to reproduce bug** The `gemma3_automodel_test.py`:...