Jiayi Yan
PR #996
supplement:
- branch `release_v1.9` of TransformerEngine, `get_cpu_offload_context()`: https://github.com/NVIDIA/TransformerEngine/blob/ba36f90d05c203787294b7e490af901d79f07d30/transformer_engine/pytorch/cpu_offload.py#L482
- branch `main` (version 1.10.0.dev0) of TransformerEngine, `get_cpu_offload_context()`: https://github.com/NVIDIA/TransformerEngine/blob/def4d1cbfd24e4bb28608d045634a817f638abb7/transformer_engine/pytorch/cpu_offload.py#L438
This bug also happens when I run [examples/inference/run_text_generation_server_345M.sh](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/inference/run_text_generation_server_345M.sh).
That's the reason, thanks for your reply. I hope it can be fixed quickly.
I have checked the source code of transformer-engine; the `1.8.0` in the version check should be changed to `1.9.0`.

```python
from transformer_engine.pytorch.cpu_offload import (
    get_cpu_offload_context as _get_cpu_offload_context,
)

def get_cpu_offload_context(
    enabled, num_layers,...
```
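For concreteness, a minimal sketch of the kind of version gate I mean is below. It is not Megatron-LM's actual wrapper code; reading the installed version via `importlib.metadata` plus `packaging` is my assumption, and only the threshold (`1.9.0` instead of `1.8.0`) is the point.

```python
# Sketch of the version gate only; this is not Megatron-LM's actual file.
# The point is that the comparison should be against 1.9.0, not 1.8.0, so
# that TE release_v1.9 is routed to the older get_cpu_offload_context()
# signature and only newer main builds take the new-signature branch.
from importlib.metadata import version as installed_version

from packaging.version import Version

# Assumes the installed distribution is discoverable as "transformer-engine".
_te_version = Version(installed_version("transformer-engine"))

# TE builds up to and including release_v1.9 expose the older signature
# (see the release_v1.9 link above); only newer main builds expose the new one.
use_new_offload_signature = _te_version > Version("1.9.0")  # was "1.8.0"
```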
> This runs training on 8 GPUs: https://github.com/NVIDIA/Megatron-LM/blob/ssm/examples/mamba/train.sh. You can extend to multi-node by passing in appropriate arguments to `torchrun` (adapted from https://pytorch.org/docs/stable/elastic/run.html#usage):
>
> ```
> torchrun
> --nnodes=$NUM_NODES...
> ```
Hm, I'd expect most systems could handle building with `MAX_JOBS=1`. I wonder if we could get more clues if you build with verbose output (`pip install -v -v .`). _Originally...
> Do you have any solution?

I got the same error. I think this bug is due to the inappropriate default `max_sequence_length` in `MockGPTLowLevelDataset`, which is used to generate the mock dataset. https://github.com/NVIDIA/Megatron-LM/blob/732a689606810c02d0dc260a163c9ebac099c044/megatron/core/datasets/gpt_dataset.py#L693-L697...
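Until that default is fixed upstream, a rough workaround sketch is below. It assumes, as the linked lines suggest, that `max_sequence_length` is a class-level default on `MockGPTLowLevelDataset`; the helper name and call site are hypothetical.

```python
# Rough workaround sketch, not the upstream fix. It assumes, per the lines
# linked above, that MockGPTLowLevelDataset keeps max_sequence_length as a
# class-level default; overriding it before the mock dataset is built makes
# the mock samples match the sequence length the model actually expects.
from megatron.core.datasets.gpt_dataset import MockGPTLowLevelDataset


def patch_mock_dataset_seq_length(seq_length: int) -> None:
    """Align the mock dataset's default max_sequence_length with --seq-length."""
    MockGPTLowLevelDataset.max_sequence_length = seq_length


# Hypothetical call site, before the datasets are constructed:
# patch_mock_dataset_seq_length(args.seq_length)
```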