Jiayi Yan
PR #996
supplement:
- branch `release_v1.9` of TransformerEngine, `get_cpu_offload_context()`: https://github.com/NVIDIA/TransformerEngine/blob/ba36f90d05c203787294b7e490af901d79f07d30/transformer_engine/pytorch/cpu_offload.py#L482
- branch `main` (version 1.10.0.dev0) of TransformerEngine, `get_cpu_offload_context()`: https://github.com/NVIDIA/TransformerEngine/blob/def4d1cbfd24e4bb28608d045634a817f638abb7/transformer_engine/pytorch/cpu_offload.py#L438
This bug also happens when I run [examples/inference/run_text_generation_server_345M.sh](https://github.com/NVIDIA/Megatron-LM/blob/main/examples/inference/run_text_generation_server_345M.sh).
That's the reason, thanks for your reply. I hope it can be fixed quickly.
I have checked the source code of transformer-engine; the `1.8.0` in the version check should be changed to `1.9.0`.

```python
from transformer_engine.pytorch.cpu_offload import (
    get_cpu_offload_context as _get_cpu_offload_context,
)

def get_cpu_offload_context(
    enabled, num_layers,...
```
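For concreteness, a minimal sketch of the kind of version gate I mean is below. It is not Megatron-LM's actual wrapper code; reading the installed version via `importlib.metadata` plus `packaging` is my assumption, and only the threshold (`1.9.0` instead of `1.8.0`) is the point.

```python
# Sketch of the version gate only; this is not Megatron-LM's actual file.
# The point is that the comparison should be against 1.9.0, not 1.8.0, so
# that TE release_v1.9 is routed to the older get_cpu_offload_context()
# signature and only newer main builds take the new-signature branch.
from importlib.metadata import version as installed_version

from packaging.version import Version

# Assumes the installed distribution is discoverable as "transformer-engine".
_te_version = Version(installed_version("transformer-engine"))

# TE builds up to and including release_v1.9 expose the older signature
# (see the release_v1.9 link above); only newer main builds expose the new one.
use_new_offload_signature = _te_version > Version("1.9.0")  # was "1.8.0"
```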
> This runs training on 8 GPUs: https://github.com/NVIDIA/Megatron-LM/blob/ssm/examples/mamba/train.sh. You can extend to multi-node by passing in appropriate arguments to `torchrun` (adapted from https://pytorch.org/docs/stable/elastic/run.html#usage):
>
> ```
> torchrun
> --nnodes=$NUM_NODES...
> ```
Hm, I'd expect most systems could handle building with `MAX_JOBS=1`. I wonder if we could get more clues if you build with verbose output (`pip install -v -v .`). _Originally...
> Do you have any solution?

I got the same error. I think this bug is due to the inappropriate default `max_sequence_length` in `MockGPTLowLevelDataset`, which is used to generate the mock dataset. https://github.com/NVIDIA/Megatron-LM/blob/732a689606810c02d0dc260a163c9ebac099c044/megatron/core/datasets/gpt_dataset.py#L693-L697...
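Until that default is fixed upstream, a rough workaround sketch is below. It assumes, as the linked lines suggest, that `max_sequence_length` is a class-level default on `MockGPTLowLevelDataset`; the helper name and call site are hypothetical.

```python
# Rough workaround sketch, not the upstream fix. It assumes, per the lines
# linked above, that MockGPTLowLevelDataset keeps max_sequence_length as a
# class-level default; overriding it before the mock dataset is built makes
# the mock samples match the sequence length the model actually expects.
from megatron.core.datasets.gpt_dataset import MockGPTLowLevelDataset


def patch_mock_dataset_seq_length(seq_length: int) -> None:
    """Align the mock dataset's default max_sequence_length with --seq-length."""
    MockGPTLowLevelDataset.max_sequence_length = seq_length


# Hypothetical call site, before the datasets are constructed:
# patch_mock_dataset_seq_length(args.seq_length)
```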