SkanderBS2024
Any updates on the conversion? The [Nemo](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/checkpoints/convert_mlm.html) converter does not support the distcp format; it apparently expects the legacy format ([Code](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py#L126)).
Useful: [DCP to Torch](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.format_utils.dcp_to_torch_save)
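For reference, a minimal sketch of that conversion step (the paths below are placeholders; this only consolidates the distcp shards into a single `torch.save` file, it does not by itself produce a `.nemo` checkpoint):

```python
# Consolidate a torch.distributed.checkpoint (distcp) directory into a single
# torch.save file that legacy-format tools can load.
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

dcp_checkpoint_dir = "/path/to/distcp_checkpoint_dir"    # placeholder: directory with .distcp shards
torch_save_path = "/path/to/consolidated_checkpoint.pt"  # placeholder: output for torch.load

dcp_to_torch_save(dcp_checkpoint_dir, torch_save_path)
```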
@zixianwang2022 The library is available only on Linux. If you try to install it on Windows or macOS you'll encounter this error.
@zixianwang2022 Try running the commands manually, without the Dockerfile, after pulling the NVIDIA image and starting the Docker container.
@tbsxxxH I've temporarily hard-coded the paths here: https://github.com/NVIDIA/Megatron-LM/blob/c873429cbaa43257d4d4fc01df2a7a50453b7984/megatron/training/tokenizer/tokenizer.py#L38-L40
Remove the assertions, declare a variable for each path, and pass them as parameters to `_GPT2BPETokenizer`, along the lines of the sketch below.
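A rough, untested sketch of that workaround inside `build_tokenizer()` in `megatron/training/tokenizer/tokenizer.py` (the file paths are placeholders you would replace with your own local vocab/merge files):

```python
# Placeholder paths to local GPT-2 BPE files (adjust to your setup).
vocab_file = "/path/to/gpt2-vocab.json"
merge_file = "/path/to/gpt2-merges.txt"

# Instead of:
#     assert args.vocab_file is not None
#     assert args.merge_file is not None
#     tokenizer = _GPT2BPETokenizer(args.vocab_file, args.merge_file)
# pass the hard-coded paths directly to the tokenizer class defined in the same file:
tokenizer = _GPT2BPETokenizer(vocab_file, merge_file)
```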
I meant multi-node*, so multiple nodes (`--nnodes`) and the GPU count per node (`--nproc-per-node`). Thank you!
@deepakn94 Is it possible to set up the GPUs dynamically during training? (For example, I have a total of 180 GPUs, 90 of them are fixed for the whole...
I'll take a look, thank you!
Hello @JRD971000, yes, I worked with the nvcr.io/nvidia/nemo:24.07 container and everything worked fine. Thank you for your response.