SkanderBS2024

10 comments by SkanderBS2024

Any updates on the conversion? The [Nemo](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/checkpoints/convert_mlm.html) converter does not support the distcp format; it apparently expects the legacy checkpoint format ([Code](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py#L126)).

Useful: [DCP to Torch](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.format_utils.dcp_to_torch_save)
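
In case it helps, a minimal sketch of calling that utility to consolidate a distcp checkpoint directory into a single `torch.save` file; the two paths are placeholders, not actual checkpoint locations:

```python
# Minimal sketch: convert a torch.distributed.checkpoint (DCP / distcp)
# directory into a single torch.save file.
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

dcp_checkpoint_dir = "checkpoints/iter_0001000"   # directory holding the .distcp shards (placeholder)
torch_save_path = "checkpoints/iter_0001000.pt"   # consolidated output checkpoint (placeholder)

dcp_to_torch_save(dcp_checkpoint_dir, torch_save_path)
```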

@zixianwang2022 the library is only available on Linux. If you try to install it on Windows or macOS you'll encounter this error.

@zixianwang2022 try running the commands manually, without the Dockerfile, after pulling the NVIDIA image and starting the Docker container.

@tbsxxxH I've temporarily hard-coded the paths here: https://github.com/NVIDIA/Megatron-LM/blob/c873429cbaa43257d4d4fc01df2a7a50453b7984/megatron/training/tokenizer/tokenizer.py#L38-L40

Remove the assertions, declare a variable for each path, and pass them as parameters to `_GPT2BPETokenizer`, as in the sketch below.
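
A rough sketch of what that temporary workaround looks like inside `build_tokenizer()` in `megatron/training/tokenizer/tokenizer.py` (so `_GPT2BPETokenizer` is already in scope there); the two file paths are placeholders, not real locations:

```python
# Inside build_tokenizer(), replace the argument assertions with
# hard-coded paths and pass them straight to the tokenizer class.

# assert args.vocab_file is not None     # removed
# assert args.merge_file is not None     # removed
vocab_file = "/path/to/gpt2-vocab.json"  # hard-coded vocab path (placeholder)
merge_file = "/path/to/gpt2-merges.txt"  # hard-coded merges path (placeholder)
tokenizer = _GPT2BPETokenizer(vocab_file, merge_file)
```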

I meant multi-node*, so the number of nodes (`--nnodes`) and the GPU count per node (`--nproc-per-node`). Thank you!

@deepakn94 is it possible to set up the GPUs dynamically during training? (For example, I have a total of 180 GPUs, 90 of them are fixed for the whole...

I'll take a look, thank you!

Hello @JRD971000, yep, I worked with the nvcr.io/nvidia/nemo:24.07 container and everything worked fine. Thank you for your response.