SkanderBS2024
Any updates on the conversion? The [Nemo](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/checkpoints/convert_mlm.html) converter does not support the distcp format; it apparently expects the legacy format ([Code](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_lm_ckpt_to_nemo.py#L126)).
Useful: [DCP to Torch](https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.format_utils.dcp_to_torch_save)
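For reference, a minimal sketch of that conversion step (the paths below are placeholders; this only consolidates the distcp shards into a single `torch.save` file, it does not by itself produce a `.nemo` checkpoint):

```python
# Consolidate a torch.distributed.checkpoint (distcp) directory into a single
# torch.save file that legacy-format tools can load.
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

dcp_checkpoint_dir = "/path/to/distcp_checkpoint_dir"    # placeholder: directory with .distcp shards
torch_save_path = "/path/to/consolidated_checkpoint.pt"  # placeholder: output for torch.load

dcp_to_torch_save(dcp_checkpoint_dir, torch_save_path)
```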
@zixianwang2022 The library is available only on Linux. If you try to install it on Windows or macOS you'll encounter this error.
@zixianwang2022 Try running the commands manually, without the Dockerfile, after pulling the NVIDIA image and starting the Docker container.
@tbsxxxH I've temporarily hard-coded the paths here: https://github.com/NVIDIA/Megatron-LM/blob/c873429cbaa43257d4d4fc01df2a7a50453b7984/megatron/training/tokenizer/tokenizer.py#L38-L40
Remove the assertions, declare a variable for each path, and pass them as parameters to `_GPT2BPETokenizer`, along the lines of the sketch below.
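A rough, untested sketch of that workaround inside `build_tokenizer()` in `megatron/training/tokenizer/tokenizer.py` (the file paths are placeholders you would replace with your own local vocab/merge files):

```python
# Placeholder paths to local GPT-2 BPE files (adjust to your setup).
vocab_file = "/path/to/gpt2-vocab.json"
merge_file = "/path/to/gpt2-merges.txt"

# Instead of:
#     assert args.vocab_file is not None
#     assert args.merge_file is not None
#     tokenizer = _GPT2BPETokenizer(args.vocab_file, args.merge_file)
# pass the hard-coded paths directly to the tokenizer class defined in the same file:
tokenizer = _GPT2BPETokenizer(vocab_file, merge_file)
```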
I meant multi-node*, so multiple nodes (`--nnodes`) and the GPU count per node (`--nproc-per-node`). Thank you!
@deepakn94 Is it possible to set up the GPUs dynamically during training? (For example, I have a total of 180 GPUs, 90 of them are fixed for the whole...
I'll take a look, thank you!
Hello @JRD971000, yes, I worked with the nvcr.io/nvidia/nemo:24.07 container and everything worked fine. Thank you for your response.