Ryan
Also encountering this issue. An alternative might be to set the env variable NVIDIA_VISIBLE_DEVICES instead of passing through --gpus. Although it would be nice to know how escaping...
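Roughly what I mean, sketched with the Docker SDK for Python (the image name and device IDs are placeholders, and this assumes the NVIDIA container runtime is installed and configured):

```python
# Sketch: expose GPUs via the NVIDIA_VISIBLE_DEVICES env var instead of
# --gpus / device_requests. Assumes the NVIDIA container runtime is set up;
# the image name and device IDs are placeholders.
import docker

client = docker.from_env()

logs = client.containers.run(
    "nvcr.io/nvidia/pytorch:24.01-py3",             # placeholder image
    command="nvidia-smi",
    runtime="nvidia",                               # use the NVIDIA runtime
    environment={"NVIDIA_VISIBLE_DEVICES": "0,1"},  # expose only GPUs 0 and 1
    remove=True,
)
print(logs.decode())
```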
For example, if I want component-level behavior specific to local schedulers (local, docker), I could add an additional redundant parameter --local in the component to specify that...
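For illustration, a hypothetical component along those lines (the `local` flag, image name, and resource numbers are all made up, not an existing TorchX API):

```python
# Hypothetical TorchX component with a redundant `local` parameter;
# the image name and resources below are placeholders for illustration.
import torchx.specs as specs


def trainer(
    script: str,
    image: str = "my-registry/trainer:latest",
    local: bool = False,
) -> specs.AppDef:
    """Trainer component whose behavior differs for local schedulers (local, docker)."""
    # e.g. request a lighter resource footprint when running locally
    resource = (
        specs.Resource(cpu=1, gpu=0, memMB=1024)
        if local
        else specs.Resource(cpu=8, gpu=1, memMB=32768)
    )
    return specs.AppDef(
        name="trainer",
        roles=[
            specs.Role(
                name="trainer",
                image=image,
                entrypoint="python",
                args=[script],
                resource=resource,
                num_replicas=1,
            )
        ],
    )
```

Something like `torchx run -s local_docker component.py:trainer --script train.py --local True` would then pick the local branch (exact flag parsing aside).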
In a similar vein, adding all Container Insights metrics to monitorEc2Service would be very helpful.
Merged in https://github.com/NVIDIA/Megatron-LM/commit/a30a28dbe9063e8456ddc2f5ee1d26ede8589f63 Can mark as closed, thanks
@dimapihtar @ericharper The same issue occurs when trying to load the distributed checkpoint for continued training / SFT. Loading the distributed checkpoint on a single A100 works fine with gbs=1,tp=1,pp=1,mbs=1. When scaling...
Some debug logs as well, let me know if anything else could be useful to include:

```
> rank_sharing[0][1]
[(0, ShardedTensor(key='model.embedding.word_embeddings.weight', data=None, dtype=torch.float32, local_shape=(50304, 768), global_shape=(50304, 768), global_offset=(0, 0), axis_fragmentations=(1,...
```
Seem to have figured out the root cause.

### Background

When loading from a distributed checkpoint, we call `NLPModel.load_from_checkpoint(..)` [ref](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/nlp_model.py#L309). For distributed checkpoints, loading the state_dict gets [deferred until the class...
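To illustrate the deferral I mean, here's a simplified, hypothetical sketch (none of these names are NeMo's actual internals): the checkpoint location is recorded at `load_from_checkpoint` time, but the actual state_dict load only happens later, once the parallel layout is known.

```python
# Hypothetical, simplified illustration of the deferred-loading pattern
# described above; class and file names are made up, not NeMo internals.
from pathlib import Path

import torch


class LazyCheckpointModel:
    def __init__(self, model: torch.nn.Module, ckpt_dir: Path):
        self.model = model
        self.ckpt_dir = ckpt_dir
        self._loaded = False

    @classmethod
    def load_from_checkpoint(cls, model: torch.nn.Module, ckpt_dir: str) -> "LazyCheckpointModel":
        # Weights are NOT read here; only the checkpoint location is recorded.
        return cls(model, Path(ckpt_dir))

    def materialize(self) -> None:
        # The real state_dict load is deferred to this point, after the
        # parallel layout (tp/pp/sharding) is set up, which is where a
        # mismatch between the saved and current layouts can show up.
        if not self._loaded:
            state_dict = torch.load(self.ckpt_dir / "model_weights.pt", map_location="cpu")
            self.model.load_state_dict(state_dict)
            self._loaded = True
```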
Any updates on this issue?
@yaox12 Unsure if related, but I also notice similar slowdowns in GroupedLinear. A trace of the execution shows that intermittently there's a slowdown on the [torch.split call](https://github.com/NVIDIA/TransformerEngine/blob/e17fab14d0ce504627a9b773d70e41b5ba407699/transformer_engine/pytorch/module/grouped_linear.py#L94).
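For reference, roughly how I've been looking at it in isolation (a minimal torch.profiler sketch around a bare torch.split, not the full GroupedLinear forward; the tensor size and iteration count are arbitrary):

```python
# Minimal sketch: profile torch.split in isolation. The tensor size and
# split sizes are arbitrary, not taken from GroupedLinear itself.
import torch
from torch.profiler import ProfilerActivity, profile

weight = torch.randn(8 * 4096 * 4096, device="cuda")
split_sizes = [4096 * 4096] * 8

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        chunks = torch.split(weight, split_sizes)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
prof.export_chrome_trace("split_trace.json")  # inspect in chrome://tracing / Perfetto
```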
No, zombie processes are just a side effect of the container's init (PID 1) process not automatically reaping child processes when they terminate. For example, when manually starting a container, one might use...
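(Sketching what I mean with the Docker SDK for Python; `init=True` corresponds to `docker run --init`, which runs a minimal init as PID 1 so exited children get reaped. The image and command are placeholders.)

```python
# Sketch: start a container with an init process so that terminated child
# processes are reaped instead of lingering as zombies. The image and
# command here are placeholders.
import docker

client = docker.from_env()

container = client.containers.run(
    "ubuntu:22.04",            # placeholder image
    command="sleep infinity",  # placeholder long-running command
    init=True,                 # equivalent to `docker run --init`
    detach=True,
)
print(container.short_id)
```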