Ryan
Also encountering this issue. An alternative might be to set the env variable NVIDIA_VISIBLE_DEVICES instead of passing through --gpus. Although it would be nice to know how escaping...
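Roughly what I mean, sketched with the Docker SDK for Python (the image name and device IDs are placeholders, and this assumes the NVIDIA container runtime is installed and configured):

```python
# Sketch: expose GPUs via the NVIDIA_VISIBLE_DEVICES env var instead of
# --gpus / device_requests. Assumes the NVIDIA container runtime is set up;
# the image name and device IDs are placeholders.
import docker

client = docker.from_env()

logs = client.containers.run(
    "nvcr.io/nvidia/pytorch:24.01-py3",             # placeholder image
    command="nvidia-smi",
    runtime="nvidia",                               # use the NVIDIA runtime
    environment={"NVIDIA_VISIBLE_DEVICES": "0,1"},  # expose only GPUs 0 and 1
    remove=True,
)
print(logs.decode())
```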
For example, if I want component-level behavior specific to local schedulers (local, docker), I could add an additional redundant parameter --local in the component to specify that...
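For illustration, a hypothetical component along those lines (the `local` flag, image name, and resource numbers are all made up, not an existing TorchX API):

```python
# Hypothetical TorchX component with a redundant `local` parameter;
# the image name and resources below are placeholders for illustration.
import torchx.specs as specs


def trainer(
    script: str,
    image: str = "my-registry/trainer:latest",
    local: bool = False,
) -> specs.AppDef:
    """Trainer component whose behavior differs for local schedulers (local, docker)."""
    # e.g. request a lighter resource footprint when running locally
    resource = (
        specs.Resource(cpu=1, gpu=0, memMB=1024)
        if local
        else specs.Resource(cpu=8, gpu=1, memMB=32768)
    )
    return specs.AppDef(
        name="trainer",
        roles=[
            specs.Role(
                name="trainer",
                image=image,
                entrypoint="python",
                args=[script],
                resource=resource,
                num_replicas=1,
            )
        ],
    )
```

Something like `torchx run -s local_docker component.py:trainer --script train.py --local True` would then pick the local branch (exact flag parsing aside).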
In a similar vein, adding all Container Insights metrics to monitorEc2Service would be very helpful.
Merged in https://github.com/NVIDIA/Megatron-LM/commit/a30a28dbe9063e8456ddc2f5ee1d26ede8589f63 Can mark as closed, thanks
@dimapihtar @ericharper The same issue occurs when trying to load the distributed checkpoint for continued training / SFT. Loading the distributed checkpoint on a single A100 works fine with gbs=1,tp=1,pp=1,mbs=1. When scaling...
Some debug logs as well, let me know if anything else could be useful to include:

```
> rank_sharing[0][1]
[(0, ShardedTensor(key='model.embedding.word_embeddings.weight', data=None, dtype=torch.float32, local_shape=(50304, 768), global_shape=(50304, 768), global_offset=(0, 0), axis_fragmentations=(1,...
```
Seem to have figured out the root cause.

### Background

When loading from a distributed checkpoint, we call `NLPModel.load_from_checkpoint(..)` [ref](https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/nlp_model.py#L309). For distributed checkpoints, loading the state_dict gets [deferred until the class...
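To illustrate the deferral I mean, here's a simplified, hypothetical sketch (none of these names are NeMo's actual internals): the checkpoint location is recorded at `load_from_checkpoint` time, but the actual state_dict load only happens later, once the parallel layout is known.

```python
# Hypothetical, simplified illustration of the deferred-loading pattern
# described above; class and file names are made up, not NeMo internals.
from pathlib import Path

import torch


class LazyCheckpointModel:
    def __init__(self, model: torch.nn.Module, ckpt_dir: Path):
        self.model = model
        self.ckpt_dir = ckpt_dir
        self._loaded = False

    @classmethod
    def load_from_checkpoint(cls, model: torch.nn.Module, ckpt_dir: str) -> "LazyCheckpointModel":
        # Weights are NOT read here; only the checkpoint location is recorded.
        return cls(model, Path(ckpt_dir))

    def materialize(self) -> None:
        # The real state_dict load is deferred to this point, after the
        # parallel layout (tp/pp/sharding) is set up, which is where a
        # mismatch between the saved and current layouts can show up.
        if not self._loaded:
            state_dict = torch.load(self.ckpt_dir / "model_weights.pt", map_location="cpu")
            self.model.load_state_dict(state_dict)
            self._loaded = True
```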
Any updates on this issue?
@yaox12 Unsure if related, but I also notice similar slowdowns in GroupedLinear. A trace of the execution shows that intermittently there's a slowdown on the [torch.split call](https://github.com/NVIDIA/TransformerEngine/blob/e17fab14d0ce504627a9b773d70e41b5ba407699/transformer_engine/pytorch/module/grouped_linear.py#L94).
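For reference, roughly how I've been looking at it in isolation (a minimal torch.profiler sketch around a bare torch.split, not the full GroupedLinear forward; the tensor size and iteration count are arbitrary):

```python
# Minimal sketch: profile torch.split in isolation. The tensor size and
# split sizes are arbitrary, not taken from GroupedLinear itself.
import torch
from torch.profiler import ProfilerActivity, profile

weight = torch.randn(8 * 4096 * 4096, device="cuda")
split_sizes = [4096 * 4096] * 8

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(100):
        chunks = torch.split(weight, split_sizes)
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
prof.export_chrome_trace("split_trace.json")  # inspect in chrome://tracing / Perfetto
```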
No, zombie processes are just a side effect of the container's init (PID 1) process not automatically reaping child processes when they terminate. For example, when manually starting a container, one might use...
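(Sketching what I mean with the Docker SDK for Python; `init=True` corresponds to `docker run --init`, which runs a minimal init as PID 1 so exited children get reaped. The image and command are placeholders.)

```python
# Sketch: start a container with an init process so that terminated child
# processes are reaped instead of lingering as zombies. The image and
# command here are placeholders.
import docker

client = docker.from_env()

container = client.containers.run(
    "ubuntu:22.04",            # placeholder image
    command="sleep infinity",  # placeholder long-running command
    init=True,                 # equivalent to `docker run --init`
    detach=True,
)
print(container.short_id)
```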