Megatron-DeepSpeed

The given group does not exist pytorch

Open germanjke opened this issue 2 years ago • 2 comments

Do you know why I get this error with pretrain_gpt_single_node.sh? I set N_GPUS=1 and got:

File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 191, in _get_group_rank
    raise RuntimeError("The given group does not exist")
RuntimeError: The given group does not exist

from

Megatron-DeepSpeed/megatron/training.py", line 400, in setup_model_and_optimizer
    model = get_model(model_provider_func)
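For context on what the message means: as far as the traceback shows, `_get_group_rank` raises this error when the process group handle it receives is not in the registry of groups known to the current process. Below is a minimal sketch of that kind of lookup. It is a simplified model, not PyTorch's actual code; `register_group`, `get_group_rank`, `_group_ranks`, and the group names are hypothetical and used only for illustration.

```python
# Simplified, hypothetical model of a process-group registry.
# This is NOT PyTorch's implementation; all names are illustrative.
from typing import Dict, List

# Maps a group handle to {global_rank: rank_within_group}.
_group_ranks: Dict[str, Dict[int, int]] = {}


def register_group(name: str, global_ranks: List[int]) -> str:
    """Record a group and the position of each global rank inside it."""
    _group_ranks[name] = {g: i for i, g in enumerate(global_ranks)}
    return name


def get_group_rank(group: str, global_rank: int) -> int:
    """Translate a global rank into its rank inside `group`."""
    if group not in _group_ranks:
        # The handle was never registered in this process.
        raise RuntimeError("The given group does not exist")
    return _group_ranks[group][global_rank]


# A group that was registered resolves normally.
register_group("data_parallel", [0])
print(get_group_rank("data_parallel", 0))

# A group that was never registered triggers the error.
try:
    get_group_rank("tensor_parallel", 0)
except RuntimeError as err:
    print(err)
```

If this model is a fair approximation, the error suggests that a group passed during model setup was never created (or was already destroyed) in the calling process, so checking that the launched process count matches the parallelism settings in the script may be a reasonable first step.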

I'm using an NGC Docker container with PyTorch and Apex, plus DeepSpeed and the other packages installed from your requirements.txt.

My setup is 2x 3090 GPUs.

germanjke • Apr 25 '23 11:04

I also encountered this problem. Did you solve it?

LYF915 • May 25 '23 02:05

Me too. How did you solve this problem?

zql022 • Oct 24 '23 08:10