Megatron-DeepSpeed
The given group does not exist pytorch
Do you know why I get this error when running pretrain_gpt_single_node.sh?
I'm setting N_GPUS=1
and get:
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 191, in _get_group_rank
raise RuntimeError("The given group does not exist")
RuntimeError: The given group does not exist
The call that leads to it is:
File "Megatron-DeepSpeed/megatron/training.py", line 400, in setup_model_and_optimizer
model = get_model(model_provider_func)
I'm using the NGC Docker image with PyTorch and Apex; DeepSpeed and the other packages were installed from your requirements.txt.
My setup is 2x RTX 3090.
I also encountered this problem. Did you solve it?
Me too. How did you solve this problem?