accelerate
DDP multi-node multi-GPU inconsistent params
### System Info
- Accelerate version: 0.19.0
- OS: Ubuntu 20.04.4 LTS (x86_64)
- Python version: 3.11.3
- Numpy: 1.24.3
- PyTorch version: 2.0.1
- CUDA used to build PyTorch: 11.8
- Is CUDA available: True
- CUDA runtime version: 11.8.89
- CUDA_MODULE_LOADING set to: LAZY
- GPU models and configuration:
- GPU 0: NVIDIA A100-SXM4-40GB
- GPU 1: NVIDIA A100-SXM4-40GB
- GPU 2: NVIDIA A100-SXM4-40GB
- GPU 3: NVIDIA A100-SXM4-40GB
- Huggingface_hub version: 0.14.1
- Nvidia driver version: 510.47.03
(I am using SLURM.)
- Accelerate's config: (not sure which file you mean)
accelerate config
In which compute environment are you running?
Please select a choice using the arrow or number keys, and selecting with enter
➔ This machine
AWS (Amazon SageMaker)
There was no option for the nodes I had already assigned.
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Tasks
- [X] One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
### Reproduction
- Based on the SLURM-accelerate template
- 2 nodes
- 4 GPUs per node
- I pre-train the wav2vec demo on LibriSpeech
- I started receiving error messages (tried 3+ times, restarting the pretraining each time)
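For reference, the multi-node launch on SLURM looked roughly like the following sketch (the script name, port, and environment-variable plumbing here are assumptions, not my exact job file):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=1

# Assumption: use the first allocated node as the rendezvous host;
# 29500 is an arbitrary free port.
MAIN_HOST=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun accelerate launch \
    --multi_gpu \
    --num_machines 2 \
    --num_processes 8 \
    --machine_rank "$SLURM_NODEID" \
    --main_process_ip "$MAIN_HOST" \
    --main_process_port 29500 \
    run_wav2vec2_pretraining_no_trainer.py
```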
Also reported in a PyTorch issue (cc @alexeib @rohan-varma @mrshenli):
```
RuntimeError: DDP expects same model across all ranks, but Rank 1 has 237 params, while rank 2 has inconsistent 0 params.
```
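The error says rank 1 sees 237 parameters while rank 2 sees 0, i.e. the model is not identical on every rank before DDP wraps it. A minimal sketch of the kind of pre-wrap check one could run to pin down which rank diverges (here `counts_per_rank` stands in for values gathered via `torch.distributed.all_gather_object`; the function name is hypothetical):

```python
def check_param_counts(counts_per_rank):
    """Raise if ranks disagree on the number of model parameters.

    In a real run, each rank would compute len(list(model.parameters()))
    and the values would be gathered with torch.distributed.all_gather_object
    before calling this check.
    """
    baseline = counts_per_rank[0]
    for rank, count in enumerate(counts_per_rank):
        if count != baseline:
            raise RuntimeError(
                f"Rank {rank} has {count} params, but rank 0 has {baseline}: "
                "the model was likely built differently on some ranks."
            )
    return baseline
```

A count of 0 on one rank, as in the traceback above, would suggest the model never got constructed (or its parameters were freed) on that rank before DDP wrapping.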
### Expected behavior
I expect to be able to pretrain a model using multiple nodes and multiple GPUs.