
DDP multi-node multi-GPU inconsistent params


System Info

- Accelerate version: 0.19.0
- OS: Ubuntu 20.04.4 LTS (x86_64)
- Python version: 3.11.3
- NumPy version: 1.24.3
- PyTorch version: 2.0.1


- CUDA used to build PyTorch: 11.8
- Is CUDA available: True
- CUDA runtime version: 11.8.89
- CUDA_MODULE_LOADING set to: LAZY
- GPU models and configuration: 
- GPU 0: NVIDIA A100-SXM4-40GB
- GPU 1: NVIDIA A100-SXM4-40GB
- GPU 2: NVIDIA A100-SXM4-40GB
- GPU 3: NVIDIA A100-SXM4-40GB
- Huggingface_hub version: 0.14.1

- Nvidia driver version: 510.47.03


(I am running under SLURM.)
- Accelerate's config: (not sure which file you mean)

accelerate config
In which compute environment are you running?
Please select a choice using the arrow or number keys, and selecting with enter
 ➔  This machine
    AWS (Amazon SageMaker)

There was no option to describe the nodes I had already been assigned.
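The file `accelerate config` writes is usually `~/.cache/huggingface/accelerate/default_config.yaml`. For reference, the multi-node config I would expect looks roughly like this (a sketch; exact field names vary by Accelerate version, and the IP and port are placeholders), which can also be passed explicitly with `accelerate launch --config_file`:

```yaml
# Sketch of a multi-node config for 2 nodes x 4 GPUs; values are placeholders.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: 0            # must differ per node: 0 on the main node, 1 on the second
main_process_ip: 10.0.0.1  # placeholder: reachable IP of the main node
main_process_port: 29500
num_machines: 2
num_processes: 8           # total processes = 2 nodes x 4 GPUs
mixed_precision: 'no'
use_cpu: false
```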

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Based on the Accelerate SLURM template (a sketch of the launch script is below), with:
  • 2 nodes
  • 4 GPUs per node
  2. I pre-train the wav2vec2 demo on LibriSpeech.
  3. I started receiving error messages (tried 3+ times, restarting the pretraining each time).
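For context, the launch script is along these lines (a sketch following the Accelerate SLURM template; the port is a placeholder, and the script name is the wav2vec2 pretraining example from the transformers repo):

```bash
#!/bin/bash
#SBATCH --job-name=wav2vec2-pretrain
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:4

# Sketch: one accelerate-launch task per node; Accelerate then spawns one
# process per GPU on each node. Port and paths are placeholders.
GPUS_PER_NODE=4
MAIN_PROCESS_IP=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun accelerate launch \
  --multi_gpu \
  --num_machines "$SLURM_NNODES" \
  --num_processes $((SLURM_NNODES * GPUS_PER_NODE)) \
  --rdzv_backend c10d \
  --main_process_ip "$MAIN_PROCESS_IP" \
  --main_process_port 29500 \
  run_wav2vec2_pretraining_no_trainer.py
```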

@alexeib @rohan-varma @mrshenli (this was also reported in a PyTorch issue):

"RuntimeError: DDP expects same model across all ranks, but Rank 1 has 237 params, while rank 2 has inconsistent 0 params."

Expected behavior

I expect to be able to pretrain a model using multiple nodes and multiple GPUs.

flckv · May 27 '23 15:05