accelerate
DDP multi-node multi-GPU inconsistent params
### System Info
- Accelerate version: 0.19.0
- OS: Ubuntu 20.04.4 LTS (x86_64)
- Python version: 3.11.3
- Numpy: 1.24.3
- PyTorch version: 2.0.1
- CUDA used to build PyTorch: 11.8
- Is CUDA available: True
- CUDA runtime version: 11.8.89
- CUDA_MODULE_LOADING set to: LAZY
- GPU models and configuration:
- GPU 0: NVIDIA A100-SXM4-40GB
- GPU 1: NVIDIA A100-SXM4-40GB
- GPU 2: NVIDIA A100-SXM4-40GB
- GPU 3: NVIDIA A100-SXM4-40GB
- Huggingface_hub version: 0.14.1
- Nvidia driver version: 510.47.03
(I am using SLURM.)
- Accelerate's config: (not sure which file you mean)
accelerate config
In which compute environment are you running?
Please select a choice using the arrow or number keys, and selecting with enter
➔ This machine
AWS (Amazon SageMaker)
There was no option for the nodes I had already assigned.
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Tasks
- [X] One of the scripts in the `examples/` folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [ ] My own task or dataset (give details below)
### Reproduction
- Based on the SLURM-accelerate template
- 2 nodes
- 4 GPUs per node
- I pre-train the wav2vec demo on LibriSpeech
- I started receiving error messages (tried 3+ times, restarting the pretraining each time)
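For reference, the multi-node launch on SLURM looked roughly like the following sketch (the script name, port, and environment-variable plumbing here are assumptions, not my exact job file):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=1

# Assumption: use the first allocated node as the rendezvous host;
# 29500 is an arbitrary free port.
MAIN_HOST=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun accelerate launch \
    --multi_gpu \
    --num_machines 2 \
    --num_processes 8 \
    --machine_rank "$SLURM_NODEID" \
    --main_process_ip "$MAIN_HOST" \
    --main_process_port 29500 \
    run_wav2vec2_pretraining_no_trainer.py
```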
Also reported in a PyTorch issue (cc @alexeib @rohan-varma @mrshenli):
```
RuntimeError: DDP expects same model across all ranks, but Rank 1 has 237 params, while rank 2 has inconsistent 0 params.
```
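The error says rank 1 sees 237 parameters while rank 2 sees 0, i.e. the model is not identical on every rank before DDP wraps it. A minimal sketch of the kind of pre-wrap check one could run to pin down which rank diverges (here `counts_per_rank` stands in for values gathered via `torch.distributed.all_gather_object`; the function name is hypothetical):

```python
def check_param_counts(counts_per_rank):
    """Raise if ranks disagree on the number of model parameters.

    In a real run, each rank would compute len(list(model.parameters()))
    and the values would be gathered with torch.distributed.all_gather_object
    before calling this check.
    """
    baseline = counts_per_rank[0]
    for rank, count in enumerate(counts_per_rank):
        if count != baseline:
            raise RuntimeError(
                f"Rank {rank} has {count} params, but rank 0 has {baseline}: "
                "the model was likely built differently on some ranks."
            )
    return baseline
```

A count of 0 on one rank, as in the traceback above, would suggest the model never got constructed (or its parameters were freed) on that rank before DDP wrapping.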
### Expected behavior
I expect to be able to pretrain a model using multiple nodes and multiple GPUs.