deepmd-kit
Error with parallel training
Bug summary
When trying to run training in parallel, I get an error at the "broadcast global variables to other tasks" step.
DeePMD-kit Version
2.1.3
TensorFlow Version
2.9.0
How did you download the software?
conda
Input Files, Running Commands, Error Log, etc.
DeePMD-kit log file: log.txt
Steps to Reproduce
CUDA_VISIBLE_DEVICES=0,1 mpirun -np 2 dp train --mpi-log=master input.json
Further Information, Files, and Links
Slurm job submission:
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=6
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export TF_INTRA_OP_PARALLELISM_THREADS=$SLURM_CPUS_PER_TASK
export TF_INTER_OP_PARALLELISM_THREADS=$SLURM_NTASKS
CUDA_VISIBLE_DEVICES=0,1 mpirun -np $SLURM_NTASKS dp train --mpi-log=master input.json
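For reference, a minimal complete batch script assembled from the directives and commands above; the shebang, job name, and environment activation lines are assumptions added for completeness, not part of the original submission:

#!/bin/bash
#SBATCH --job-name=dp-train        # assumed job name
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=6

# Assumed: activate the conda environment that provides dp
# conda activate deepmd

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export TF_INTRA_OP_PARALLELISM_THREADS=$SLURM_CPUS_PER_TASK
export TF_INTER_OP_PARALLELISM_THREADS=$SLURM_NTASKS

CUDA_VISIBLE_DEVICES=0,1 mpirun -np $SLURM_NTASKS dp train --mpi-log=master input.json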
I can see it is an NCCL error. You can print the debug messages by setting the environment variable NCCL_DEBUG=WARN.
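For example, assuming Open MPI (whose -x flag forwards an environment variable to all ranks), the debug output can be enabled like this:

export NCCL_DEBUG=WARN
CUDA_VISIBLE_DEVICES=0,1 mpirun -np 2 -x NCCL_DEBUG dp train --mpi-log=master input.json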
This error is the same as the one in #1774. I can reproduce it, but it does not appear to be related to DeePMD-kit; it looks like a bug in Horovod or NCCL. The same error was also reported in https://github.com/horovod/horovod/issues/3625.
See https://github.com/deepmodeling/deepmd-kit/issues/1774#issuecomment-1229124497.