Error with parallel training

Open · omidshy opened this issue 3 years ago • 3 comments

Bug summary

When trying to run training in parallel, I get an error at the "broadcast global variables to other tasks" step.

DeePMD-kit Version

2.1.3

TensorFlow Version

2.9.0

How did you download the software?

conda

Input Files, Running Commands, Error Log, etc.

DeePMD-kit log file: log.txt

Steps to Reproduce

CUDA_VISIBLE_DEVICES=0,1 mpirun -np 2 dp train --mpi-log=master input.json

Further Information, Files, and Links

Slurm job submission:

# Request one node with two MPI tasks (one per GPU) and six CPU cores per task.
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=6

# Match TensorFlow's thread pools to the Slurm allocation.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export TF_INTRA_OP_PARALLELISM_THREADS=$SLURM_CPUS_PER_TASK
export TF_INTER_OP_PARALLELISM_THREADS=$SLURM_NTASKS

# Launch one training process per GPU.
CUDA_VISIBLE_DEVICES=0,1 mpirun -np $SLURM_NTASKS dp train --mpi-log=master input.json

omidshy · Jul 28 '22 11:07

I can see it's an NCCL error. You can print debug messages by setting the environment variable NCCL_DEBUG=WARN.
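
For example, reusing the launch command from the report (with Open MPI, the -x flag exports a variable to every rank; with other launchers, export it in the job script before mpirun):

CUDA_VISIBLE_DEVICES=0,1 mpirun -x NCCL_DEBUG=WARN -np 2 dp train --mpi-log=master input.json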

njzjz · Jul 29 '22 17:07

Here is the log with the NCCL_DEBUG variable set to WARN.

log_2.txt

omidshy · Jul 29 '22 18:07

This error is the same as the one in #1774. I can reproduce it, but it does not appear to be related to deepmd-kit; it looks like a bug in Horovod or NCCL. The same error was also reported in https://github.com/horovod/horovod/issues/3625.
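
For reference, the step that fails is Horovod's initial broadcast of the model variables from rank 0. Below is a minimal sketch of that step in TF1-style graph mode (DeePMD-kit 2.x trains through tf.compat.v1); it is an illustration, not the actual deepmd-kit code:

import horovod.tensorflow as hvd
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

hvd.init()

# Give each rank its own GPU so NCCL sees one device per process.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

v = tf.get_variable("v", shape=[8], initializer=tf.zeros_initializer())
bcast = hvd.broadcast_global_variables(0)  # rank 0 -> all other ranks

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(bcast)  # the NCCL error surfaces in this broadcast

Running it with, e.g., mpirun -np 2 python repro.py (repro.py is just a placeholder name) should hit the same failure if the bug is indeed below deepmd-kit.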

njzjz · Jul 29 '22 19:07

See https://github.com/deepmodeling/deepmd-kit/issues/1774#issuecomment-1229124497.

njzjz · Aug 27 '22 05:08