deepmd-kit

[BUG] Parallel training using horovodrun not working

Open AnuragKr opened this issue 3 years ago • 14 comments

Bug summary

Parallel training falls back to serial execution mode every time. I have installed all the packages correctly, as described in the documentation.

DeePMD-kit Version

2.1.1

TensorFlow Version

2.9.1

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Command -- CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 2 dp train --mpi-log=workers input.json

Output --
[1,0]:WARNING:tensorflow:From /home/anurag/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,0]:Instructions for updating:
[1,0]:non-resource variables are not supported in the long term
[1,1]:WARNING:tensorflow:From /home/anurag/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,1]:Instructions for updating:
[1,1]:non-resource variables are not supported in the long term
[1,0]:Switch to serial execution due to lack of horovod module.
[1,1]:Switch to serial execution due to lack of horovod module.
[1,0]:DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,1]:DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,1]:DEEPMD INFO training data with min nbor dist: 0.8854385688525511
[1,1]:DEEPMD INFO training data with max nbor size: [38, 72]
[1,1]:DEEPMD INFO [DeePMD-kit ASCII art banner]
[1,1]:DEEPMD INFO Please read and cite:
[1,1]:DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,1]:DEEPMD INFO installed to: /tmp/pip-req-build-pjks4pue/_skbuild/linux-x86_64-3.8/cmake-install
[1,1]:DEEPMD INFO source : v2.1.1
[1,1]:DEEPMD INFO source brach: master
[1,1]:DEEPMD INFO source commit: https://github.com/deepmodeling/deepmd-kit/commit/c4f0cec0e20bab38579a3a29f1106cbee4a8ecf9
[1,1]:DEEPMD INFO source commit at: 2022-04-16 11:11:16 +0800
[1,1]:DEEPMD INFO build float prec: double
[1,1]:DEEPMD INFO build with tf inc: /tmp/pip-build-env-dfkmanfm/normal/lib/python3.8/site-packages/tensorflow/include
[1,1]:DEEPMD INFO build with tf lib:
[1,1]:DEEPMD INFO ---Summary of the training---------------------------------------
[1,1]:DEEPMD INFO running on: hp-HP-Z8-G4-Workstation
[1,1]:DEEPMD INFO computing device: gpu:0
[1,1]:DEEPMD INFO CUDA_VISIBLE_DEVICES: 0,1
[1,1]:DEEPMD INFO Count of visible GPU: 2
[1,1]:DEEPMD INFO num_intra_threads: 6
[1,1]:DEEPMD INFO num_inter_threads: 5
[1,1]:DEEPMD INFO -----------------------------------------------------------------

Steps to Reproduce

  1. Go to the Dir - deepmd-kit/examples/water/se_e2_a
  2. Run command - CUDA_VISIBLE_DEVICES=0,1 horovodrun -np 2 dp train --mpi-log=workers input.json

GPU Configuration

Mon Jun 20 14:14:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:15:00.0 Off |                  N/A |
| 30%   34C    P8    19W / 250W |     10MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:2D:00.0 Off |                  N/A |
| 30%   40C    P8    17W / 250W |    192MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1047      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      1536      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1047      G   /usr/lib/xorg/Xorg                 35MiB |
|    1   N/A  N/A      1536      G   /usr/lib/xorg/Xorg                113MiB |
|    1   N/A  N/A      1666      G   /usr/bin/gnome-shell               11MiB |
|    1   N/A  N/A      2009      G   ...mviewer/tv_bin/TeamViewer       12MiB |
+-----------------------------------------------------------------------------+

Further Information, Files, and Links

No response

AnuragKr avatar Jun 20 '22 08:06 AnuragKr

[1,0]:Switch to serial execution due to lack of horovod module.

Can you check import horovod.tensorflow?

Your horovod may not be built against tensorflow. Please refer horovod's documentation.
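One way to run this check is from the same Python interpreter that dp uses, so that a failure shows its real cause. The sketch below is only illustrative: the helper name check_import is made up for this example, and it relies on nothing beyond the standard library.

```python
import importlib
import sys


def check_import(module_name):
    """Try to import a module; return (ok, detail) instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # ImportError, or a failure while loading a compiled extension
        return False, f"{type(exc).__name__}: {exc}"
    # On success, report where the module was loaded from.
    return True, getattr(module, "__file__", None) or "<no file>"


if __name__ == "__main__":
    # The interpreter path shows which environment is actually in use.
    print("interpreter:", sys.executable)
    for name in ("tensorflow", "horovod", "horovod.tensorflow"):
        ok, detail = check_import(name)
        print(f"{name}: {'OK' if ok else 'FAILED'} ({detail})")
```

If horovod.tensorflow is reported as FAILED here but imports fine in another shell, the two shells are probably using different environments.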

njzjz avatar Jun 20 '22 17:06 njzjz

I checked import horovod.tensorflow and it works. I followed all the steps mentioned in the documentation, but I am still getting the same error. I am doing all of this in a virtual environment; I hope that is not an issue.

horovodrun --check-build output --

Horovod v0.24.3:

Available Frameworks:
    [X] TensorFlow
    [X] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [ ] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [ ] Gloo

AnuragKr avatar Jun 21 '22 05:06 AnuragKr

That's weird. Could you add a raise statement after the following line? It can help to debug what the error is here.

https://github.com/deepmodeling/deepmd-kit/blob/c4f0cec0e20bab38579a3a29f1106cbee4a8ecf9/deepmd/train/run_options.py#L183
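To illustrate why a bare raise helps, here is a minimal sketch of the general pattern. It is not the code in run_options.py; the function load_optional and its debug flag are hypothetical names used only for this example.

```python
import importlib


def load_optional(module_name, debug=False):
    """Import an optional module, returning None when the import fails.

    With debug=True the original exception is re-raised, so the full
    traceback shows why the import failed instead of a generic fallback.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        if debug:
            raise  # a bare raise re-raises the active exception unchanged
        print(f"could not import {module_name}; continuing without it")
        return None
```

Without the raise, any ImportError (including one caused by a missing shared library inside the package) is reduced to the same fallback message.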

njzjz avatar Jun 21 '22 06:06 njzjz

As I mentioned above, I was doing this in a virtual environment. I have now installed horovod again globally, and it works in the virtual environment as well. But now I am getting a new error.

Command -- CUDA_VISIBLE_DEVICES=0,1 mpirun -hostfile hostfile -np 2 -x NCCL_DEBUG=INFO dp train --mpi-log=workers input.json

Output --
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Bootstrap : Using enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Failed to open libibverbs.so[.1]
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO NET/Socket : Using [0]enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Using network Socket
NCCL version 2.12.12+cuda11.7
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Bootstrap : Using enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

hp-HP-Z8-G4-Workstation:198126:198130 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:963 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Failed to open libibverbs.so[.1]
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO NET/Socket : Using [0]enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Using network Socket

hp-HP-Z8-G4-Workstation:198126:198130 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:963 -> 1

hp-HP-Z8-G4-Workstation:198127:198131 [1] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:963 -> 1

AnuragKr avatar Jun 21 '22 06:06 AnuragKr

It looks like the NVIDIA driver is not available in your virtual environment?
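A rough way to see which driver library a process in this environment picks up is to load libcuda through ctypes and call cuInit. This is only a sketch under assumptions: the function name probe_cuda_driver is made up here, the library is assumed to be named libcuda.so.1, and the numeric codes are taken from the CUresult enum of the CUDA driver API.

```python
import ctypes

CUDA_SUCCESS = 0
CUDA_ERROR_STUB_LIBRARY = 34  # CUresult value meaning "the loaded driver is a stub library"


def probe_cuda_driver(libname="libcuda.so.1"):
    """Load the CUDA driver library and call cuInit(0).

    Returns the CUresult code as an int, or None if the library (or the
    cuInit symbol) cannot be loaded at all.
    """
    try:
        lib = ctypes.CDLL(libname)
        cu_init = lib.cuInit
    except (OSError, AttributeError):
        return None
    cu_init.argtypes = [ctypes.c_uint]
    cu_init.restype = ctypes.c_int
    return int(cu_init(0))


if __name__ == "__main__":
    code = probe_cuda_driver()
    if code is None:
        print("libcuda.so.1 could not be loaded")
    elif code == CUDA_ERROR_STUB_LIBRARY:
        print("the loaded libcuda is a stub library")
    elif code == CUDA_SUCCESS:
        print("cuInit succeeded; a real driver library was loaded")
    else:
        print("cuInit returned error code", code)
```

Running it under the same launcher and environment variables as the failing job matters, because the dynamic loader's search path decides which libcuda is found.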

njzjz avatar Jun 21 '22 19:06 njzjz

The NVIDIA driver is installed. This error comes up whenever I try to run deepmd-kit with more than 1 process.

AnuragKr avatar Jun 23 '22 11:06 AnuragKr

This error may come from NCCL, see https://github.com/NVIDIA/nccl/issues/658. Does the solution mentioned in this issue work for you?
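Independently of that issue, one thing that may be worth ruling out is a CUDA toolkit stubs directory on the library search path, since the toolkit ships a stub libcuda.so under lib64/stubs that is meant for linking only. The sketch below assumes the search path comes from LD_LIBRARY_PATH; the helper find_stub_dirs is a hypothetical name.

```python
import os


def find_stub_dirs(ld_library_path):
    """Return the entries of a colon-separated path that have a 'stubs' component."""
    hits = []
    for entry in ld_library_path.split(":"):
        if not entry:
            continue
        # Match only a whole path component named "stubs", not substrings.
        if "stubs" in entry.rstrip("/").split("/"):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    for directory in find_stub_dirs(os.environ.get("LD_LIBRARY_PATH", "")):
        print("stub directory on LD_LIBRARY_PATH:", directory)
```

An empty result does not prove the stub is not being loaded, because libraries can also be found through rpath entries or the system linker cache.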

njzjz avatar Jun 25 '22 00:06 njzjz

I was unable to understand the solution given by benmenadue. Could you help me work out what changes I have to make?

System -- NCCL 2.12.12, CUDA 11.7, workstation with 2 GPUs

Steps I have done --

  1. anurag1@hp-HP-Z8-G4-Workstation:~/.local/nccl$ objdump -p lib/libnccl.so.2.12.12 | grep NEEDED
     NEEDED libpthread.so.0
     NEEDED librt.so.1
     NEEDED libdl.so.2
     NEEDED libstdc++.so.6
     NEEDED libm.so.6
     NEEDED libgcc_s.so.1
     NEEDED libc.so.6
     NEEDED ld-linux-x86-64.so.2
     It doesn't require libcudart.
  2. I ran nccl-test as mentioned here (nccl-test), and it worked. But when I tried to run nccl-test with CUDART as mentioned in the above link, I got -- ./build/all_gather_perf: error while loading shared libraries: libcudart.so.11.0: cannot open shared object file: No such file or directory. To resolve this I gave the path to libcudart.so.11.0 explicitly, but it still did not work, so I needed to copy that file to lib64.
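The NEEDED check in step 1 can also be scripted, which makes it easy to repeat for several libraries. This is a small sketch; needed_libraries and links_cudart are illustrative names, and the parser only assumes the "NEEDED <name>" line format shown in the objdump output above.

```python
import subprocess
import sys


def needed_libraries(objdump_output):
    """Extract library names from the NEEDED entries of `objdump -p` output."""
    libs = []
    for line in objdump_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "NEEDED":
            libs.append(fields[1])
    return libs


def links_cudart(objdump_output):
    """True if any NEEDED entry is a dynamically linked CUDA runtime."""
    return any(name.startswith("libcudart") for name in needed_libraries(objdump_output))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_needed.py lib/libnccl.so.2.12.12
    result = subprocess.run(["objdump", "-p", sys.argv[1]],
                            capture_output=True, text=True, check=True)
    print(needed_libraries(result.stdout))
    print("links libcudart dynamically:", links_cudart(result.stdout))
```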

AnuragKr avatar Jun 26 '22 07:06 AnuragKr

@njzjz I tried the link you mentioned, and I was able to run that nccl-test via cudart:

(tensorflow) anurag1@hp-HP-Z8-G4-Workstation:/nccl-tests$ NCCL_DEBUG=WARN LD_LIBRARY_PATH=~/.local/nccl/lib/ ./src/build-shared/all_gather_perf
nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
Using devices
Rank 0 Pid 775733 on hp-HP-Z8-G4-Workstation device 0 [0x15] NVIDIA GeForce RTX 2080 Ti
NCCL version 2.12.12+cuda11.7

                                           out-of-place                       in-place          
   size         count      type     time   algbw   busbw  error     time   algbw   busbw  error
    (B)    (elements)               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
33554432       8388608     float    125.7  266.96    0.00  0e+00     0.90  37470.05    0.00  0e+00

Out of bounds values : 0 OK
Avg bus bandwidth : 0

But the error still persists.

Command -- CUDA_VISIBLE_DEVICES=0,1 mpirun -hostfile hostfile -np 2 -x NCCL_DEBUG=INFO dp train --mpi-log=workers input.json

Error Stack Trace --

hp-HP-Z8-G4-Workstation:779166:779172 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:963 -> 1
Traceback (most recent call last):
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
    return fn(*args)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
    return self._call_tf_sessionrun(options, feed_dict, fetch_list,
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
    return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
     [[{{node HorovodBroadcast_layer_0_type_1_bias_Adam_1_0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/main.py", line 473, in main
    train_dp(**dict_args)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 106, in train
    _do_work(jdata, run_opt, is_compress)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
    model.train(train_data, valid_data)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 443, in train
    self._init_session()
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 435, in _init_session
    run_sess(self.sess, bcast_op)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/utils/sess.py", line 21, in run_sess
    return sess.run(*args, **kwargs)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 967, in run
    result = self._run(None, fetches, feed_dict, options_ptr,
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1190, in _run
    results = self._do_run(handle, final_targets, final_fetches,
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1370, in _do_run
    return self._do_call(_run_fn, feeds, fetches, targets, options,
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1396, in _do_call
    raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'HorovodBroadcast_layer_0_type_1_bias_Adam_1_0' defined at (most recent call last):
    File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in <module>
      sys.exit(main())
    File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/main.py", line 473, in main
      train_dp(**dict_args)
    File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 106, in train
      _do_work(jdata, run_opt, is_compress)
    File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
      model.train(train_data, valid_data)
    File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 443, in train
      self._init_session()
    File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 430, in _init_session
      bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
    File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 339, in broadcast_global_variables
      return broadcast_variables(_global_variables(), root_rank)
    File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
      return broadcast_group(variables, root_rank, process_set)
    File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
      return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
    File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
      return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
    File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
      return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
    File "<string>", line 515, in horovod_broadcast
Node: 'HorovodBroadcast_layer_0_type_1_bias_Adam_1_0'
ncclCommInitRank failed: unhandled cuda error
     [[{{node HorovodBroadcast_layer_0_type_1_bias_Adam_1_0}}]]

Original stack trace for 'HorovodBroadcast_layer_0_type_1_bias_Adam_1_0':
  File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/main.py", line 473, in main
    train_dp(**dict_args)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 106, in train
    _do_work(jdata, run_opt, is_compress)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
    model.train(train_data, valid_data)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 443, in train
    self._init_session()
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 430, in _init_session
    bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 339, in broadcast_global_variables
    return broadcast_variables(_global_variables(), root_rank)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
    return broadcast_group(variables, root_rank, process_set)
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
    return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
    return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
    return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
  File "<string>", line 515, in horovod_broadcast
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 797, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3754, in _create_op_internal
    ret = Operation(
  File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 2133, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

hp-HP-Z8-G4-Workstation:779167:779171 [1] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
hp-HP-Z8-G4-Workstation:779167:779171 [1] NCCL INFO init.cc:1084 -> 4

hp-HP-Z8-G4-Workstation:779166:779172 [0] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:1084 -> 4

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[52430,1],1] Exit code: 1

AnuragKr avatar Jun 29 '22 09:06 AnuragKr

Did you compile NCCL by yourself?

njzjz avatar Jun 29 '22 21:06 njzjz

Yes

AnuragKr avatar Jun 30 '22 16:06 AnuragKr

I suggest you try our conda package to see whether the error comes from the compilation or runtime environments.

conda create -n deepmd horovod nccl cudatoolkit=11.6 -c https://conda.deepmodeling.com

In https://github.com/NVIDIA/nccl/issues/658, sclarkson suggested removing nccl/src/enhcompat.cc. You could give that a try.

njzjz avatar Jul 01 '22 02:07 njzjz

@njzjz I tried using conda, but the error is still the same.

Output --
[0] DEEPMD rank:0 INFO built training
[0] DEEPMD rank:0 INFO initialize model from scratch
[0] DEEPMD rank:0 INFO broadcast global variables to other tasks
[1] DEEPMD rank:1 INFO built training
[1] DEEPMD rank:1 INFO receive global variables from task#0
[1] Traceback (most recent call last):
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
[1]     return fn(*args)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
[1]     return self._call_tf_sessionrun(options, feed_dict, fetch_list,
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
[1]     return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[1] tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
[1]      [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
[1]
[1]
[1] During handling of the above exception, another exception occurred:
[1]
[1] Traceback (most recent call last):
[1]   File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[1]     sys.exit(main())
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[1]     train_dp(**dict_args)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[1]     _do_work(jdata, run_opt, is_compress)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[1]     model.train(train_data, valid_data)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[1]     self._init_session()
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 445, in _init_session
[1]     run_sess(self.sess, bcast_op)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/sess.py", line 21, in run_sess
[1]     return sess.run(*args, **kwargs)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 967, in run
[1]     result = self._run(None, fetches, feed_dict, options_ptr,
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1190, in _run
[1]     results = self._do_run(handle, final_targets, final_fetches,
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1370, in _do_run
[1]     return self._do_call(_run_fn, feeds, fetches, targets, options,
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1396, in _do_call
[1]     raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
[1] tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
[1]
[1] Detected at node 'HorovodBroadcast_filter_type_0_matrix_3_0_0' defined at (most recent call last):
[1]     File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[1]       sys.exit(main())
[1]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[1]       train_dp(**dict_args)
[1]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[1]       _do_work(jdata, run_opt, is_compress)
[1]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[1]       model.train(train_data, valid_data)
[1]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[1]       self._init_session()
[1]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[1]       bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[1]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[1]       return broadcast_variables(_global_variables(), root_rank)
[1]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[1]       return broadcast_group(variables, root_rank, process_set)
[1]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[1]       return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[1]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[1]       return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[1]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[1]       return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[1]     File "<string>", line 515, in horovod_broadcast
[1] Node: 'HorovodBroadcast_filter_type_0_matrix_3_0_0'
[1] ncclCommInitRank failed: unhandled cuda error
[1]      [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
[1]
[1] Original stack trace for 'HorovodBroadcast_filter_type_0_matrix_3_0_0':
[1]   File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[1]     sys.exit(main())
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[1]     train_dp(**dict_args)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[1]     _do_work(jdata, run_opt, is_compress)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[1]     model.train(train_data, valid_data)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[1]     self._init_session()
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[1]     bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[1]     return broadcast_variables(_global_variables(), root_rank)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[1]     return broadcast_group(variables, root_rank, process_set)
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[1]     return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[1]     return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[1]     return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[1]   File "<string>", line 515, in horovod_broadcast
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 797, in _apply_op_helper
[1]     op = g._create_op_internal(op_type_name, inputs, dtypes=None,
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 3754, in _create_op_internal
[1]     ret = Operation(
[1]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2133, in __init__
[1]     self._traceback = tf_stack.extract_stack_for_node(self._c_op)
[1]
[0] Traceback (most recent call last):
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
[0]     return fn(*args)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
[0]     return self._call_tf_sessionrun(options, feed_dict, fetch_list,
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
[0]     return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[0] tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
[0]      [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
[0]
[0]
[0] During handling of the above exception, another exception occurred:
[0]
[0] Traceback (most recent call last):
[0]   File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[0]     sys.exit(main())
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[0]     train_dp(**dict_args)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[0]     _do_work(jdata, run_opt, is_compress)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[0]     model.train(train_data, valid_data)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[0]     self._init_session()
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 445, in _init_session
[0]     run_sess(self.sess, bcast_op)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/sess.py", line 21, in run_sess
[0]     return sess.run(*args, **kwargs)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 967, in run
[0]     result = self._run(None, fetches, feed_dict, options_ptr,
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1190, in _run
[0]     results = self._do_run(handle, final_targets, final_fetches,
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1370, in _do_run
[0]     return self._do_call(_run_fn, feeds, fetches, targets, options,
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1396, in _do_call
[0]     raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
[0] tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
[0]
[0] Detected at node 'HorovodBroadcast_filter_type_0_matrix_3_0_0' defined at (most recent call last):
[0]     File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[0]       sys.exit(main())
[0]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[0]       train_dp(**dict_args)
[0]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[0]       _do_work(jdata, run_opt, is_compress)
[0]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[0]       model.train(train_data, valid_data)
[0]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[0]       self._init_session()
[0]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[0]       bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[0]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[0]       return broadcast_variables(_global_variables(), root_rank)
[0]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[0]       return broadcast_group(variables, root_rank, process_set)
[0]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[0]       return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[0]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[0]       return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[0]     File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[0]       return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[0]     File "<string>", line 515, in horovod_broadcast
[0] Node: 'HorovodBroadcast_filter_type_0_matrix_3_0_0'
[0] ncclCommInitRank failed: unhandled cuda error
[0]      [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
[0]
[0] Original stack trace for 'HorovodBroadcast_filter_type_0_matrix_3_0_0':
[0]   File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[0]     sys.exit(main())
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[0]     train_dp(**dict_args)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[0]     _do_work(jdata, run_opt, is_compress)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[0]     model.train(train_data, valid_data)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[0]     self._init_session()
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[0]     bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[0]     return broadcast_variables(_global_variables(), root_rank)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[0]     return broadcast_group(variables, root_rank, process_set)
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[0]     return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[0]     return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[0]     return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[0]   File "<string>", line 515, in horovod_broadcast
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 797, in _apply_op_helper
[0]     op = g._create_op_internal(op_type_name, inputs, dtypes=None,
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 3754, in _create_op_internal
[0]     ret = Operation(
[0]   File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2133, in __init__
[0]     self._traceback = tf_stack.extract_stack_for_node(self._c_op)

Regarding NCCL: there is no src folder and no enhcompat.cc in my installation, as I have installed version 2.12.12.

I think I need to switch to a different system: either something is wrong with this system, or the CUDA installation is corrupt.

AnuragKr avatar Jul 03 '22 06:07 AnuragKr

I have the same problem.

Training with 1 GPU is fine. Training with 2 GPUs with horovodrun or mpirun results in this error:

[1] return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, [1] tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error [1] [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]][1]

I did a clean reinstallation of Ubuntu 22.04 and installed only the deepmd 2.1.3 CUDA 11.6 conda environment, without any other packages, so I do not think this is a package conflict on my side.

Lewis-YL avatar Jul 04 '22 19:07 Lewis-YL
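Both symptoms reported in this thread (the fallback to serial execution "due to lack of horovod module" and the ncclCommInitRank failure) depend on how Horovod was built in the active environment. The following is a minimal diagnostic sketch, not part of DeePMD-kit: the function name horovod_build_report is hypothetical, and it assumes only that Horovod's TensorFlow bindings expose nccl_built() and mpi_built(). It does not initialize Horovod or touch the GPUs.

```python
import importlib.util


def horovod_build_report():
    """Describe whether Horovod's TensorFlow bindings are usable here.

    Hypothetical helper for illustration; it only inspects the install.
    """
    report = {
        "horovod_found": False,
        "tensorflow_bindings": False,
        "nccl_built": None,
        "mpi_built": None,
    }
    # Step 1: is the horovod package visible to this interpreter at all?
    if importlib.util.find_spec("horovod") is None:
        return report
    report["horovod_found"] = True

    # Step 2: can the compiled TensorFlow extension be loaded?
    try:
        import horovod.tensorflow as hvd
    except Exception as exc:  # ImportError or a shared-library load failure
        report["import_error"] = repr(exc)
        return report
    report["tensorflow_bindings"] = True

    # Step 3: which communication backends were compiled in?
    report["nccl_built"] = bool(hvd.nccl_built())
    report["mpi_built"] = bool(hvd.mpi_built())
    return report


if __name__ == "__main__":
    print(horovod_build_report())
```

Run it with the same Python interpreter that the dp command uses. If horovod_found or tensorflow_bindings is False, the serial fallback is the expected outcome; if both are True and nccl_built is True, the NCCL error is more likely to come from the NCCL or CUDA libraries themselves.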

https://github.com/horovod/horovod/issues/3625#issuecomment-1228884495 could resolve this issue temporarily. The original error should be tracked in the upstream repository.

For conda users: a new NCCL package has been uploaded to our conda channel.

njzjz avatar Aug 27 '22 05:08 njzjz

@njzjz Sorry for coming back to this issue after such a long time. Could you please tell me in which file I need to set CUDARTLIB="cuda"? I installed from source, but I couldn't find any Makefile that contains this line. For the conda version: could you please provide the link to the conda channel where the NCCL package was uploaded, or are you referring to this one: conda install -c conda-forge nccl

AnuragKr avatar Sep 22 '22 16:09 AnuragKr

@AnuragKr The variable can be assigned on the make command line: make CUDARTLIB="cuda".

The conda channel is https://conda.deepmodeling.com

njzjz avatar Sep 22 '22 19:09 njzjz

@njzjz Thanks for the prompt response. When I tried to run it, I got the following error (screenshot: make_error). I think I am missing some steps: make requires a Makefile as input, but I don't have a Makefile in the deepmd directory. Please let me know how to make the above change.

For the conda version: the above link redirects to the installation page of the official website. Where can I download the NCCL package from, or will this command work: conda install -c conda-forge nccl

AnuragKr avatar Sep 23 '22 03:09 AnuragKr

The Makefile is the one in the NCCL source tree, i.e. https://github.com/NVIDIA/nccl/blob/master/Makefile

conda: conda install nccl -c https://conda.deepmodeling.com

njzjz avatar Sep 23 '22 08:09 njzjz
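After rebuilding NCCL or installing the conda package, it can help to confirm which NCCL library the environment actually loads. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the library file name libnccl.so.2 is assumed, and the version decoding follows the NCCL_VERSION macro in nccl.h as I understand it (four digits up to 2.8, five digits afterwards). It calls ncclGetVersion through ctypes and returns None if the library cannot be loaded.

```python
import ctypes


def decode_nccl_version(code):
    """Split NCCL's integer version code into (major, minor, patch).

    Assumes the NCCL_VERSION encoding: major*1000 + minor*100 + patch up to
    2.8, and major*10000 + minor*100 + patch for later releases.
    """
    if code < 10000:
        return code // 1000, (code % 1000) // 100, code % 100
    return code // 10000, (code % 10000) // 100, code % 100


def loaded_nccl_version(libname="libnccl.so.2"):
    """Return the version of the NCCL library found by the dynamic loader.

    The default file name is an assumption; returns None if the library or
    the ncclGetVersion symbol is unavailable, or if the call fails.
    """
    try:
        lib = ctypes.CDLL(libname)
        get_version = lib.ncclGetVersion
    except (OSError, AttributeError):
        return None
    version = ctypes.c_int(0)
    status = get_version(ctypes.byref(version))
    if status != 0:  # ncclSuccess is 0
        return None
    return decode_nccl_version(version.value)


if __name__ == "__main__":
    print(loaded_nccl_version())
```

Under the assumed encoding, the 2.12.12 release mentioned earlier in this thread corresponds to the code 21212. Note that this reports the library the loader resolves for a plain Python process, which may differ from the one Horovod was linked against.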