Open
AnuragKr opened this issue 3 years ago · 14 comments
Bug summary
Parallel training always falls back to serial execution mode. I have installed all the packages as described in the documentation.
Output --
[1,0]:WARNING:tensorflow:From /home/anurag/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,0]:Instructions for updating:
[1,0]:non-resource variables are not supported in the long term
[1,1]:WARNING:tensorflow:From /home/anurag/.local/lib/python3.8/site-packages/tensorflow/python/compat/v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
[1,1]:Instructions for updating:
[1,1]:non-resource variables are not supported in the long term
[1,0]:Switch to serial execution due to lack of horovod module.
[1,1]:Switch to serial execution due to lack of horovod module.
[1,0]:DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,1]:DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[1,1]:DEEPMD INFO training data with min nbor dist: 0.8854385688525511
[1,1]:DEEPMD INFO training data with max nbor size: [38, 72]
[1,1]:DEEPMD INFO [DeePMD-kit ASCII-art banner]
[1,1]:DEEPMD INFO Please read and cite:
[1,1]:DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[1,1]:DEEPMD INFO installed to: /tmp/pip-req-build-pjks4pue/_skbuild/linux-x86_64-3.8/cmake-install
[1,1]:DEEPMD INFO source : v2.1.1
[1,1]:DEEPMD INFO source brach: master
[1,1]:DEEPMD INFO source commit: https://github.com/deepmodeling/deepmd-kit/commit/c4f0cec0e20bab38579a3a29f1106cbee4a8ecf9
[1,1]:DEEPMD INFO source commit at: 2022-04-16 11:11:16 +0800
[1,1]:DEEPMD INFO build float prec: double
[1,1]:DEEPMD INFO build with tf inc: /tmp/pip-build-env-dfkmanfm/normal/lib/python3.8/site-packages/tensorflow/include
[1,1]:DEEPMD INFO build with tf lib:
[1,1]:DEEPMD INFO ---Summary of the training---------------------------------------
[1,1]:DEEPMD INFO running on: hp-HP-Z8-G4-Workstation
[1,1]:DEEPMD INFO computing device: gpu:0
[1,1]:DEEPMD INFO CUDA_VISIBLE_DEVICES: 0,1
[1,1]:DEEPMD INFO Count of visible GPU: 2
[1,1]:DEEPMD INFO num_intra_threads: 6
[1,1]:DEEPMD INFO num_inter_threads: 5
[1,1]:DEEPMD INFO -----------------------------------------------------------------
I checked that import horovod.tensorflow works, and I followed all the steps in the documentation, but I still get the same error.
I am doing all of this in a virtual environment; I hope that is not an issue.
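One way to check whether the interpreter that actually runs dp can see Horovod is to query the import system from that same interpreter. The following is a minimal sketch using only the standard library; the helper name module_available is made up for illustration and is not part of DeePMD-kit or Horovod.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if `name` can be located by this interpreter's import system."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. no `horovod` at all) raises
        # instead of returning None, so treat that as "not available".
        return False


if __name__ == "__main__":
    # Run this with the same Python that `dp` uses; a virtual environment
    # and the global site-packages can give different answers.
    print("interpreter:", sys.executable)
    print("horovod.tensorflow found:", module_available("horovod.tensorflow"))
```

If the interpreter path printed here differs from the one in the shebang line of the dp script, the two are probably looking at different site-packages directories.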
As mentioned above, I was working in a virtual environment. After I installed horovod again globally, it now works in the virtual environment as well. But now I am getting a new error ---
Command -- CUDA_VISIBLE_DEVICES=0,1 mpirun -hostfile hostfile -np 2 -x NCCL_DEBUG=INFO dp train --mpi-log=workers input.json
Output --
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Bootstrap : Using enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Failed to open libibverbs.so[.1]
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO NET/Socket : Using [0]enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO Using network Socket
NCCL version 2.12.12+cuda11.7
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Bootstrap : Using enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
hp-HP-Z8-G4-Workstation:198126:198130 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:963 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Failed to open libibverbs.so[.1]
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO NET/Socket : Using [0]enp4s0f2:10.128.3.131<0>
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO Using network Socket
hp-HP-Z8-G4-Workstation:198126:198130 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:198126:198130 [0] NCCL INFO init.cc:963 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:198127:198131 [1] NCCL INFO init.cc:963 -> 1
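The message 'CUDA driver is a stub library' is what the CUDA runtime reports when the libcuda it loaded is the link-time stub shipped with the CUDA toolkit (kept under a stubs directory) rather than the real driver library. As a rough diagnostic, the sketch below lists libcuda.so* files in the directories named by LD_LIBRARY_PATH and flags those under a stubs directory. It is only a heuristic: it does not look at rpath entries or the ld.so cache, and the function names are invented for this example.

```python
import os
from pathlib import Path
from typing import List


def find_libcuda_candidates(ld_library_path: str) -> List[str]:
    """List libcuda.so* files in the directories of an LD_LIBRARY_PATH-style string."""
    found = []
    for entry in ld_library_path.split(os.pathsep):
        if not entry:
            continue  # skip empty segments such as a trailing separator
        directory = Path(entry)
        if directory.is_dir():
            found.extend(str(p) for p in sorted(directory.glob("libcuda.so*")))
    return found


def looks_like_stub(path: str) -> bool:
    """Heuristic: the toolkit's link-time stub lives under a `stubs` directory."""
    return "stubs" in Path(path).parts


if __name__ == "__main__":
    for lib in find_libcuda_candidates(os.environ.get("LD_LIBRARY_PATH", "")):
        tag = "STUB?" if looks_like_stub(lib) else "ok"
        print(f"{tag:6} {lib}")
```

An entry tagged STUB? that appears before the real driver directory would be a plausible explanation for the NCCL warning above, though other causes are possible.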
I ran nccl-test as mentioned in the linked instructions and it worked, but when I tried to run nccl-test with CUDART as described in that link, I got --
./build/all_gather_perf: error while loading shared libraries: libcudart.so.11.0: cannot open shared object file: No such file or directory.
To resolve this, I gave the path to libcudart.so.11.0 explicitly, but it still did not work, so I had to copy that file to lib64.
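A quick way to see whether the dynamic loader can resolve a library in the current environment, without running the full test binary, is to try opening it with ctypes. This is a sketch, not a replacement for ldd; the helper name can_load is an assumption for illustration, and LD_LIBRARY_PATH has to be set before the Python process starts for it to take effect.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Try to open a shared library by name, as the dynamic loader would resolve it."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the library cannot be found or cannot be loaded.
        return False


if __name__ == "__main__":
    # Library name taken from the error message in this thread.
    name = "libcudart.so.11.0"
    print(name, "loadable" if can_load(name) else "NOT loadable")
```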
@njzjz I tried the link you mentioned and was able to run nccl-test via cudart:
(tensorflow) anurag1@hp-HP-Z8-G4-Workstation:/nccl-tests$ NCCL_DEBUG=WARN LD_LIBRARY_PATH=~/.local/nccl/lib/ ./src/build-shared/all_gather_perf
nThread 1 nGpus 1 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1
Using devices
Rank 0 Pid 775733 on hp-HP-Z8-G4-Workstation device 0 [0x15] NVIDIA GeForce RTX 2080 Ti
NCCL version 2.12.12+cuda11.7
But the error still persists.
Command -- CUDA_VISIBLE_DEVICES=0,1 mpirun -hostfile hostfile -np 2 -x NCCL_DEBUG=INFO dp train --mpi-log=workers input.json
Error Stack Trace --
hp-HP-Z8-G4-Workstation:779166:779172 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:913 -> 1
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:950 -> 1
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:963 -> 1
Traceback (most recent call last):
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
return fn(*args)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
[[{{node HorovodBroadcast_layer_0_type_1_bias_Adam_1_0}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in <module>
sys.exit(main())
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/main.py", line 473, in main
train_dp(**dict_args)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 106, in train
_do_work(jdata, run_opt, is_compress)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
model.train(train_data, valid_data)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 443, in train
self._init_session()
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 435, in _init_session
run_sess(self.sess, bcast_op)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/utils/sess.py", line 21, in run_sess
return sess.run(*args, **kwargs)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 967, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1190, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1370, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1396, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
Detected at node 'HorovodBroadcast_layer_0_type_1_bias_Adam_1_0' defined at (most recent call last):
File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in <module>
sys.exit(main())
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/main.py", line 473, in main
train_dp(**dict_args)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 106, in train
_do_work(jdata, run_opt, is_compress)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
model.train(train_data, valid_data)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 443, in train
self._init_session()
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 430, in _init_session
bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 339, in broadcast_global_variables
return broadcast_variables(_global_variables(), root_rank)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
return broadcast_group(variables, root_rank, process_set)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
File "<string>", line 515, in horovod_broadcast
Node: 'HorovodBroadcast_layer_0_type_1_bias_Adam_1_0'
ncclCommInitRank failed: unhandled cuda error
[[{{node HorovodBroadcast_layer_0_type_1_bias_Adam_1_0}}]]
Original stack trace for 'HorovodBroadcast_layer_0_type_1_bias_Adam_1_0':
File "/home/anurag1/venv/tensorflow/bin/dp", line 8, in <module>
sys.exit(main())
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/main.py", line 473, in main
train_dp(**dict_args)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 106, in train
_do_work(jdata, run_opt, is_compress)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
model.train(train_data, valid_data)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 443, in train
self._init_session()
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/deepmd/train/trainer.py", line 430, in _init_session
bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/__init__.py", line 339, in broadcast_global_variables
return broadcast_variables(_global_variables(), root_rank)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
return broadcast_group(variables, root_rank, process_set)
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
File "<string>", line 515, in horovod_broadcast
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 797, in _apply_op_helper
op = g._create_op_internal(op_type_name, inputs, dtypes=None,
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3754, in _create_op_internal
ret = Operation(
File "/home/anurag1/venv/tensorflow/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 2133, in __init__
self._traceback = tf_stack.extract_stack_for_node(self._c_op)
hp-HP-Z8-G4-Workstation:779167:779171 [1] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
hp-HP-Z8-G4-Workstation:779167:779171 [1] NCCL INFO init.cc:1084 -> 4
hp-HP-Z8-G4-Workstation:779166:779172 [0] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL
hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:1084 -> 4
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
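With NCCL_DEBUG=INFO the actual failure is easy to lose among the INFO lines and the repeated Python tracebacks. The sketch below pulls out only the NCCL WARN messages from captured mpirun output and counts them, so the distinct root causes stand out. The regular expression is an assumption based on the log lines shown in this thread, not a documented NCCL log format, and count_nccl_warnings is a name invented for the example.

```python
import re
from collections import Counter
from typing import Iterable

# Matches lines such as:
#   host:198126:198130 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'
NCCL_WARN = re.compile(r"NCCL WARN (?P<message>.*)$")


def count_nccl_warnings(lines: Iterable[str]) -> Counter:
    """Count each distinct NCCL WARN message found in the given log lines."""
    counts: Counter = Counter()
    for line in lines:
        match = NCCL_WARN.search(line)
        if match:
            counts[match.group("message").strip()] += 1
    return counts


if __name__ == "__main__":
    # Lines copied from the output above.
    log = [
        "hp-HP-Z8-G4-Workstation:779166:779172 [0] init.cc:255 NCCL WARN Cuda failure 'CUDA driver is a stub library'",
        "hp-HP-Z8-G4-Workstation:779166:779172 [0] NCCL INFO init.cc:913 -> 1",
        "hp-HP-Z8-G4-Workstation:779167:779171 [1] misc/argcheck.cc:30 NCCL WARN ncclGetAsyncError : comm argument is NULL",
    ]
    for message, n in count_nccl_warnings(log).most_common():
        print(n, message)
```

In this thread the first warning (the stub library failure) is the one to chase; the later "comm argument is NULL" warnings appear only after communicator initialization has already failed.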
@njzjz I tried using conda, but the error is still the same.
Output --
[0] DEEPMD rank:0 INFO built training
[0] DEEPMD rank:0 INFO initialize model from scratch
[0] DEEPMD rank:0 INFO broadcast global variables to other tasks
[1] DEEPMD rank:1 INFO built training
[1] DEEPMD rank:1 INFO receive global variables from task#0
[1] Traceback (most recent call last):
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
[1] return fn(*args)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
[1] return self._call_tf_sessionrun(options, feed_dict, fetch_list,
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
[1] return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[1] tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
[1] [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]][1]
[1]
[1] During handling of the above exception, another exception occurred:
[1]
[1] Traceback (most recent call last):
[1] File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[1] sys.exit(main())
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[1] train_dp(**dict_args)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[1] _do_work(jdata, run_opt, is_compress)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[1] model.train(train_data, valid_data)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[1] self._init_session()
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 445, in _init_session
[1] run_sess(self.sess, bcast_op)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/sess.py", line 21, in run_sess
[1] return sess.run(*args, **kwargs)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 967, in run
[1] result = self._run(None, fetches, feed_dict, options_ptr,
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1190, in _run
[1] results = self._do_run(handle, final_targets, final_fetches,
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1370, in _do_run
[1] return self._do_call(_run_fn, feeds, fetches, targets, options,
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1396, in _do_call
[1] raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
[1] tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
[1]
[1] Detected at node 'HorovodBroadcast_filter_type_0_matrix_3_0_0' defined at (most recent call last):
[1] File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[1] sys.exit(main())
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[1] train_dp(**dict_args)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[1] _do_work(jdata, run_opt, is_compress)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[1] model.train(train_data, valid_data)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[1] self._init_session()
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[1] bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[1] return broadcast_variables(_global_variables(), root_rank)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[1] return broadcast_group(variables, root_rank, process_set)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[1] return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[1] return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[1] return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[1] File "<string>", line 515, in horovod_broadcast
[1] Node: 'HorovodBroadcast_filter_type_0_matrix_3_0_0'
[1] ncclCommInitRank failed: unhandled cuda error
[1] [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
[1]
[1] Original stack trace for 'HorovodBroadcast_filter_type_0_matrix_3_0_0':
[1] File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[1] sys.exit(main())
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[1] train_dp(**dict_args)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[1] _do_work(jdata, run_opt, is_compress)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[1] model.train(train_data, valid_data)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[1] self._init_session()
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[1] bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[1] return broadcast_variables(_global_variables(), root_rank)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[1] return broadcast_group(variables, root_rank, process_set)
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[1] return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[1] return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[1] return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[1] File "<string>", line 515, in horovod_broadcast
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 797, in _apply_op_helper
[1] op = g._create_op_internal(op_type_name, inputs, dtypes=None,
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 3754, in _create_op_internal
[1] ret = Operation(
[1] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2133, in __init__
[1] self._traceback = tf_stack.extract_stack_for_node(self._c_op)
[1]
[0] Traceback (most recent call last):
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1377, in _do_call
[0] return fn(*args)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1360, in _run_fn
[0] return self._call_tf_sessionrun(options, feed_dict, fetch_list,
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1453, in _call_tf_sessionrun
[0] return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
[0] tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled cuda error
[0] [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]][0]
[0]
[0] During handling of the above exception, another exception occurred:
[0]
[0] Traceback (most recent call last):
[0] File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[0] sys.exit(main())
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[0] train_dp(**dict_args)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[0] _do_work(jdata, run_opt, is_compress)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[0] model.train(train_data, valid_data)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[0] self._init_session()
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 445, in _init_session
[0] run_sess(self.sess, bcast_op)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/utils/sess.py", line 21, in run_sess
[0] return sess.run(*args, **kwargs)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 967, in run
[0] result = self._run(None, fetches, feed_dict, options_ptr,
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1190, in _run
[0] results = self._do_run(handle, final_targets, final_fetches,
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1370, in _do_run
[0] return self._do_call(_run_fn, feeds, fetches, targets, options,
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/client/session.py", line 1396, in _do_call
[0] raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
[0] tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:
[0]
[0] Detected at node 'HorovodBroadcast_filter_type_0_matrix_3_0_0' defined at (most recent call last):
[0] File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[0] sys.exit(main())
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[0] train_dp(**dict_args)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[0] _do_work(jdata, run_opt, is_compress)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[0] model.train(train_data, valid_data)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[0] self._init_session()
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[0] bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[0] return broadcast_variables(_global_variables(), root_rank)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[0] return broadcast_group(variables, root_rank, process_set)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[0] return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[0] return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[0] return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[0] File "<string>", line 515, in horovod_broadcast
[0] Node: 'HorovodBroadcast_filter_type_0_matrix_3_0_0'
[0] ncclCommInitRank failed: unhandled cuda error
[0] [[{{node HorovodBroadcast_filter_type_0_matrix_3_0_0}}]]
[0]
[0] Original stack trace for 'HorovodBroadcast_filter_type_0_matrix_3_0_0':
[0] File "/home/anurag1/miniconda3/envs/deepmd/bin/dp", line 10, in <module>
[0] sys.exit(main())
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/main.py", line 516, in main
[0] train_dp(**dict_args)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 106, in train
[0] _do_work(jdata, run_opt, is_compress)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/entrypoints/train.py", line 167, in _do_work
[0] model.train(train_data, valid_data)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 453, in train
[0] self._init_session()
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/deepmd/train/trainer.py", line 440, in _init_session
[0] bcast_op = self.run_opt._HVD.broadcast_global_variables(0)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/__init__.py", line 299, in broadcast_global_variables
[0] return broadcast_variables(_global_variables(), root_rank)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 94, in broadcast_variables
[0] return broadcast_group(variables, root_rank, process_set)
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in broadcast_group
[0] return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/functions.py", line 42, in <listcomp>
[0] return tf.group([var.assign(broadcast(var, root_rank, process_set=process_set))
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/horovod/tensorflow/mpi_ops.py", line 274, in broadcast
[0] return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank,
[0] File "<string>", line 515, in horovod_broadcast
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/op_def_library.py", line 797, in _apply_op_helper
[0] op = g._create_op_internal(op_type_name, inputs, dtypes=None,
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 3754, in _create_op_internal
[0] ret = Operation(
[0] File "/home/anurag1/miniconda3/envs/deepmd/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 2133, in __init__
[0] self._traceback = tf_stack.extract_stack_for_node(self._c_op)
Regarding NCCL, there is no src folder and no enhcompat.cc, as I have installed 2.12.12.
I think I need to change the system; either something is wrong with the system or the CUDA installation is corrupt.
I did a clean reinstallation of Ubuntu 22.04 and installed only the deepmd 2.1.3 CUDA 11.6 conda environment, without any other packages.
I do not think it is a package conflict problem on my side.
https://github.com/horovod/horovod/issues/3625#issuecomment-1228884495 could resolve this issue temporarily. The original error should be tracked in the upstream repository.
For conda users: a new NCCL package has been uploaded to our conda channel.
@njzjz Pardon me for asking questions on this issue after a long time.
Could you please tell me in which file I need to make the change CUDARTLIB="cuda"? I installed from source, but I couldn't find any Makefile that contains this line.
For the conda version: could you please provide a link to the conda channel where the NCCL package was uploaded, or are you referring to this one: conda install -c conda-forge nccl
@njzjz Thanks for the prompt response.
When I try to run it, I get the following error --
I think I am missing some steps: it requires a Makefile as an input file, but I don't have a Makefile in the deepmd directory. Please let me know how to make the above changes.
For the conda version: the above link redirects to the installation page of the official website. Where can I download the NCCL package from, or will the command conda install -c conda-forge nccl work?