deepmd-kit Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /public/home/ghfund2_b5/app/deepmd-kit-master/source/op/custom

Bug summary

I installed LAMMPS from source (built-in mode), following https://github.com/deepmodeling/deepmd-kit/blob/master/doc/install/install-lammps.md. The versions of TensorFlow, ROCm, and LAMMPS are 2.5.0, 4.0.1, and 29Sep2021_update3, respectively. When I run lmp_mpi, the following error occurs.

2022-06-30 13:52:03.399806: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-06-30 13:52:03.504022: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libamdhip64.so
2022-06-30 13:52:03.585404: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 0 with properties:
pciBusID: 0000:26:00.0 name: Device 66a1 ROCm AMDGPU Arch: gfx906
coreClock: 1.7GHz coreCount: 64 deviceMemorySize: 15.98GiB deviceMemoryBandwidth: 953.67GiB/s
2022-06-30 13:52:03.679383: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so
2022-06-30 13:52:03.746202: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libMIOpen.so
2022-06-30 13:52:04.448369: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocfft.so
2022-06-30 13:52:04.737428: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocrand.so
2022-06-30 13:52:04.746124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-06-30 13:52:04.750887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-06-30 13:52:04.750970: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2022-06-30 13:52:04.751013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2022-06-30 13:52:04.759145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14731 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:26:00.0)
2022-06-30 13:52:05.375309: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2000074999 Hz
hip assert: hipErrorInvalidValue /public/home/ghfund2_b5/app/deepmd-kit-master/source/lib/include/gpu_rocm.h 64
2022-06-30 13:52:06.672644: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at custom_op.cc:17 : Internal: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /public/home/ghfund2_b5/app/deepmd-kit-master/source/op/custom_op.cc:17
terminate called after throwing an instance of 'deepmd::tf_exception'
what(): std::exception
[e10r4n07:04423] *** Process received signal ***
[e10r4n07:04423] Signal: Aborted (6)
[e10r4n07:04423] Signal code: (-6)
[e10r4n07:04423] [ 0] /usr//lib64/libpthread.so.0(+0xf5d0)[0x2aeb867fe5d0]
[e10r4n07:04423] [ 1] /usr//lib64/libc.so.6(gsignal+0x37)[0x2aeb86a41207]
[e10r4n07:04423] [ 2] /usr//lib64/libc.so.6(abort+0x148)[0x2aeb86a428f8]
[e10r4n07:04423] [ 3] /public/software/compiler/gcc-11.2.0/lib64/libstdc++.so.6(+0x9c1ae)[0x2aeb85f5c1ae]
[e10r4n07:04423] [ 4] /public/software/compiler/gcc-11.2.0/lib64/libstdc++.so.6(+0xa7516)[0x2aeb85f67516]
[e10r4n07:04423] [ 5] /public/software/compiler/gcc-11.2.0/lib64/libstdc++.so.6(+0xa7581)[0x2aeb85f67581]
[e10r4n07:04423] [ 6] /public/software/compiler/gcc-11.2.0/lib64/libstdc++.so.6(+0xa7805)[0x2aeb85f67805]
[e10r4n07:04423] [ 7] /public/home/ghfund2_b5/app/deepmd-kit-master/lib/libdeepmd_cc.so(_ZN6deepmd12check_statusERKN10tensorflow6StatusE+0x77)[0x2aeb678d4b57]
[e10r4n07:04423] [ 8] /public/home/ghfund2_b5/app/deepmd-kit-master/lib/libdeepmd_cc.so(+0x174c3)[0x2aeb678c64c3]
[e10r4n07:04423] [ 9] /public/home/ghfund2_b5/app/deepmd-kit-master/lib/libdeepmd_cc.so(_ZN6deepmd7DeepPot13compute_innerERdRSt6vectorIdSaIdEES5_RKS4_RKS2_IiSaIiEES7_iRKiS7_S7_+0x148)[0x2aeb678c7718]
[e10r4n07:04423] [10] /public/home/ghfund2_b5/app/deepmd-kit-master/lib/libdeepmd_cc.so(_ZN6deepmd7DeepPot7computeERdRSt6vectorIdSaIdEES5_RKS4_RKS2_IiSaIiEES7_iRKNS_10InputNlistERKiS7_S7_+0x207)[0x2aeb678c7c17]
[e10r4n07:04423] [11] lmp_mpi[0x5fcf8b]
[e10r4n07:04423] [12] lmp_mpi[0x580ea2]
[e10r4n07:04423] [13] lmp_mpi[0x459cb9]
[e10r4n07:04423] [14] lmp_mpi[0x413b1b]
[e10r4n07:04423] [15] lmp_mpi[0x4142dc]
[e10r4n07:04423] [16] lmp_mpi[0x409648]
[e10r4n07:04423] [17] /usr//lib64/libc.so.6(__libc_start_main+0xf5)[0x2aeb86a2d3d5]
[e10r4n07:04423] [18] lmp_mpi[0x40a68f]
[e10r4n07:04423] *** End of error message ***
/opt/gridview/slurm/spool_slurmd/job21021257/slurm_script: line 17: 4423 Aborted lmp_mpi -i input.lammps

LAMMPS (29 Sep 2021 - Update 3)
Reading data file ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (44.996300 44.996300 44.996300) with tilt (0.0000000 0.0000000 0.0000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  760 atoms
  read_data CPU = 0.022 seconds
Changing box ...
  triclinic box = (0.0000000 0.0000000 0.0000000) to (44.996300 44.996300 44.996300) with tilt (0.0000000 0.0000000 0.0000000)
Summary of lammps deepmd module ...

Info of deepmd-kit:
installed to: /public/home/ghfund2_b5/app/deepmd-kit-master
source:
source branch:
source commit:
source commit at:
surpport model ver.:1.0
build float prec: double
build with tf inc: /public/software/apps/DeepLearning/TensorFlow/TF_C2/include;/public/software/apps/DeepLearning/TensorFlow/TF_C2/include
build with tf lib: /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_cc.so;/public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_framework.so
set tf intra_op_parallelism_threads: 0
set tf inter_op_parallelism_threads: 0

Info of lammps module:
use deepmd-kit at: /public/home/ghfund2_b5/app/deepmd-kit-master
source:
source branch:
source commit:
source commit at:
build float prec: double
build with tf inc: /public/software/apps/DeepLearning/TensorFlow/TF_C2/include;/public/software/apps/DeepLearning/TensorFlow/TF_C2/include
build with tf lib: /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_cc.so;/public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_framework.so

Info of model(s):
using 1 model(s): graph.000.pb
rcut in model: 6
ntypes in model: 3

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:

USER-DEEPMD package: The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Neighbor list info ...
  update every 1 steps, delay 10 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 7
  ghost atom cutoff = 7
  binsize = 3.5, bins = 13 13 13
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair deepmd, perpetual
  attributes: full, newton on
  pair build: full/bin/atomonly
  stencil: full/bin/3d
  bin: standard
Setting up Verlet run ...
  Unit style : metal
  Current step : 0
  Time step : 0.00025
Internal: 2 root error(s) found.
  (0) Internal: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /public/home/ghfund2_b5/app/deepmd-kit-master/source/op/custom_op.cc:17
  [[{{node ProdEnvMatA}}]]
  (1) Internal: Operation received an exception: DeePMD-kit Error: CUDA Assert, in file /public/home/ghfund2_b5/app/deepmd-kit-master/source/op/custom_op.cc:17
  [[{{node ProdEnvMatA}}]]
  [[o_energy/_27]]
0 successful operations.
0 derived errors ignored.

DeePMD-kit Version

2.0

TensorFlow Version

2.5.0

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

lmp_mpi -i lammps.slurm

Steps to Reproduce

nothing

Further Information, Files, and Links

No response

Jun 30 '22 06:06 ShangZhe-1999

"CudaAssert" is a typo. The actual error is

hip assert: hipErrorInvalidValue /public/home/ghfund2_b5/app/deepmd-kit-master/source/lib/include/gpu_rocm.h 64
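Because the underlying failure is easy to miss among the TensorFlow messages, it can help to pull the `hip assert` entries out of the log automatically. The following is a minimal sketch, assuming the log uses the `hip assert: <code> <file> <line>` layout seen above; the function name `find_hip_asserts` and the pattern are illustrative and are not part of DeePMD-kit.

```python
import re

# Assumed layout, taken from the log in this report:
#   hip assert: hipErrorInvalidValue /path/to/gpu_rocm.h 64
HIP_ASSERT = re.compile(
    r"hip assert: (?P<code>hipError\w+) (?P<file>\S+) (?P<line>\d+)"
)


def find_hip_asserts(log_text):
    """Return (error_code, source_file, line_number) for each 'hip assert' entry."""
    return [
        (m["code"], m["file"], int(m["line"]))
        for m in HIP_ASSERT.finditer(log_text)
    ]


# Excerpt from the log in this report (entries joined on one line, as posted).
sample = (
    "CPU Frequency: 2000074999 Hz hip assert: hipErrorInvalidValue "
    "/public/home/ghfund2_b5/app/deepmd-kit-master/source/lib/include/gpu_rocm.h 64 "
    "2022-06-30 13:52:06.672644: W tensorflow/core/framework/op_kernel.cc:1767] "
    "OP_REQUIRES failed at custom_op.cc:17"
)

for code, path, line in find_hip_asserts(sample):
    print(f"{code} raised at {path}:{line}")
```

The wrapper message (`CUDA Assert` or `HIP Assert` at `custom_op.cc:17`) only reports that an exception was caught; the `hipError...` code and the `gpu_rocm.h` location are the parts worth quoting when asking for help.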

Jun 30 '22 19:06 njzjz

Thanks for your help. I made the changes according to your advice (https://github.com/deepmodeling/deepmd-kit/pull/1802), but a similar error still occurred. I checked the directory tensorflow/core/framework and found that only "op_kernel.h" exists there; there is no "op_kernel.cc" at all. So the actual error is still not solved.

hip assert: hipErrorInvalidValue /public/home/ghfund2_b5/app/deepmd-kit-2.1.0/source/lib/include/gpu_rocm.h 64
2022-07-01 16:13:01.575945: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at custom_op.cc:17 : Internal: Operation received an exception: DeePMD-kit Error: HIP Assert, in file /public/home/ghfund2_b5/app/deepmd-kit-2.1.0/source/op/custom_op.cc:17

Internal: 2 root error(s) found.
  (0) Internal: Operation received an exception: DeePMD-kit Error: HIP Assert, in file /public/home/ghfund2_b5/app/deepmd-kit-2.1.0/source/op/custom_op.cc:17
  [[{{node ProdEnvMatA}}]]
  (1) Internal: Operation received an exception: DeePMD-kit Error: HIP Assert, in file /public/home/ghfund2_b5/app/deepmd-kit-2.1.0/source/op/custom_op.cc:17
  [[{{node ProdEnvMatA}}]]
  [[o_energy/_29]]
0 successful operations.
0 derived errors ignored.
ERROR: DeePMD-kit Error: TensorFlow Error: Internal: 2 root error(s) found.
  (0) Internal: Operation received an exception: DeePMD-kit Error: HIP Assert, in file /public/home/ghfund2_b5/app/deepmd-kit-2.1.0/source/op/custom_op.cc:17
  [[{node ProdEnvMatA}]]
  (1) Internal: Operation received an exception: DeePMD-kit Error: HIP Assert, in file /public/home/ghfund2_b5/app/deepmd-kit-2.1.0/source/op/custom_op.cc:17
  [[{node ProdEnvMatA}]]
  [[o_energy/_29]]
0 successful operations.
0 derived errors ignored. (../pair_deepmd.cpp:385)

Jul 01 '22 11:07 ShangZhe-1999

Could you check that it's not an out-of-memory error?

@denghuilu any ideas?
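One quick, non-conclusive way to look for an out-of-memory condition is to scan the job log for common markers. The sketch below is only an illustration: the marker list is an assumption (it is not exhaustive, and the exact wording depends on the TensorFlow and HIP versions), and `oom_lines` is a hypothetical helper name.

```python
# Assumed, non-exhaustive list of strings that often accompany GPU
# out-of-memory failures in HIP / TensorFlow logs.
OOM_MARKERS = (
    "hipErrorOutOfMemory",
    "RESOURCE_EXHAUSTED",
    "Resource exhausted",
    "OOM when allocating tensor",
)


def oom_lines(log_text):
    """Return the log lines that contain any of the assumed OOM markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in OOM_MARKERS)
    ]


# Two entries from the log posted in this issue.
sample = (
    "hip assert: hipErrorInvalidValue "
    "/public/home/ghfund2_b5/app/deepmd-kit-2.1.0/source/lib/include/gpu_rocm.h 64\n"
    "Internal: Operation received an exception: DeePMD-kit Error: HIP Assert\n"
)

hits = oom_lines(sample)
print("possible OOM entries:", hits if hits else "none found")
```

An empty result does not rule out memory pressure; it only shows that none of the assumed markers appear in the log.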

Jul 01 '22 20:07 njzjz

I switched from rocm-4.0.1 to dtk-21.04, and the error mentioned above no longer occurs, so I guess rocm-4.0.1 may have been causing the problem. But a new error occurred: /data/jenkins_workspace/workspace/hip_21.04/hip/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

2022-07-02 11:03:04.065056: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-02 11:03:04.123404: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libamdhip64.so
2022-07-02 11:03:04.129170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1747] Found device 0 with properties:
pciBusID: 0000:04:00.0 name: Device 66a1 ROCm AMDGPU Arch: gfx906
coreClock: 1.7GHz coreCount: 64 deviceMemorySize: 15.98GiB deviceMemoryBandwidth: 953.67GiB/s
2022-07-02 11:03:04.226959: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so
2022-07-02 11:03:04.363285: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libMIOpen.so
2022-07-02 11:03:04.898105: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocfft.so
2022-07-02 11:03:05.048960: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocrand.so
2022-07-02 11:03:05.065268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-07-02 11:03:05.065780: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-07-02 11:03:05.065857: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2022-07-02 11:03:05.065900: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2022-07-02 11:03:05.075822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14731 MB memory) -> physical GPU (device: 0, name: Device 66a1, pci bus id: 0000:04:00.0)
2022-07-02 11:03:05.930924: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 1999885000 Hz
2022-07-02 11:03:06.797016: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library librocblas.so
/data/jenkins_workspace/workspace/hip_21.04/hip/rocclr/hip_code_object.cpp:120: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")
[e10r4n02:11411] *** Process received signal ***
[e10r4n02:11411] Signal: Aborted (6)
[e10r4n02:11411] Signal code: (-6)
[e10r4n02:11411] [ 0] /usr//lib64/libpthread.so.0(+0xf5d0)[0x2b16f6d705d0]
[e10r4n02:11411] [ 1] /usr//lib64/libc.so.6(gsignal+0x37)[0x2b16f6fb3207]
[e10r4n02:11411] [ 2] /usr//lib64/libc.so.6(abort+0x148)[0x2b16f6fb48f8]
[e10r4n02:11411] [ 3] /public/software/compiler/rocm/dtk-21.04/lib/libamdhip64.so.4(+0x1b3f36)[0x2b1703115f36]
[e10r4n02:11411] [ 4] /public/software/compiler/rocm/dtk-21.04/lib/libamdhip64.so.4(+0x92d1e)[0x2b1702ff4d1e]
[e10r4n02:11411] [ 5] /public/software/compiler/rocm/dtk-21.04/lib/libamdhip64.so.4(+0xb9284)[0x2b170301b284]
[e10r4n02:11411] [ 6] /public/software/compiler/rocm/dtk-21.04/lib/libamdhip64.so.4(+0x92f26)[0x2b1702ff4f26]
[e10r4n02:11411] [ 7] /public/software/compiler/rocm/dtk-21.04/lib/libamdhip64.so.4(+0x15b22f)[0x2b17030bd22f]
[e10r4n02:11411] [ 8] /public/software/compiler/rocm/dtk-21.04/lib/libamdhip64.so.4(hipModuleLoadData+0x478)[0x2b1703084d48]
[e10r4n02:11411] [ 9] /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_cc.so.2(+0x1019845c)[0x2b16e7ef345c]
[e10r4n02:11411] [10] /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_cc.so.2(+0xfdfe7ab)[0x2b16e7b597ab]
[e10r4n02:11411] [11] /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_cc.so.2(+0xfe00810)[0x2b16e7b5b810]
[e10r4n02:11411] [12] /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_cc.so.2(+0xfdc356d)[0x2b16e7b1e56d]
[e10r4n02:11411] [13] /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_framework.so.2(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x1d4)[0x2b16f4d130f4]
[e10r4n02:11411] [14] /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_framework.so.2(+0xb6ea14)[0x2b16f4e18a14]
[e10r4n02:11411] [15] /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_framework.so.2(+0xb6f738)[0x2b16f4e19738]
[e10r4n02:11411] [16] /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_cc.so.2(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x2ae)[0x2b16db45199e]
[e10r4n02:11411] [17] /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_cc.so.2(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x43)[0x2b16db44e943]
[e10r4n02:11411] [18] /public/software/apps/DeepLearning/TensorFlow/TF_C2/lib/libtensorflow_framework.so.2(+0x12a44f5)[0x2b16f554e4f5]
[e10r4n02:11411] [19] /usr//lib64/libpthread.so.0(+0x7dd5)[0x2b16f6d68dd5]
[e10r4n02:11411] [20] /usr//lib64/libc.so.6(clone+0x6d)[0x2b16f707aead]
[e10r4n02:11411] *** End of error message ***
/opt/gridview/slurm/spool_slurmd/job21074602/slurm_script: line 18: 11411 Aborted lmp_mpi -i input.lammps

Jul 02 '22 12:07 ShangZhe-1999

This error means TensorFlow was not built for this device. Did you use a different device to build it?
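To compare the device against the build, one can read the architecture that TensorFlow reports at startup (`AMDGPU Arch: gfx906` in the log above) and check it against the targets the binaries were compiled for. This is a rough sketch under assumptions: the `build_targets` values below are hypothetical placeholders, since the thread does not say which targets this TensorFlow build used, and the helper names are illustrative.

```python
import re

# Matches the architecture field in TensorFlow's "Found device" log entry.
ARCH = re.compile(r"AMDGPU Arch: (gfx\w+)")


def device_archs(log_text):
    """Return the distinct GPU architectures reported in the log, sorted."""
    return sorted(set(ARCH.findall(log_text)))


def missing_targets(log_text, build_targets):
    """Return the reported architectures that are absent from build_targets."""
    return [arch for arch in device_archs(log_text) if arch not in build_targets]


# Excerpt from the log posted in this issue.
sample = (
    "Found device 0 with properties: pciBusID: 0000:04:00.0 "
    "name: Device 66a1 ROCm AMDGPU Arch: gfx906 coreClock: 1.7GHz"
)

# Hypothetical list of targets the binaries were built for; replace it with
# the values actually used when building TensorFlow and DeePMD-kit.
build_targets = ["gfx900", "gfx908"]

print("device architectures:", device_archs(sample))
print("not covered by the build:", missing_targets(sample, build_targets))
```

If the second list is non-empty, the binary contains no code object for the GPU on the compute node, which would be consistent with `hipErrorNoBinaryForGpu`.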

Oct 31 '22 15:10 njzjz

Close as we cannot reproduce the error.

Oct 16 '23 18:10 njzjz