[MI200][MI100][MI300] UB of ConvHipImplicitGemmBwdXdlops
Originated from: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1911#issuecomment-1533884703
...it failed in smoke_solver_ConvHipImplicitGemmBwdXdlops test for the stage Fp32 Hip Debug gfx90a. Here is the error message:
[2023-05-03T19:26:30.939Z] /home/jenkins/workspace/MLLibs_MIOpen_PR-1911/build/bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --disable-forward --disable-backward-weights --input 128 64 7 7 --weights 64 64 3 3 --batch_size 128 --input_channels 64 --output_channels 64 --spatial_dim_elements 7 7 --filter_dims 3 3 --pads_strides_dilations 1 1 1 1 1 1 --trans_output_pads 0 0 --in_layout NHWC --fil_layout NHWC --out_layout NHWC --deterministic 0 --tensor_vect 0 --vector_length 1 --output_type int32 --int8_vectorize 0
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [GetWorkSpaceSize] 0
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [FindConvBwdDataAlgorithm] requestAlgoCount = 1, workspace = 0
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 0.16479 ms
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [TryLoad] Find-db regenerating.
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [GetPerfDbPathFile] Found exact perf database file
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [FindSolutionImpl] ConvHipImplicitGemmBwdXdlops
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Warning [FindSolutionImpl] Perf Db: load skipped: ConvHipImplicitGemmBwdXdlops, enforce: SEARCH_DB_UPDATE(4)
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [FindSolutionImpl] Starting search: ConvHipImplicitGemmBwdXdlops, enforce: SEARCH_DB_UPDATE(4)
[2023-05-03T19:26:30.939Z] UndefinedBehaviorSanitizer:DEADLYSIGNAL
[2023-05-03T19:26:30.939Z] ==189537==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address (pc 0x7fda3951ae1e bp 0x7ffe35f2cce0 sp 0x7ffe35f2c990 T189537)
[2023-05-03T19:26:30.939Z] ==189537==The signal is caused by a READ memory access.
[2023-05-03T19:26:30.939Z] ==189537==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used.
[2023-05-03T19:26:30.939Z] #0 0x7fda3951ae1e (/home/jenkins/workspace/MLLibs_MIOpen_PR-1911/build/lib/libMIOpen.so.1+0x1cfc1e1e)
The issue should be reproducible on an MI200 node. The library must be built in the debug configuration with sanitizers enabled, like this:
CXX=/opt/rocm/llvm/bin/clang++ \
CXXFLAGS=-Werror \
cmake \
-DBUILD_DEV=On \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DMIOPEN_TEST_FLAGS="--verbose --disable-verification-cache" \
-DMIOPEN_ENABLE_AI_KERNEL_TUNING=Off \
-DCMAKE_BUILD_TYPE=debug \
-DCMAKE_CXX_FLAGS_DEBUG="-g -fdebug-default-version=4 -fno-omit-frame-pointer -fsanitize=undefined -fno-sanitize-recover=undefined -Wno-option-ignored" \
-DMIOPEN_GPU_SYNC=On \
../..
🔴 For now, let's consider this as a blocker until some initial investigation is done. For example, it is possible that the issue happens only during tuning (but I am not sure). If so, then the urgency of this issue can be lowered to https://github.com/ROCmSoftwarePlatform/MIOpen/labels/urgency_high or even https://github.com/ROCmSoftwarePlatform/MIOpen/labels/urgency_normal
Command to reproduce:
MIOPEN_FIND_ENFORCE=SEARCH_DB_UPDATE \
MIOPEN_DEBUG_TUNING_ITERATIONS_MAX=5 \
MIOPEN_DEBUG_CONVOLUTION_ATTRIB_FP16_ALT_IMPL=0 \
MIOPEN_FIND_MODE=normal \
MIOPEN_DEBUG_FIND_ONLY_SOLVER=ConvHipImplicitGemmBwdXdlops \
./bin/test_conv2d \
--float --verbose --disable-forward --disable-backward-weights \
--input 128 64 7 7 --weights 64 64 3 3 --pads_strides_dilations 1 1 1 1 1 1 \
--in_layout NHWC --fil_layout NHWC --out_layout NHWC \
--verbose --disable-verification-cache
To the assignee: After fixing it, check if it works without tuning restrictions (i.e. without MIOPEN_DEBUG_TUNING_ITERATIONS_MAX=5). It's also worth checking this with --half.
[Attribution] @junliume @JehandadKhan
- https://github.com/ROCmSoftwarePlatform/MIOpen/labels/bug
- https://github.com/ROCmSoftwarePlatform/MIOpen/labels/urgency_blocker
- Proposed assignees (see https://github.com/ROCmSoftwarePlatform/MIOpen/commits/109670ae90577aa6d795d6ab3e57a5f9a1b739be/src/solver/conv_hip_implicit_gemm_bwd_data_xdlops.cpp):
- @iq136boy
- @carlushuang
- @averinevg
UndefinedBehaviorSanitizer:DEADLYSIGNAL: this looks more like a CPU hang.
The same happens on MI100/MI300. Title updated.
More info from the log (MI100, manually formatted):
[2023-06-02T18:42:26.470Z] MIOpen(HIP): Info [FindSolutionImpl] Starting search: ConvHipImplicitGemmBwdXdlops, enforce: SEARCH_DB_UPDATE(4)
[2023-06-02T18:42:26.470Z] /var/jenkins/workspace/MLLibs_MIOpen_PR-1911/src/solver
/conv_hip_implicit_gemm_bwd_data_xdlops.cpp:147:39:
runtime error: member call on null pointer of type
'ck::tensor_operation::device::DeviceConvBwdData
< 2
, ck::tensor_layout::convolution::NHWC
, ck::tensor_layout::convolution::KYXC
, ck::tensor_layout::convolution::NHWK
, float
, float
, float
, ck::tensor_operation::element_wise::PassThrough
, ck::tensor_operation::element_wise::PassThrough
, ck::tensor_operation::element_wise::PassThrough>'
[2023-06-02T18:42:26.470Z] #0 0x7feb82e84143 (/var/jenkins/workspace/MLLibs_MIOpen_PR-1911/build/lib/libMIOpen.so.1+0x1cfc8143)
...
[2023-06-02T18:42:26.472Z] #138 0xe43cfd (/var/jenkins/workspace/MLLibs_MIOpen_PR-1911/build/bin/test_conv2d+0xe43cfd)
[2023-06-02T18:42:26.472Z]
[2023-06-02T18:42:26.472Z] SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/jenkins/workspace/MLLibs_MIOpen_PR-1911/src/solver/conv_hip_implicit_gemm_bwd_data_xdlops.cpp:147:39 in
[2023-06-02T18:42:26.472Z] make[7]: *** [test/CMakeFiles/smoke_solver_ConvHipImplicitGemmBwdXdlops.dir/build.make:57: test/CMakeFiles/smoke_solver_ConvHipImplicitGemmBwdXdlops] Error 1
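For readers unfamiliar with this diagnostic: "member call on null pointer" means a member function was invoked through a pointer that was null at the time of the call. The sketch below only illustrates that pattern and one way to guard against it. It is not the MIOpen or CK code: `DeviceOpBase`, `DummyOp`, and `SafeTypeString` are hypothetical names, and whether the pointer at conv_hip_implicit_gemm_bwd_data_xdlops.cpp:147 is really null (rather than the report being a false positive) is not established in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a polymorphic device-op interface (not the CK type).
struct DeviceOpBase
{
    virtual ~DeviceOpBase()                   = default;
    virtual std::string GetTypeString() const = 0;
};

// Hypothetical concrete instance, used only for illustration.
struct DummyOp : DeviceOpBase
{
    std::string GetTypeString() const override { return "DummyOp"; }
};

// Returns the type string of the instance at `idx`. Returns an empty string
// when `idx` is out of range or the slot holds a null pointer, instead of
// calling a member function through a null pointer (which is what UBSan flags).
std::string SafeTypeString(const std::vector<std::unique_ptr<DeviceOpBase>>& ops,
                           std::size_t idx)
{
    if(idx >= ops.size() || ops[idx] == nullptr)
        return {};
    return ops[idx]->GetTypeString();
}

// Small usage example: slot 0 holds a valid instance, slot 1 is deliberately null.
inline void Example()
{
    std::vector<std::unique_ptr<DeviceOpBase>> ops;
    ops.push_back(std::make_unique<DummyOp>());
    ops.emplace_back(nullptr);
    assert(SafeTypeString(ops, 0) == "DummyOp");
    assert(SafeTypeString(ops, 1).empty());
}
```

Without the guard, `ops[1]->GetTypeString()` would be exactly the kind of call that `-fsanitize=undefined` reports, and with `-fno-sanitize-recover=undefined` the process aborts at that point.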
@iq136boy @carlushuang Looks like a false positive, BUT we need some CK expert to look at this.
@carlushuang @zjing14 could you double check on this issue? If false positive, we can close it.
Ping
This does not seem to be reproducible so far. Lowering the urgency level; will assign to a CK team member.
@junliume Is this reproducible with latest ROCm 6.1.0? If not, can we close the ticket? Thanks!