[MI200][MI100][MI300] UB of ConvHipImplicitGemmBwdXdlops

Open atamazov opened this issue 2 years ago • 8 comments

Originated from: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/1911#issuecomment-1533884703

...it failed in smoke_solver_ConvHipImplicitGemmBwdXdlops test for the stage Fp32 Hip Debug gfx90a. Here is the error message:

[2023-05-03T19:26:30.939Z] /home/jenkins/workspace/MLLibs_MIOpen_PR-1911/build/bin/test_conv2d --float --cmode conv --pmode default --group-count 1 --disable-forward --disable-backward-weights --input 128 64 7 7 --weights 64 64 3 3 --batch_size 128 --input_channels 64 --output_channels 64 --spatial_dim_elements 7 7 --filter_dims 3 3 --pads_strides_dilations 1 1 1 1 1 1 --trans_output_pads 0 0 --in_layout NHWC --fil_layout NHWC --out_layout NHWC --deterministic 0 --tensor_vect 0 --vector_length 1 --output_type int32 --int8_vectorize 0 
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [GetWorkSpaceSize] 0
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [FindConvBwdDataAlgorithm] requestAlgoCount = 1, workspace = 0
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [Measure] RamDb::Prefetch time: 0.16479 ms
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [TryLoad] Find-db regenerating.
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [GetPerfDbPathFile] Found exact perf database file
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [FindSolutionImpl] ConvHipImplicitGemmBwdXdlops
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Warning [FindSolutionImpl] Perf Db: load skipped: ConvHipImplicitGemmBwdXdlops, enforce: SEARCH_DB_UPDATE(4)
[2023-05-03T19:26:30.939Z] MIOpen(HIP): Info [FindSolutionImpl] Starting search: ConvHipImplicitGemmBwdXdlops, enforce: SEARCH_DB_UPDATE(4)
[2023-05-03T19:26:30.939Z] UndefinedBehaviorSanitizer:DEADLYSIGNAL
[2023-05-03T19:26:30.939Z] ==189537==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address (pc 0x7fda3951ae1e bp 0x7ffe35f2cce0 sp 0x7ffe35f2c990 T189537)
[2023-05-03T19:26:30.939Z] ==189537==The signal is caused by a READ memory access.
[2023-05-03T19:26:30.939Z] ==189537==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
[2023-05-03T19:26:30.939Z]     #0 0x7fda3951ae1e  (/home/jenkins/workspace/MLLibs_MIOpen_PR-1911/build/lib/libMIOpen.so.1+0x1cfc1e1e)

The issue should be reproducible on an MI200 node. The library should be built in the debug configuration with sanitizers enabled, like this:

CXX=/opt/rocm/llvm/bin/clang++ \
CXXFLAGS=-Werror \
cmake \
-DBUILD_DEV=On \
-DCMAKE_PREFIX_PATH=/opt/rocm \
-DMIOPEN_TEST_FLAGS="--verbose --disable-verification-cache" \
-DMIOPEN_ENABLE_AI_KERNEL_TUNING=Off \
-DCMAKE_BUILD_TYPE=debug \
-DCMAKE_CXX_FLAGS_DEBUG="-g -fdebug-default-version=4 -fno-omit-frame-pointer -fsanitize=undefined -fno-sanitize-recover=undefined -Wno-option-ignored" \
-DMIOPEN_GPU_SYNC=On \
../..

🔴 For now, let's treat this as a blocker until some initial investigation is done. For example, it is possible that the issue happens only during tuning (but I am not sure). If so, the urgency of this issue can be lowered to https://github.com/ROCmSoftwarePlatform/MIOpen/labels/urgency_high or even https://github.com/ROCmSoftwarePlatform/MIOpen/labels/urgency_normal.

Command to reproduce:

MIOPEN_FIND_ENFORCE=SEARCH_DB_UPDATE \
MIOPEN_DEBUG_TUNING_ITERATIONS_MAX=5 \
MIOPEN_DEBUG_CONVOLUTION_ATTRIB_FP16_ALT_IMPL=0 \
MIOPEN_FIND_MODE=normal \
MIOPEN_DEBUG_FIND_ONLY_SOLVER=ConvHipImplicitGemmBwdXdlops \
./bin/test_conv2d \
--float --verbose --disable-forward --disable-backward-weights \
--input 128 64 7 7 --weights 64 64 3 3 --pads_strides_dilations 1 1 1 1 1 1 \
--in_layout NHWC --fil_layout NHWC --out_layout NHWC \
--verbose --disable-verification-cache

To the assignee: After fixing it, check if it works without tuning restrictions (i.e. without MIOPEN_DEBUG_TUNING_ITERATIONS_MAX=5). It's also worth checking this with --half.


[Attribution] @junliume @JehandadKhan

  • https://github.com/ROCmSoftwarePlatform/MIOpen/labels/bug
  • https://github.com/ROCmSoftwarePlatform/MIOpen/labels/urgency_blocker
  • Proposed assignees (see https://github.com/ROCmSoftwarePlatform/MIOpen/commits/109670ae90577aa6d795d6ab3e57a5f9a1b739be/src/solver/conv_hip_implicit_gemm_bwd_data_xdlops.cpp):
    • @iq136boy
    • @carlushuang
    • @averinevg

atamazov avatar May 26 '23 23:05 atamazov

UndefinedBehaviorSanitizer:DEADLYSIGNAL: this looks more like a CPU hang.

carlushuang avatar May 31 '23 07:05 carlushuang

The same happens on MI100/MI300. Title updated.

atamazov avatar Jun 05 '23 15:06 atamazov

More info from the log (MI100, manually formatted):

[2023-06-02T18:42:26.470Z] MIOpen(HIP): Info [FindSolutionImpl] Starting search: ConvHipImplicitGemmBwdXdlops, enforce: SEARCH_DB_UPDATE(4)
[2023-06-02T18:42:26.470Z] /var/jenkins/workspace/MLLibs_MIOpen_PR-1911/src/solver
  /conv_hip_implicit_gemm_bwd_data_xdlops.cpp:147:39:
  runtime error: member call on null pointer of type 
'ck::tensor_operation::device::DeviceConvBwdData
  < 2
  , ck::tensor_layout::convolution::NHWC
  , ck::tensor_layout::convolution::KYXC
  , ck::tensor_layout::convolution::NHWK
  , float
  , float
  , float
  , ck::tensor_operation::element_wise::PassThrough
  , ck::tensor_operation::element_wise::PassThrough
  , ck::tensor_operation::element_wise::PassThrough>'
[2023-06-02T18:42:26.470Z]     #0 0x7feb82e84143  (/var/jenkins/workspace/MLLibs_MIOpen_PR-1911/build/lib/libMIOpen.so.1+0x1cfc8143)
...
[2023-06-02T18:42:26.472Z]     #138 0xe43cfd  (/var/jenkins/workspace/MLLibs_MIOpen_PR-1911/build/bin/test_conv2d+0xe43cfd)
[2023-06-02T18:42:26.472Z] 
[2023-06-02T18:42:26.472Z] SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/jenkins/workspace/MLLibs_MIOpen_PR-1911/src/solver/conv_hip_implicit_gemm_bwd_data_xdlops.cpp:147:39 in 
[2023-06-02T18:42:26.472Z] make[7]: *** [test/CMakeFiles/smoke_solver_ConvHipImplicitGemmBwdXdlops.dir/build.make:57: test/CMakeFiles/smoke_solver_ConvHipImplicitGemmBwdXdlops] Error 1
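
For readers unfamiliar with this UBSan diagnostic: "member call on null pointer" means a member function was invoked through a pointer that was null at that point. The code at conv_hip_implicit_gemm_bwd_data_xdlops.cpp:147 is not quoted in this thread, so the following is only a minimal sketch of that class of error and of a guard against it. All names (BaseOp, GenericOp, BwdDataOp, IsSupported, CountSupported, MakeOps) are hypothetical and are not the MIOpen or CK API; the use of dynamic_cast as the source of the null pointer is likewise an assumption made for illustration.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a device-operation interface (NOT the CK type).
struct BaseOp
{
    virtual ~BaseOp() = default;
    virtual std::string Name() const = 0;
};

// An instance that does not implement the interface we are looking for.
struct GenericOp : BaseOp
{
    std::string Name() const override { return "generic"; }
};

// An instance that does implement it. IsSupported is a made-up criterion.
struct BwdDataOp : BaseOp
{
    std::string Name() const override { return "bwd_data"; }
    bool IsSupported(int channels) const { return channels % 8 == 0; }
};

// Counts the instances that are BwdDataOp and report support for `channels`.
// dynamic_cast yields nullptr both for an empty slot and for an instance of
// another type, so the result is checked before any member call is made.
int CountSupported(const std::vector<std::unique_ptr<BaseOp>>& ops, int channels)
{
    int count = 0;
    for(const auto& op : ops)
    {
        const auto* bwd = dynamic_cast<const BwdDataOp*>(op.get());
        if(bwd == nullptr)
            continue; // without this guard, bwd->IsSupported() below would be UB
        if(bwd->IsSupported(channels))
            ++count;
    }
    return count;
}

// Builds a small list with one non-matching instance, one matching instance,
// and one empty slot, to exercise both ways of getting a null pointer.
std::vector<std::unique_ptr<BaseOp>> MakeOps()
{
    std::vector<std::unique_ptr<BaseOp>> ops;
    ops.push_back(std::make_unique<GenericOp>());
    ops.push_back(std::make_unique<BwdDataOp>());
    ops.push_back(nullptr);
    return ops;
}
```

With the guard in place, CountSupported(MakeOps(), 64) only ever calls IsSupported on the single BwdDataOp instance. Whether the real code needs such a guard, or whether the report is a false positive, is exactly what the thread leaves open.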

atamazov avatar Jun 05 '23 16:06 atamazov

@iq136boy @carlushuang Looks like a false positive, BUT we need some CK expert to look at this.

atamazov avatar Jun 05 '23 16:06 atamazov

@carlushuang @zjing14 could you double check on this issue? If false positive, we can close it.

junliume avatar Jul 12 '23 00:07 junliume

@carlushuang @zjing14 could you double check on this issue? If false positive, we can close it.

Ping

atamazov avatar Sep 29 '23 23:09 atamazov

This does not seem to be reproducible so far. Lowering the urgency level; will assign to a CK team member.

junliume avatar Dec 25 '23 09:12 junliume

@junliume Is this reproducible with the latest ROCm 6.1.0? If not, can we close the ticket? Thanks!

ppanchad-amd avatar Apr 23 '24 16:04 ppanchad-amd