[BUG] Memory corruption/undefined behavior on GemmUniversal in 3.4.0 - 3.6.0 🐛
Description of the bug:
Affected versions: 3.4.0 through 3.6.0 (inclusive).
When building the unmodified example cutlass/examples/36_gather_scatter_fusion/gather_scatter_fusion.cu and linking it with some other code (attached to this report), I get the following error:
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=507904
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=1048576, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(1048576)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(1048576)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=2097152
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=240, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(240)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(240)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=960
/workspace/src/cutlass/include/cutlass/gemm/kernel/params_universal_base.h:95 GemmUniversal::Arguments::Arguments() - problem_size: cutlass::gemm::GemmCoord {248,240,1024}
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:191 GemmUniversal::Arguments::Arguments() - problem_size: cutlass::gemm::GemmCoord {248,240,1024}
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:310 GemmUniversalBase::get_workspace_size()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132 GemmUniversalBase::init_device_props()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:190 device_ordinal: (0), device_sms: (108), sm_occupancy: (2) smem_size: (81920) GemmKernel::kThreadCount: (128)
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:321 workspace_bytes: 0
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=0
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:271 GemmUniversalBase::can_implement()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:329 GemmUniversalBase::get_grid_shape()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132 GemmUniversalBase::init_device_props()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:342 tiled_shape: cutlass::gemm::GemmCoord {2,2,1}
grid_dims: {2, 2, 1}
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:368 GemmUniversal::can_implement()
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:438 returning kSuccess
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:406 GemmUniversalBase::initialize() - workspace 0, stream: null
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132 GemmUniversalBase::init_device_props()
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:434 GemmUniversalBase::run()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:444 grid: (2, 2, 1), block: (128, 1, 1), SMEM: (81920)
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:466 grid launch failed with error invalid argument
Got cutlass error: Error Internal at: 387
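For context, the final "Got cutlass error: Error Internal at: 387" line comes from the example's own status check after the GEMM call: "Error Internal" is the CUTLASS status string and 387 is a source line in the example, presumably printed by a CUTLASS_CHECK-style macro. A minimal sketch of such a check (the macro name here is hypothetical and the shape is assumed, not copied from the example):

#include <cstdlib>
#include <iostream>
#include "cutlass/cutlass.h"

// Hypothetical CUTLASS_CHECK-style macro: converts the CUTLASS status to its
// string form ("Error Internal" for Status::kErrorInternal) and reports the
// source line of the failing check.
#define CHECK_CUTLASS(status)                                   \
  {                                                             \
    cutlass::Status s_ = (status);                              \
    if (s_ != cutlass::Status::kSuccess) {                      \
      std::cerr << "Got cutlass error: "                        \
                << cutlass::cutlassGetStatusString(s_)          \
                << " at: " << __LINE__ << std::endl;            \
      std::exit(EXIT_FAILURE);                                  \
    }                                                           \
  }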
Steps to reproduce:
1. Take the official example cutlass/examples/36_gather_scatter_fusion/gather_scatter_fusion.cu; this file remains unchanged.
2. Use the attached gather_scatter_fusion1.cu, or create it yourself:
2.1) Make a copy of the example named gather_scatter_fusion1.cu.
2.2) Remove the main function and rename the run function to run2 in gather_scatter_fusion1.cu.
2.3) Delete the code at the end of run2 back to the status = gemm_op(); statement, removing that statement as well.
3. Compile both .cu files into one executable and run it.
IMPORTANT: gather_scatter_fusion1.cu MUST come first on the nvcc command line to reproduce the error; otherwise everything works fine!
nvcc -O0 -arch=native -ccbin=clang --expt-relaxed-constexpr -I./cutlass/include -I./cutlass/tools/util/include gather_scatter_fusion1.cu gather_scatter_fusion.cu -o out
./out
Output result:
Got cutlass error: Error Internal at: 387
But expected:
Passed!
Runtime: 0.0388416 ms
GFLOPs: 3138.31
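(As a consistency check, the expected GFLOPs figure matches the problem size in the log above: 2 * 248 * 240 * 1024 ≈ 1.219e8 FLOPs, and 1.219e8 FLOPs / 0.0388416 ms ≈ 3.138e12 FLOP/s ≈ 3138 GFLOPs.)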
Additional notes:
- IMPORTANT: The error occurs only when the template parameters of cutlass::gemm::device::GemmUniversal are exactly the same in both .cu files (which means the SASS code is exactly the same in both .o files).
- The error occurs with clang at -O2 or -O0, and with gcc at -O0. (gcc with -O2 runs as expected, at least with this version of the code.)
- None of the functions in gather_scatter_fusion1.cu are called during the test; the mere existence of the function run2 is enough to break the program (see the sketch after this list).
- The last CUDA API call is cudaLaunchKernel; in the debugger its arguments look reasonable.
- cudaGetLastError() returns cudaErrorInvalidValue.
- I found the problematic commit using git bisect: 8236f30675bbe98f81d11c05764b77bfcb25b8cc (the 3.4.0 release commit).
- Since the source code of the individual commits of this huge PR is not available, I cannot investigate the error further.
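To make these notes concrete, here is a minimal standalone sketch of the translation-unit layout they describe: both .cu files instantiate the identical kernel template, and the first defines only a never-called run2. The file and symbol names are placeholders; this is an illustration of the layout only, not the attached CUTLASS reproducer, and it is not expected to trigger the failure on its own.

// kernel.h - included by both .cu files, so both translation units end up
// instantiating kernel<float> with identical code, mirroring the identical
// GemmUniversal instantiations described above.
#pragma once
template <typename T>
__global__ void kernel(T *out) { out[threadIdx.x] = static_cast<T>(threadIdx.x); }

// tu1.cu - plays the role of gather_scatter_fusion1.cu: run2() is never
// called; its presence alone forces the duplicate kernel instantiation.
#include "kernel.h"
void run2(float *out) { kernel<float><<<1, 128>>>(out); }

// tu2.cu - plays the role of the unchanged gather_scatter_fusion.cu: it
// launches the same instantiation and checks the launch result with
// cudaGetLastError() (which in the real reproducer returns
// cudaErrorInvalidValue).
#include <cstdio>
#include <cuda_runtime.h>
#include "kernel.h"

int main() {
  float *out = nullptr;
  cudaMalloc(&out, 128 * sizeof(float));
  kernel<float><<<1, 128>>>(out);
  printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(out);
  return 0;
}

Compiling the two files in the same order as in the report (the run2 file first) mirrors the failing setup, e.g. nvcc tu1.cu tu2.cu -o out.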
Environment:
GPU: A100
Driver (nvidia-smi): 470.161.03, CUDA Version: 11.4
gcc version: 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
clang version: 18.1.3 (1ubuntu1)
Docker container:
nvcr.io/nvidia/cuda:12.6.3-devel-ubuntu24.04
Also reproduced on:
nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04, with the corresponding default tool and compiler versions.
Attachments: gather_scatter_fusion1.cu.txt, gather_scatter_fusion.cu.txt
cc: @IonThruster
Looks like you're rolling your own build system and command line flags. We don't support that. Are you able to repro using our build flags and CMake? Also, please specify your CUDA toolkit version.
Hello @thakkarV, same issue when building with CMake.
1. Copied gather_scatter_fusion1.cu into the cutlass/examples/36_gather_scatter_fusion directory.
2. Added gather_scatter_fusion1.cu to the cutlass/examples/36_gather_scatter_fusion/CMakeLists.txt file like this:
cutlass_example_add_executable(
36_gather_scatter_fusion
gather_scatter_fusion1.cu
gather_scatter_fusion.cu
)
Again, gather_scatter_fusion1.cu must come first in the list.
3. Run the commands:
cd cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCMAKE_BUILD_TYPE=Debug
cmake --build . --target test_examples_36_gather_scatter_fusion
4. Got the error:
Building CUDA object examples/36_gather_scatter_fusion/CMakeFiles/36_gather_scatter_fusion.dir/gather_scatter_fusion1.cu.o
Building CUDA object examples/36_gather_scatter_fusion/CMakeFiles/36_gather_scatter_fusion.dir/gather_scatter_fusion.cu.o
Linking CUDA executable 36_gather_scatter_fusion
Built target 36_gather_scatter_fusion
Got cutlass error: Error Internal at: 387
gmake[3]: *** [examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/build.make:70: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion] Error 1
gmake[2]: *** [CMakeFiles/Makefile2:29530: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:29537: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/rule] Error 2
gmake: *** [Makefile:12082: test_examples_36_gather_scatter_fusion] Error 2
nvcc --version output:
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
Does it work correctly if you use a release build instead of debug?
Yes, it works correctly when using the GCC host compiler (Release configuration).