[BUG] Memory corruption/undefined behavior on GemmUniversal in 3.4.0 - 3.6.0 🐛
Description of the bug:
Affected versions: 3.4.0 through 3.6.0 (inclusive).
When building the unmodified example cutlass/examples/36_gather_scatter_fusion/gather_scatter_fusion.cu and linking it with some other code (attached to this report), I get the following error:
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=507904
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=1048576, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(1048576)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(1048576)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=2097152
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=240, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(240)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(240)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=960
/workspace/src/cutlass/include/cutlass/gemm/kernel/params_universal_base.h:95 GemmUniversal::Arguments::Arguments() - problem_size: cutlass::gemm::GemmCoord {248,240,1024}
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:191 GemmUniversal::Arguments::Arguments() - problem_size: cutlass::gemm::GemmCoord {248,240,1024}
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:310 GemmUniversalBase::get_workspace_size()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132 GemmUniversalBase::init_device_props()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:190 device_ordinal: (0), device_sms: (108), sm_occupancy: (2) smem_size: (81920) GemmKernel::kThreadCount: (128)
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:321 workspace_bytes: 0
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=0
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:271 GemmUniversalBase::can_implement()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:329 GemmUniversalBase::get_grid_shape()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132 GemmUniversalBase::init_device_props()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:342 tiled_shape: cutlass::gemm::GemmCoord {2,2,1}
grid_dims: {2, 2, 1}
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:368 GemmUniversal::can_implement()
/workspace/src/cutlass/include/cutlass/gemm/kernel/gemm_universal.h:438 returning kSuccess
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:406 GemmUniversalBase::initialize() - workspace 0, stream: null
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:132 GemmUniversalBase::init_device_props()
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:181 cutlass::HostTensor::reserve(count=253952, device_backed_=true)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:189 cutlass::HostTensor::reserve: host_.resize(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/host_tensor.h:197 cutlass::HostTensor::reserve: device_memory::allocate(253952)
/workspace/src/cutlass/tools/util/include/cutlass/util/device_memory.h:77 cutlass::device_memory::allocate: Successful cudaMalloc: bytes=1015808
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:434 GemmUniversalBase::run()
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:444 grid: (2, 2, 1), block: (128, 1, 1), SMEM: (81920)
/workspace/src/cutlass/include/cutlass/gemm/device/gemm_universal_base.h:466 grid launch failed with error invalid argument
Got cutlass error: Error Internal at: 387
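For context, the final "Got cutlass error: Error Internal at: 387" line comes from the example's own status check after the GEMM call: "Error Internal" is the CUTLASS status string and 387 is a source line in the example, presumably printed by a CUTLASS_CHECK-style macro. A minimal sketch of such a check (the macro name here is hypothetical and the shape is assumed, not copied from the example):

#include <cstdlib>
#include <iostream>
#include "cutlass/cutlass.h"

// Hypothetical CUTLASS_CHECK-style macro: converts the CUTLASS status to its
// string form ("Error Internal" for Status::kErrorInternal) and reports the
// source line of the failing check.
#define CHECK_CUTLASS(status)                                   \
  {                                                             \
    cutlass::Status s_ = (status);                              \
    if (s_ != cutlass::Status::kSuccess) {                      \
      std::cerr << "Got cutlass error: "                        \
                << cutlass::cutlassGetStatusString(s_)          \
                << " at: " << __LINE__ << std::endl;            \
      std::exit(EXIT_FAILURE);                                  \
    }                                                           \
  }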
Steps to reproduce:
1. Take the official example cutlass/examples/36_gather_scatter_fusion/gather_scatter_fusion.cu; this file remains unchanged.
2. Use the attached gather_scatter_fusion1.cu, or create it yourself:
2.1) Make a copy of the example named gather_scatter_fusion1.cu.
2.2) Remove the main function and rename the run function to run2 in gather_scatter_fusion1.cu.
2.3) Delete the code at the end of run2 back to the status = gemm_op(); statement, removing that statement as well.
3. Compile both .cu files into one executable and run it.
IMPORTANT: gather_scatter_fusion1.cu MUST come first on the nvcc command line to reproduce the error; otherwise everything works fine!
nvcc -O0 -arch=native -ccbin=clang --expt-relaxed-constexpr -I./cutlass/include -I./cutlass/tools/util/include gather_scatter_fusion1.cu gather_scatter_fusion.cu -o out
./out
Output result:
Got cutlass error: Error Internal at: 387
But expected:
Passed!
Runtime: 0.0388416 ms
GFLOPs: 3138.31
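(As a consistency check, the expected GFLOPs figure matches the problem size in the log above: 2 * 248 * 240 * 1024 ≈ 1.219e8 FLOPs, and 1.219e8 FLOPs / 0.0388416 ms ≈ 3.138e12 FLOP/s ≈ 3138 GFLOPs.)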
Additional notes:
- IMPORTANT: The error occurs only when the template parameters of cutlass::gemm::device::GemmUniversal are exactly the same in both .cu files (which means the SASS code is exactly the same in both .o files).
- The error occurs with clang at -O2 or -O0, and with gcc at -O0. (gcc with -O2 runs as expected, at least with this version of the code.)
- None of the functions in gather_scatter_fusion1.cu are called during the test; the mere existence of the function run2 is enough to break the program (see the sketch after this list).
- The last CUDA API call is cudaLaunchKernel; in the debugger its arguments look reasonable.
- cudaGetLastError() returns cudaErrorInvalidValue.
- I found the problematic commit using git bisect: 8236f30675bbe98f81d11c05764b77bfcb25b8cc (the 3.4.0 release commit).
- Since the source code of the individual commits of this huge PR is not available, I cannot investigate the error further.
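To make these notes concrete, here is a minimal standalone sketch of the translation-unit layout they describe: both .cu files instantiate the identical kernel template, and the first defines only a never-called run2. The file and symbol names are placeholders; this is an illustration of the layout only, not the attached CUTLASS reproducer, and it is not expected to trigger the failure on its own.

// kernel.h - included by both .cu files, so both translation units end up
// instantiating kernel<float> with identical code, mirroring the identical
// GemmUniversal instantiations described above.
#pragma once
template <typename T>
__global__ void kernel(T *out) { out[threadIdx.x] = static_cast<T>(threadIdx.x); }

// tu1.cu - plays the role of gather_scatter_fusion1.cu: run2() is never
// called; its presence alone forces the duplicate kernel instantiation.
#include "kernel.h"
void run2(float *out) { kernel<float><<<1, 128>>>(out); }

// tu2.cu - plays the role of the unchanged gather_scatter_fusion.cu: it
// launches the same instantiation and checks the launch result with
// cudaGetLastError() (which in the real reproducer returns
// cudaErrorInvalidValue).
#include <cstdio>
#include <cuda_runtime.h>
#include "kernel.h"

int main() {
  float *out = nullptr;
  cudaMalloc(&out, 128 * sizeof(float));
  kernel<float><<<1, 128>>>(out);
  printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaFree(out);
  return 0;
}

Compiling the two files in the same order as in the report (the run2 file first) mirrors the failing setup, e.g. nvcc tu1.cu tu2.cu -o out.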
Environment:
GPU: A100
Driver (nvidia-smi): 470.161.03, CUDA Version: 11.4
gcc version: 13.3.0 (Ubuntu 13.3.0-6ubuntu2~24.04)
clang version: 18.1.3 (1ubuntu1)
Docker container:
nvcr.io/nvidia/cuda:12.6.3-devel-ubuntu24.04
Also reproduced on:
nvcr.io/nvidia/cuda:12.4.1-devel-ubuntu22.04, with the corresponding default tool and compiler versions.
Attachments: gather_scatter_fusion1.cu.txt, gather_scatter_fusion.cu.txt
cc: @IonThruster
Looks like you're rolling your own build system and command line flags. We don't support that. Are you able to repro using our build flags and CMake? Also, please specify your CUDA toolkit version.
Hello @thakkarV, same issue when building with CMake.
1. Copied gather_scatter_fusion1.cu into the cutlass/examples/36_gather_scatter_fusion directory.
2. Added gather_scatter_fusion1.cu to the cutlass/examples/36_gather_scatter_fusion/CMakeLists.txt file like this:
cutlass_example_add_executable(
36_gather_scatter_fusion
gather_scatter_fusion1.cu
gather_scatter_fusion.cu
)
Again, gather_scatter_fusion1.cu must come first in the list.
3. Run the commands:
cd cutlass
mkdir build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCMAKE_BUILD_TYPE=Debug
cmake --build . --target test_examples_36_gather_scatter_fusion
4. Got the error:
Building CUDA object examples/36_gather_scatter_fusion/CMakeFiles/36_gather_scatter_fusion.dir/gather_scatter_fusion1.cu.o
Building CUDA object examples/36_gather_scatter_fusion/CMakeFiles/36_gather_scatter_fusion.dir/gather_scatter_fusion.cu.o
Linking CUDA executable 36_gather_scatter_fusion
Built target 36_gather_scatter_fusion
Got cutlass error: Error Internal at: 387
gmake[3]: *** [examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/build.make:70: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion] Error 1
gmake[2]: *** [CMakeFiles/Makefile2:29530: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/all] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:29537: examples/36_gather_scatter_fusion/CMakeFiles/test_examples_36_gather_scatter_fusion.dir/rule] Error 2
gmake: *** [Makefile:12082: test_examples_36_gather_scatter_fusion] Error 2
nvcc --version output:
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Oct_29_23:50:19_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
Does it work correctly if you use a release build instead of debug?
Yes, it works correctly when using the GCC host compiler (Release configuration).