Correct usage of `cuda.core._memory.Buffer`?
I am trying to allocate a workspace for cuBLASLt using `cuda.core`. First, I allocate a memory `Buffer` like so:
```python
from cuda.core.experimental import Device

device = Device()
device.set_current()
# `size` and `stream` are defined elsewhere
buffer = device.allocate(size=size, stream=stream)
raw_workspace_ptr: int = buffer.handle.getPtr()
```
Then later I pass this pointer to cublaslt via the nvmath-python bindings like so:
```python
cublaslt.matmul(
    self.handle,
    self.mm_desc,
    self.alpha.ctypes.data,
    a.data_ptr,
    self.a_layout_ptr,
    b.data_ptr,
    self.b_layout_ptr,
    self.beta.ctypes.data,
    c_ptr,
    self.c_layout_ptr,
    self.result.data_ptr,
    self.d_layout_ptr,
    algorithm_struct.ctypes.data,
    raw_workspace_ptr,    # the pointer from the Buffer goes here
    self.workspace_size,  # same size used to allocate the buffer
    stream_holder.ptr,
)
```
The problem is that when I use this `Buffer` abstraction from `cuda.core`, I get errors from the CUDA runtime. For example, when running under compute-sanitizer:
```
========= Invalid __global__ write of size 4 bytes
========= at void dot_kernel<float, (int)128, (int)0, cublasDotParams<cublasGemvTensorStridedBatched<const float>, cublasGemvTensorStridedBatched<float>>>(T4)+0xd70
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x7f1ac5d49420 is out of bounds
========= and is 139697130345438 bytes after the nearest allocation at 0xd00000000 of size 67 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x7ee3eb5] in libcublasLt.so.12
========= Host Frame: [0x7f4a3f7] in libcublasLt.so.12
========= Host Frame: [0x1b1ab14] in libcublasLt.so.12
========= Host Frame: [0x1b1c010] in libcublasLt.so.12
========= Host Frame: [0xf81c1d] in libcublasLt.so.12
========= Host Frame: [0x10c0b58] in libcublasLt.so.12
========= Host Frame: cublasLtMatmul [0x10c4dcc] in libcublasLt.so.12
========= Host Frame: __pyx_f_6nvmath_8bindings_10cycublasLt_cublasLtMatmul(void*, void*, void const*, void const*, void*, void const*, void*, void const*, void const*, void*, void*, void*, cublasLtMatmulAlgo_t const*, void*, unsigned long, CUstream_st*) [0x57b5] in cycublasLt.cpython-312-x86_64-linux-gnu.so
========= Host Frame: __pyx_f_6nvmath_8bindings_8cublasLt_matmul(long, long, long, long, long, long, long, long, long, long, long, long, long, long, unsigned long, long, int) [0x5ca7d] in cublasLt.cpython-312-x86_64-linux-gnu.so
========= Host Frame: __pyx_pw_6nvmath_8bindings_8cublasLt_13matmul(_object*, _object* const*, long, _object*) [0x78fae] in cublasLt.cpython-312-x86_64-linux-gnu.so
```
It seems to be reporting that the workspace pointer is an invalid memory address. When I use the allocators provided by CuPy or PyTorch instead, there are no errors.
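For reference, the CuPy allocation path that works looks roughly like this (a sketch, with `size` as above):

```python
import cupy as cp

# Allocate the workspace with CuPy instead of cuda.core.
workspace = cp.empty(size, dtype=cp.uint8)
raw_workspace_ptr: int = workspace.data.ptr  # device address as a plain int
```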
Looking for opinions on:
- Whether I am allocating/using this `Buffer` in the expected manner
- How I could create a reproducer or another memory-validation test that doesn't require setting up an entire matmul for cuBLASLt
I should check that the address reported by compute-sanitizer is near the integer pointer address that I get from the `Buffer` object.
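A minimal sketch of that check, reusing `raw_workspace_ptr` from the snippet above and the faulting address from the sanitizer report:

```python
# Address reported as out of bounds by compute-sanitizer.
reported_addr = 0x7F1AC5D49420

print(hex(raw_workspace_ptr))             # pointer handed to cublasLtMatmul
print(hex(reported_addr))                 # faulting address
print(reported_addr - raw_workspace_ptr)  # offset between the two
```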
@carterbox I think the simplest example would look something like:
```python
from cuda.core.experimental import Device

size = ...  # define this

device = Device()
device.set_current()
stream = device.create_stream()

buffer1 = device.allocate(size=size, stream=stream)
buffer2 = device.allocate(size=size, stream=stream)
buffer1.copy_to(buffer2, stream=stream)
```
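If you also want to validate contents, one option is to round-trip known bytes through the two buffers using the low-level driver bindings (a sketch; error-code checking omitted, and note that the host pointers are passed as raw integer addresses):

```python
import ctypes
from cuda.bindings import driver

host_src = (ctypes.c_ubyte * size)(*([0xAB] * size))
host_dst = (ctypes.c_ubyte * size)()

# Host -> buffer1, buffer1 -> buffer2 (as above), buffer2 -> host, then compare.
driver.cuMemcpyHtoDAsync(int(buffer1.handle), ctypes.addressof(host_src), size, stream.handle)
buffer1.copy_to(buffer2, stream=stream)
driver.cuMemcpyDtoHAsync(ctypes.addressof(host_dst), int(buffer2.handle), size, stream.handle)
stream.sync()

assert bytes(host_dst) == bytes(host_src)
```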
Based on your error above, my best guess would be that either something is wrong with the size you're passing into the `device.allocate` call, or the value of `stream_holder.ptr` isn't referring to the same stream as `stream`.
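One quick way to check the second possibility (a sketch; `stream_holder` comes from your code) is to compare the raw stream handles directly:

```python
# Both should print the same CUstream address if they refer to the same stream.
print(hex(int(stream.handle)))  # cuda.core stream
print(hex(stream_holder.ptr))   # stream pointer passed to cublasLt
```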
I checked that the streams used at allocation time and at use time were the same, and noticed that the stream pointers I was getting from CuPy/Torch were not the same as the ones from cuda.core for the same stream. This led me to realize that I was doing something wrong when converting cuda.core Python objects into the addresses of the underlying C objects. For example:
```python
raw_workspace_ptr: int = buffer.handle.getPtr()
```
This is incorrect, because it returns a pointer associated with the Python `cuda.bindings` object, not the address of the actual memory buffer. Instead, we should do this:
```python
raw_workspace_ptr: int = int(buffer.handle)
```
This is Pythonic, I suppose, but it is neither obvious nor mentioned in the documentation of `Buffer` or `Stream`.
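To make the difference concrete, here is a small sketch contrasting the two (the printed addresses will vary):

```python
from cuda.core.experimental import Device

device = Device()
device.set_current()
stream = device.create_stream()
buffer = device.allocate(size=64, stream=stream)

print(hex(int(buffer.handle)))      # device address of the allocation; pass this to C APIs
print(hex(buffer.handle.getPtr()))  # host address tied to the Python binding object
```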
This is documented here: https://nvidia.github.io/cuda-python/cuda-bindings/latest/tips_and_tricks.html#getting-the-address-of-underlying-c-objects-from-the-low-level-bindings
But I agree this isn't the clearest, and it is prone to exactly the situation you ran into.
I'm thinking I want to contribute a documentation fix which either:

- Adds a note to all the `cuda.core` classes that you need to call `int(Class().handle)` to get the pointer address of the C object, or
- Adds a docstring to `__int__()` for all `cuda.bindings` classes so that it's obvious from the documentation page that this is a valid operation. I maintain that it's not obvious enough that this operator is defined for these classes.
Number 2 is probably the better approach?
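As a sketch of option 2, the added docstring might read something like this (the wording and the class shown are just an illustration, not the actual cuda.bindings source):

```python
# Hypothetical sketch of the proposed docstring.
class CUdeviceptr:
    def __int__(self) -> int:
        """Return the value of the underlying C handle (for a
        ``CUdeviceptr``, the raw device address), suitable for passing
        to C APIs that expect an address.

        Note: this differs from ``getPtr()``, which returns the host
        address at which this Python object stores the handle.
        """
        ...
```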
@carterbox this was discussed in an offline meeting, and it was generally agreed that we aren't happy with the current state of things with regards to `getPtr()` vs `__int__()` and the lack of intuitiveness for a Python developer.
I'm going to write up a new issue that captures the discussion and some next steps, and close this one.
> it was generally agreed that we aren't happy with the current state of things with regards to `getPtr()` vs `__int__()` and the lack of intuitiveness for a Python developer. I'm going to write up a new issue that captures the discussion and some next steps, and close this one.
We discussed further offline, and to move away from `__int__()` we will implement #564. @carterbox, could you make the necessary doc changes to cuda.core to clarify the status quo, as you suggested earlier?