Correct usage of `cuda.core._memory.Buffer`?
I am trying to allocate a workspace for cuBLASLt using `cuda.core`. First, I allocate a memory `Buffer` like so:
```python
from cuda.core.experimental import Device

device = Device()
device.set_current()
# `size` and `stream` are defined elsewhere
buffer = device.allocate(size=size, stream=stream)
raw_workspace_ptr: int = buffer.handle.getPtr()
```
Then later I pass this pointer to cublaslt via the nvmath-python bindings like so:
```python
cublaslt.matmul(
    self.handle,
    self.mm_desc,
    self.alpha.ctypes.data,
    a.data_ptr,
    self.a_layout_ptr,
    b.data_ptr,
    self.b_layout_ptr,
    self.beta.ctypes.data,
    c_ptr,
    self.c_layout_ptr,
    self.result.data_ptr,
    self.d_layout_ptr,
    algorithm_struct.ctypes.data,
    raw_workspace_ptr,    # the pointer from the Buffer goes here
    self.workspace_size,  # same size used to allocate the buffer
    stream_holder.ptr,
)
```
The problem is that when I use this `Buffer` abstraction from `cuda.core`, I get errors from the CUDA runtime. For example, when running under compute-sanitizer:
```
========= Invalid __global__ write of size 4 bytes
========= at void dot_kernel<float, (int)128, (int)0, cublasDotParams<cublasGemvTensorStridedBatched<const float>, cublasGemvTensorStridedBatched<float>>>(T4)+0xd70
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x7f1ac5d49420 is out of bounds
========= and is 139697130345438 bytes after the nearest allocation at 0xd00000000 of size 67 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x7ee3eb5] in libcublasLt.so.12
========= Host Frame: [0x7f4a3f7] in libcublasLt.so.12
========= Host Frame: [0x1b1ab14] in libcublasLt.so.12
========= Host Frame: [0x1b1c010] in libcublasLt.so.12
========= Host Frame: [0xf81c1d] in libcublasLt.so.12
========= Host Frame: [0x10c0b58] in libcublasLt.so.12
========= Host Frame: cublasLtMatmul [0x10c4dcc] in libcublasLt.so.12
========= Host Frame: __pyx_f_6nvmath_8bindings_10cycublasLt_cublasLtMatmul(void*, void*, void const*, void const*, void*, void const*, void*, void const*, void const*, void*, void*, void*, cublasLtMatmulAlgo_t const*, void*, unsigned long, CUstream_st*) [0x57b5] in cycublasLt.cpython-312-x86_64-linux-gnu.so
========= Host Frame: __pyx_f_6nvmath_8bindings_8cublasLt_matmul(long, long, long, long, long, long, long, long, long, long, long, long, long, long, unsigned long, long, int) [0x5ca7d] in cublasLt.cpython-312-x86_64-linux-gnu.so
========= Host Frame: __pyx_pw_6nvmath_8bindings_8cublasLt_13matmul(_object*, _object* const*, long, _object*) [0x78fae] in cublasLt.cpython-312-x86_64-linux-gnu.so
```
It seems to be reporting that the workspace pointer is an invalid memory address. When I use the allocators provided by CuPy or PyTorch instead, there are no errors.
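For reference, the CuPy allocation path that works looks roughly like this (a sketch, with `size` as above):

```python
import cupy as cp

# Allocate the workspace with CuPy instead of cuda.core.
workspace = cp.empty(size, dtype=cp.uint8)
raw_workspace_ptr: int = workspace.data.ptr  # device address as a plain int
```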
Looking for opinions on:
- Whether I am allocating/using this `Buffer` in the expected manner
- How I could create a reproducer or another memory-validation test that doesn't require setting up an entire matmul for cuBLASLt
I should check that the address reported by compute-sanitizer is near the integer pointer address that I get from the `Buffer` object.
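A minimal sketch of that check, reusing `raw_workspace_ptr` from the snippet above and the faulting address from the sanitizer report:

```python
# Address reported as out of bounds by compute-sanitizer.
reported_addr = 0x7F1AC5D49420

print(hex(raw_workspace_ptr))             # pointer handed to cublasLtMatmul
print(hex(reported_addr))                 # faulting address
print(reported_addr - raw_workspace_ptr)  # offset between the two
```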
@carterbox I think the simplest example would look something like:
```python
from cuda.core.experimental import Device

size = ...  # define this

device = Device()
device.set_current()
stream = device.create_stream()

buffer1 = device.allocate(size=size, stream=stream)
buffer2 = device.allocate(size=size, stream=stream)
buffer1.copy_to(buffer2, stream=stream)
```
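If you also want to validate contents, one option is to round-trip known bytes through the two buffers using the low-level driver bindings (a sketch; error-code checking omitted, and note that the host pointers are passed as raw integer addresses):

```python
import ctypes
from cuda.bindings import driver

host_src = (ctypes.c_ubyte * size)(*([0xAB] * size))
host_dst = (ctypes.c_ubyte * size)()

# Host -> buffer1, buffer1 -> buffer2 (as above), buffer2 -> host, then compare.
driver.cuMemcpyHtoDAsync(int(buffer1.handle), ctypes.addressof(host_src), size, stream.handle)
buffer1.copy_to(buffer2, stream=stream)
driver.cuMemcpyDtoHAsync(ctypes.addressof(host_dst), int(buffer2.handle), size, stream.handle)
stream.sync()

assert bytes(host_dst) == bytes(host_src)
```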
Based on your error above, my best guess would be that either something is wrong with the size you're passing into the `device.allocate` call, or the value of `stream_holder.ptr` isn't referring to the same stream as `stream`.
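One quick way to check the second possibility (a sketch; `stream_holder` comes from your code) is to compare the raw stream handles directly:

```python
# Both should print the same CUstream address if they refer to the same stream.
print(hex(int(stream.handle)))  # cuda.core stream
print(hex(stream_holder.ptr))   # stream pointer passed to cublasLt
```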
I checked that the streams used at allocation time and at use time were the same, and noticed that the stream pointers I was getting from CuPy/Torch were not the same as the ones from cuda.core for the same stream. This led me to realize that I was doing something wrong when converting cuda.core Python objects into the addresses of the underlying C objects. For example:
```python
raw_workspace_ptr: int = buffer.handle.getPtr()
```
This is incorrect, because it returns a pointer associated with the Python `cuda.bindings` object, not the address of the actual memory buffer. Instead, we should do this:
```python
raw_workspace_ptr: int = int(buffer.handle)
```
This is Pythonic, I suppose, but it is neither obvious nor mentioned in the documentation of `Buffer` or `Stream`.
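To make the difference concrete, here is a small sketch contrasting the two (the printed addresses will vary):

```python
from cuda.core.experimental import Device

device = Device()
device.set_current()
stream = device.create_stream()
buffer = device.allocate(size=64, stream=stream)

print(hex(int(buffer.handle)))      # device address of the allocation; pass this to C APIs
print(hex(buffer.handle.getPtr()))  # host address tied to the Python binding object
```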
This is documented here: https://nvidia.github.io/cuda-python/cuda-bindings/latest/tips_and_tricks.html#getting-the-address-of-underlying-c-objects-from-the-low-level-bindings
But I agree this isn't the clearest, and it is prone to exactly the situation you ran into.
I'm thinking I want to contribute a documentation fix which either:

- Adds a note to all the `cuda.core` classes that you need to call `int(Class().handle)` to get the pointer address of the C object, or
- Adds a docstring to `__int__()` for all `cuda.bindings` classes so that it's obvious from the documentation page that this is a valid operation. I maintain that it's not obvious enough that this operator is defined for these classes.
Number 2 is probably the better approach?
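As a sketch of option 2, the added docstring might read something like this (the wording and the class shown are just an illustration, not the actual cuda.bindings source):

```python
# Hypothetical sketch of the proposed docstring.
class CUdeviceptr:
    def __int__(self) -> int:
        """Return the value of the underlying C handle (for a
        ``CUdeviceptr``, the raw device address), suitable for passing
        to C APIs that expect an address.

        Note: this differs from ``getPtr()``, which returns the host
        address at which this Python object stores the handle.
        """
        ...
```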
@carterbox this was discussed in an offline meeting, and it was generally agreed that we aren't happy with the current state of things with regards to `getPtr()` vs `__int__()` and the lack of intuitiveness for a Python developer.
I'm going to write up a new issue that captures the discussion and some next steps, and close this one.
> it was generally agreed that we aren't happy with the current state of things with regards to `getPtr()` vs `__int__()` and the lack of intuitiveness for a Python developer. I'm going to write up a new issue that captures the discussion and some next steps, and close this one.
We discussed further offline, and to move away from `__int__()` we will implement #564. @carterbox, could you make the necessary doc changes to cuda.core to clarify the status quo, as you suggested earlier?