Querying current device is slow compared to CuPy

Open shwina opened this issue 11 months ago • 7 comments

Getting the current device using cuda.core is quite a bit slower than CuPy:

In [1]: import cupy as cp

In [2]: %timeit cp.cuda.Device()
69 ns ± 0.496 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [3]: from cuda.core.experimental import Device

In [4]: %timeit Device()
795 ns ± 0.273 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Ultimately, my goal is to get the compute capability of the current device, and this is even slower:

In [5]: %timeit cp.cuda.Device().compute_capability
89.1 ns ± 0.413 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [6]: %timeit Device().compute_capability
2.64 μs ± 122 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Are there tricks (e.g., caching) that CuPy is employing here that cuda.core could use as well? Alternatively, is there another way for me to use cuda.core or cuda.bindings to get this information quickly? Note that for my use case I'm not too concerned about the cost of the first call to Device(), but I do want subsequent calls to be trivially inexpensive if the current device hasn't changed.


Using the low-level cuda.bindings directly (here, driver and runtime are the cuda.bindings modules) is also not as fast as CuPy:

In [11]: def get_cc():
    ...:     dev = runtime.cudaGetDevice()[1]
    ...:     return driver.cuDeviceComputeCapability(dev)
    ...:

In [12]: get_cc()
Out[12]: (<CUresult.CUDA_SUCCESS: 0>, 7, 5)

In [13]: %timeit get_cc()
597 ns ± 0.494 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
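
For concreteness, here's a rough sketch of the kind of helper I'd like to end up with (hypothetical names, error checking omitted): the first call per device can be expensive, but repeated calls should be nearly free.

import functools
from cuda.bindings import driver, runtime

@functools.lru_cache(maxsize=None)
def _compute_capability(device_ordinal):
    # Queried once per device ordinal, then served from the cache.
    _, major, minor = driver.cuDeviceComputeCapability(device_ordinal)
    return (major, minor)

def current_compute_capability():
    # The current-device query still has to be made on every call.
    _, dev = runtime.cudaGetDevice()
    return _compute_capability(dev)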

shwina avatar Feb 07 '25 16:02 shwina

We'll have to cache the CC at the per-Device-object level to bring this down to the O(10) ns level.

In [31]: data = {}  # cache: device ordinal -> (cc_major, cc_minor)

In [32]: def get_cc(dev):
    ...:    if dev in data:
    ...:        return data[dev]
    ...:    data[dev] = (driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev)[1],
    ...:                 driver.cuDeviceGetAttribute(driver.CUdevice_attribute.CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev)[1])
    ...:    return data[dev]
    ...: 

In [33]: get_cc(1)
Out[33]: (12, 0)

In [36]: %timeit get_cc(1)
51.7 ns ± 0.0214 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [37]: %timeit cp.cuda.Device().compute_capability
179 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

This is also what CuPy does internally: https://github.com/cupy/cupy/blob/1f9c9d4d1eb2edcbeb2a9294def57c2252e18b92/cupy/cuda/device.pyx#L213-L214

leofang avatar Feb 14 '25 04:02 leofang

I did some refactoring of Device.__new__() to replace the cudart APIs with driver APIs, and found that performance got even worse. Out of curiosity, I did some quick profiling and was very surprised by the results (the following already has the primary context set to current):

In [19]: %timeit runtime.cudaGetDevice()
338 ns ± 0.463 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [20]: %timeit driver.cuCtxGetDevice()
406 ns ± 1.79 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [21]: %timeit cp.cuda.runtime.getDevice()
112 ns ± 0.822 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

A simple get-device call using cuda.bindings is 3-4x slower than CuPy's. @vzhurba01, have we seen something like this before?

leofang avatar Feb 21 '25 02:02 leofang

Accessing Device().compute_capability is being addressed in #459. Let me re-label this issue to track the remaining binding performance issue.

leofang avatar Feb 24 '25 14:02 leofang

@rwgk reported that cuDriverGetVersion is also sluggish when called repeatedly in a busy loop.

leofang avatar Feb 27 '25 20:02 leofang

This line accounts for about 82% of the runtime of cuda.core.experimental.Device():

https://github.com/NVIDIA/cuda-python/blob/8cda903b365e248af9008d460a718978ea3d649c/cuda_core/cuda/core/experimental/_device.py#L960

I figured that out by replacing that line with a hard-wired device_id = 0.

rwgk avatar Mar 12 '25 23:03 rwgk

Small update:

I made a few trivial performance changes that helped quite a bit, then compared the performance with and without this diff, all else the same:

-                err, device_id = runtime.cudaGetDevice()                        
-                assert err == driver.CUresult.CUDA_SUCCESS                      
+                device_id = 0  # hard-wired

cupy.cuda.Device()      0.07 µs per call
cuda.bindings Device()  0.33 µs per call  without that diff
cuda.bindings Device()  0.08 µs per call  with that diff

I.e., almost the entire remaining performance difference is due to runtime.cudaGetDevice().

rwgk avatar Mar 12 '25 23:03 rwgk

Yes, see https://github.com/NVIDIA/cuda-python/issues/439#issuecomment-2673234572. Right now the problem is in cuda.bindings, not cuda.core. I have changed the issue label to reflect this status.

leofang avatar Mar 13 '25 04:03 leofang

Here's my investigation report.

Version 1. This is a minimized version of cuCtxGetDevice to gauge the speed of light: the CUdevice is created by the user and we skip returning the error code:

def cuCtxGetDevice(device : CUdevice):
    cydriver.cuCtxGetDevice(<cydriver.CUdevice*>device._pvt_ptr)

with results: 62.1 ns ± 0.00271 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Version 2. Creates the CUdevice, but skips returning the error code:

def cuCtxGetDevice():
    cdef CUdevice device = CUdevice()
    err = cydriver.cuCtxGetDevice(<cydriver.CUdevice*>device._pvt_ptr)
    return (None, device)

with results: 128 ns ± 1.19 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Version 3. Add return error code:

def cuCtxGetDevice(device : CUdevice):
    err = cydriver.cuCtxGetDevice(<cydriver.CUdevice*>device._pvt_ptr)
    return (CUresult(err), device)

with results: 390 ns ± 9.64 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

The problem is the creation of the CUresult class... it's really, really slow:

In [2]: %timeit driver.CUresult(0)
282 ns ± 5.81 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
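
For what it's worth, this isn't specific to our bindings: constructing any standard-library IntEnum from an int is similarly slow, while accessing an existing member is cheap. A standalone illustration (nothing CUDA-specific; exact numbers will vary):

import enum
import timeit

class Color(enum.IntEnum):
    RED = 0
    GREEN = 1
    BLUE = 2

n = 1_000_000
# Constructing a member from its value goes through the enum metaclass machinery.
print(timeit.timeit(lambda: Color(0), number=n) / n)    # hundreds of ns per call
# Accessing an existing member is a plain attribute lookup.
print(timeit.timeit(lambda: Color.RED, number=n) / n)   # tens of ns per call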

Here's one more version of cuCtxGetDevice. Version 4. Return the error code as an integer:

def cuCtxGetDevice():
    cdef CUdevice device = CUdevice()
    cdef cydriver.CUresult result = CUresult.CUDA_SUCCESS
    result = cydriver.cuCtxGetDevice(<cydriver.CUdevice*>device._pvt_ptr)
    return (result, device)

With results: 153 ns ± 0.036 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

With regard to the next step... I do see that returning the enum member directly gives better results:

In [5]: %timeit driver.CUresult.CUDA_SUCCESS
31.6 ns ± 1.31 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Therefore I propose the following version. Version 5. Add a fast path for API success:

def cuCtxGetDevice():
    cdef CUdevice device = CUdevice()
    err = cydriver.cuCtxGetDevice(<cydriver.CUdevice*>device._pvt_ptr)
    if err == cydriver.cudaError_enum.CUDA_SUCCESS:
        return (CUresult.CUDA_SUCCESS, device)
    return (CUresult(err), device)

With results: 150 ns ± 1.35 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

This gives much better performance for our most common and important case, while the perf drop in the error case is less significant.

vzhurba01 avatar Apr 02 '25 23:04 vzhurba01

Wow! Great findings, Vlad! It is insane how slow IntEnum (or any Enum subclass from the standard library) is...

I wonder if it makes sense to build an internal cache ourselves? A built-in dict lookup is very fast (more than 10x faster than constructing the enum). Something like:

_m = dict(((int(v), v) for k, v in driver.CUresult.__members__.items()))

def cuCtxGetDevice():
    cdef CUdevice device = CUdevice()
    err = int(cydriver.cuCtxGetDevice(<cydriver.CUdevice*>device._pvt_ptr))
    return (_m[err], device)

This is reasonably fast from what I see:

In [51]: m = dict(((int(v), v) for k, v in driver.CUresult.__members__.items()))

In [52]: %timeit m[100]
20.5 ns ± 0.0777 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [53]: %timeit driver.CUresult(100)  # for comparison
254 ns ± 1.06 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Alternatively, we could drop the built-in IntEnum entirely, in favor of the counterpart that @oleksandr-pavlyk wrote recently (in https://github.com/NVIDIA/cccl/pull/4325). I haven't tested its perf, though, and this is potentially an API-breaking change (the behavior might not fully duck-type IntEnum). Roughly the shape I have in mind is sketched below. WDYT?
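
(A minimal sketch of the general idea only; I haven't checked how the linked PR actually implements it. The two member names/values below are real CUresult codes; everything else is made up.)

class FastCUresult(int):
    # Lightweight IntEnum-like replacement: members are plain int-subclass
    # instances cached in a dict, so mapping an error code back to a member
    # is a single dict lookup instead of going through EnumMeta.__call__.
    _members = {}
    __slots__ = ("_name",)

    def __new__(cls, value, name=None):
        if name is None:
            return cls._members[value]  # fast path: look up an existing member
        self = super().__new__(cls, value)
        self._name = name
        cls._members[value] = self
        return self

    def __repr__(self):
        return f"<CUresult.{self._name}: {int(self)}>"

FastCUresult.CUDA_SUCCESS = FastCUresult(0, "CUDA_SUCCESS")
FastCUresult.CUDA_ERROR_INVALID_VALUE = FastCUresult(1, "CUDA_ERROR_INVALID_VALUE")

assert FastCUresult(0) is FastCUresult.CUDA_SUCCESS  # lookup, not construction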

leofang avatar Apr 03 '25 02:04 leofang

(Your fast path is also reasonable FWIW; I just wonder if it's worth the effort.)

leofang avatar Apr 03 '25 02:04 leofang

My take from comparing version 1 and version 2 is that we wasted 100% overhead (60->120ns) just to create a tuple... We may want to think seriously about breaking the API in the next major release.

leofang avatar Apr 03 '25 02:04 leofang

just to create a tuple

Here's one more version. Version 6. No tuples, but create a new CUdevice:

def cuCtxGetDevice():
    cdef CUdevice device = CUdevice()
    err = cydriver.cuCtxGetDevice(<cydriver.CUdevice*>device._pvt_ptr)
    if err == cydriver.cudaError_enum.CUDA_SUCCESS:
        return device  # same return in both branches: only the tuple is dropped
    return device

With results: 115 ns ± 0.71 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Returning output arguments also adds overhead.

vzhurba01 avatar Apr 03 '25 19:04 vzhurba01

My take from comparing version 1 and version 2 is that we wasted 100% overhead (60->120ns) just to create a tuple...

I read it wrong. Creating the return tuple is reasonable (~10 ns).

leofang avatar Apr 03 '25 19:04 leofang