
CUDA Python: Performance meets Productivity

Results 261 cuda-python issues

~~Blocked by #459 & https://github.com/NVIDIA/cuda-python/issues/439#issuecomment-2673234572.~~ Before this PR: ```python In [7]: %timeit Device() 622 ns ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)...

enhancement
P1
cuda.core

> Indeed we're duplicating some of the information. It might take a few iterations to consolidate the two files and have something presentable on both PyPI and GitHub. Let's address...

documentation
P1
cuda.bindings
cuda.core

As discussed offline, the search logic here https://github.com/NVIDIA/cuda-python/blob/19563d59dd3f349e365fb57f3d95f1e7af649ad9/cuda_bindings/cuda/bindings/_internal/nvjitlink_windows.pyx#L57-L87 does not look for conda installations, which use the following path: ``` %CONDA_PREFIX%\Library\nvvm\lib\x64\nvvm.lib ``` One way to fix it is to...
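A conda lookup of the kind described could be sketched roughly as follows. This is only an illustration and not the code in the linked `.pyx` file: the function name `conda_candidate` is hypothetical, and the only detail taken from the issue is the `%CONDA_PREFIX%\Library\nvvm\lib\x64\nvvm.lib` layout.

```python
import os
from pathlib import PureWindowsPath


def conda_candidate(environ=None):
    """Return the conda-style library path from the issue, or None if CONDA_PREFIX is unset.

    `conda_candidate` is a hypothetical helper, not part of cuda.bindings.
    """
    env = os.environ if environ is None else environ
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        return None
    # PureWindowsPath joins with backslashes on any host OS.
    return str(PureWindowsPath(prefix, "Library", "nvvm", "lib", "x64", "nvvm.lib"))
```

A caller would presumably still need to check that the returned path exists before trying to load it.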

bug
P0
cuda.bindings

`no-cache`, for example, is not supported in an environment with CUDA 12.4. Find a compatibility matrix and apply it to the supported options when users construct a `LinkerOptions` instance....
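One possible shape for such a check, assuming the matrix maps each option name to the minimum version that accepts it. The helper name `unsupported_options` and the matrix layout are hypothetical; the real matrix still has to be found, as the issue says.

```python
def unsupported_options(requested, version, matrix):
    """Return the requested option names that `version` is too old for.

    `matrix` maps option name -> minimum (major, minor) version.
    Options missing from the matrix are assumed to be always supported.
    This helper is a hypothetical sketch, not part of cuda.core.
    """
    return sorted(
        name for name in requested
        if name in matrix and version < matrix[name]
    )
```

For example, with a placeholder entry `{"no_cache": (12, 5)}` (an illustrative threshold, not a confirmed one), `unsupported_options(["no_cache"], (12, 4), ...)` would report `no_cache`, and the caller could raise or warn before constructing the options object.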

enhancement
triage
P1
cuda.core

Markdown files are harder to cross-reference, and they cannot reference Sphinx objects.

documentation
P1
cuda.bindings
cuda.core

We support:
- Python scalars
- NumPy scalars
- ctypes scalars

These should all be tested in `test_launcher.py`.

triage
P1
test
cuda.core

We wanted to do this, but so far it seems we've only covered `Stream.from_handle`. We also want to cover these objects:
- `Program`
- `ObjectCode`
- `Kernel`
- `Buffer`

triage
P1
feature
cuda.core

- Support ctypes/numpy structs - make sure ctypes is deprioritized - Support converting arbitrary objects to `StridedMemoryView` - Benchmarking - measure `launch()` overhead - reimplement type dispatcher via dict lookup...
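The "type dispatcher via dict lookup" item could look something like the sketch below. The handler table and the name `prepare_arg` are hypothetical and do not reflect how `launch()` actually prepares arguments; the point is only to show one dict lookup on the exact type replacing a chain of `isinstance` checks.

```python
import ctypes

# Hypothetical handler table: exact Python type -> ctypes constructor.
_HANDLERS = {
    bool: ctypes.c_bool,
    int: ctypes.c_int64,
    float: ctypes.c_double,
}


def prepare_arg(arg):
    """Convert `arg` using a single dict lookup keyed on its exact type."""
    handler = _HANDLERS.get(type(arg))
    if handler is None:
        raise TypeError(f"unsupported argument type: {type(arg).__name__}")
    return handler(arg)
```

Keying on `type(arg)` keeps `bool` from being treated as `int`, but it also means subclasses of a registered type are not matched, so a fallback path would be needed for them.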

enhancement
triage
P1
cuda.core

> This is a bit trickier than I thought, because we also need the dict key "old"/"new" as a proxy to prepare for `args` (which is different for `cuModuleLoadDataEx`/`cuLibraryLoadData`). Let's...

enhancement
P2
cuda.core

Currently it takes 3.4 - 3.45 us (depending on stream-ordering or not) to create a memory view object: ```python In [4]: x = cp.empty((23, 4)) In [7]: %timeit s =...

enhancement
triage
P1
cuda.core