cuda-python
CUDA Python: Performance meets Productivity
- [ ] Users can set `cudaLaunchAttributeProgrammaticStreamSerialization` to do PDL.
- [ ] PDL launches are graph-compatible and this use case should be tested and showcased.
P0:
- [cudaMemcpyBatchAsync](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g6126baf5d881835091c59e48890d6854)

P1:
- [cudaMemDiscardBatchAsync](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g5acb7cea41bb9115f10568cc8176f51f)
- [cudaMemPrefetchBatchAsync](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge4fa23c9a26c6e5e702cbe35d001d589)
- [cudaMemDiscardAndPrefetchBatchAsync](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g0f6d2e27d8f00ee78c5d814f45500605)

https://docs.nvidia.com/cuda/cuda-programming-guide/03-advanced/advanced-host-programming.html#batched-memory-transfers
This task should cover updating both whole graphs and individual graph nodes.
The fun part would be: How to keep a generic Python object alive? https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/cuda-graphs.html#cuda-user-objects
For example, we released `cuda-bindings` and `cuda-python` 13.1.0 yesterday, but we did not add `13.1.0-notes.rst` to https://github.com/NVIDIA/cuda-python/tree/main/cuda_python/docs/source/release.
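A check along these lines could catch the omission before a release goes out. This is a sketch under assumptions: the helper name `missing_release_notes` is hypothetical, the `<version>-notes.rst` naming is taken from the file mentioned above, and the temporary directory with a `13.0.0` notes file only stands in for the real `release` directory.

```python
import tempfile
from pathlib import Path


def missing_release_notes(release_dir, versions):
    """Return the versions that have no `<version>-notes.rst` in release_dir."""
    release_dir = Path(release_dir)
    return [v for v in versions if not (release_dir / f"{v}-notes.rst").is_file()]


# Stand-in for the docs release directory: only one notes file is present.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "13.0.0-notes.rst").write_text("placeholder\n")
    missing = missing_release_notes(tmp, ["13.0.0", "13.1.0"])
```

In a release workflow the list of versions would come from the tags being published, and a non-empty result would fail the job.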
Currently this is low priority because there is no such thing as "libtile", only `tileiras`, which is an executable. We prefer in-process compilation through compiler libraries over subprocess calls to...
Tracking the failure below.

xref: https://github.com/NVIDIA/cuda-python/pull/1242#issuecomment-3545628920

All details are in the full logs:
- [qa_bindings_windows_2025-11-18+102913_build_log.txt](https://github.com/user-attachments/files/23611948/qa_bindings_windows_2025-11-18%2B102913_build_log.txt)
- [qa_bindings_windows_2025-11-18+102913_tests_log.txt](https://github.com/user-attachments/files/23611951/qa_bindings_windows_2025-11-18%2B102913_tests_log.txt)

The only non-obvious detail: `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0` was installed from `cuda_13.0.1_windows.exe`

**EDIT:** The...
Capturing feedback provided by @xiakun-lu offline. The NCCL team noticed that `uv sync` complains that `nccl4py[cu12]` and `nccl4py[cu13]` are incompatible (`uv venv && uv pip install -e .` works out of...
Instead of relying on stream capturing, which is considered an implementation detail (that in the future we could allow users to opt in or out), our graph builder APIs were...