ParallelReductionsBenchmark
Add Python CUDA reduction benchmark with cuda.cccl
Adds reduce_bench.py, a Python script that benchmarks parallel reductions on NVIDIA GPUs using the cuda.cccl library, and updates the README with usage instructions and example output. Users can now compare naive CuPy reductions against optimized CUDA JIT reductions directly from Python.
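For reviewers who want a feel for the comparison, here is a minimal sketch of a timing harness of this kind. It is not the contents of reduce_bench.py: the names `collect_backends`, `time_reduction`, and `run_benchmark` are hypothetical, and a plain Python `sum` stands in as the baseline so the sketch also runs on a machine without a GPU. The CuPy backend is registered only if CuPy imports and a CUDA device is visible. A cuda.cccl reduction could presumably be registered as one more (prepare, reduce) pair, but its API is not shown in this description, so it is left out.

```python
import time
from typing import Callable, Dict, List, Tuple

# Hypothetical layout: each backend is a (prepare, reduce) pair.
# prepare() converts host data into the form the backend needs (for example,
# a device array); only reduce() is timed, so transfers are excluded.
Backend = Tuple[Callable[[List[float]], object], Callable[[object], float]]


def collect_backends() -> Dict[str, Backend]:
    """Return the reductions available on this machine."""
    backends: Dict[str, Backend] = {
        # CPU stand-in baseline; always available.
        "python_sum": (lambda xs: xs, lambda xs: float(sum(xs))),
    }
    try:
        import cupy as cp  # optional dependency

        if cp.cuda.runtime.getDeviceCount() > 0:
            backends["cupy_sum"] = (
                lambda xs: cp.asarray(xs, dtype=cp.float32),
                # float() copies the scalar back to the host, which waits
                # for the reduction kernel to finish before timing stops.
                lambda d: float(cp.sum(d)),
            )
    except Exception:
        # CuPy missing, or no usable CUDA driver/device: skip this backend.
        pass
    return backends


def time_reduction(reduce: Callable[[object], float], prepared: object,
                   repeats: int = 5) -> Tuple[float, float]:
    """Run reduce(prepared) `repeats` times; return (result, best seconds)."""
    best = float("inf")
    result = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        result = reduce(prepared)
        best = min(best, time.perf_counter() - start)
    return result, best


def run_benchmark(data: List[float],
                  repeats: int = 5) -> Dict[str, Tuple[float, float]]:
    """Time every available backend on the same input."""
    results = {}
    for name, (prepare, reduce) in collect_backends().items():
        prepared = prepare(data)
        results[name] = time_reduction(reduce, prepared, repeats)
    return results


if __name__ == "__main__":
    values = [1.0] * 1_000_000
    for name, (total, seconds) in run_benchmark(values).items():
        rate = len(values) / seconds if seconds > 0 else float("inf")
        print(f"{name}: sum={total:.1f}, best={seconds * 1e3:.3f} ms, "
              f"{rate / 1e6:.1f} M elements/s")
```

Reporting the best of several repeats, rather than the first run, keeps one-time costs such as JIT compilation and memory-pool warm-up out of the number being compared.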
This solves #9