ParallelReductionsBenchmark
Thrust, CUB, TBB, AVX2, CUDA, OpenCL, OpenMP, SYCL - all it takes to sum a lot of numbers fast!
On some architectures, we won't have access to CUDA. On others, OpenCL might be an issue. Furthermore, the compiler flags for non-GCC & non-NVCC builds have to be properly configured...
Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones. Let's say our hardware supports fast 16...
It's hard to control the affinity of `std::thread` in conjunction with NUMA node-assignment, so we need to add a lower-level thread-pool implementation that directly uses the POSIX API.
The project currently provides multiple backends for x86: SSE, AVX, AVX-512; but none for Arm. NEON and SVE backends should be added.
Introduces reduce_bench.py, a Python script to benchmark parallel reductions on NVIDIA GPUs using the cuda.cccl library, and updates the README with usage instructions and example output. This allows users to...
Now that [CCCL v3](https://github.com/NVIDIA/cccl/releases/tag/v3.0.0) can be used for [efficient parallel reductions in Python](https://developer.nvidia.com/blog/delivering-the-missing-building-blocks-for-nvidia-cuda-kernel-fusion-in-python) it would be great to create an additional benchmark file - `reduce_bench.py` with Python-ic JIT-ed kernels for...
Thanks for sharing your benchmarks and [your thoughts on your blog](https://ashvardanian.com/posts/beyond-openmp-in-cpp-rust/). Another framework you may want to consider in Rust is [Paralight](https://docs.rs/paralight), which offers the best of both worlds: a...