ParallelReductionsBenchmark

Thrust, CUB, TBB, AVX2, CUDA, OpenCL, OpenMP, SYCL - all it takes to sum a lot of numbers fast!

Results 7 ParallelReductionsBenchmark issues

On some architectures, we won't have access to CUDA. On others, OpenCL might be an issue. Furthermore, the compiler flags for non-GCC & non-NVCC builds have to be properly configured....

Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones. Let's say our hardware supports fast 16...

good first issue
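
The matrix-vector formulation in the issue above can be illustrated with a minimal pure-Python sketch. It is not the project's implementation: the function names `matvec` and `reduce_via_matvec` are hypothetical, and the tile `width` of 4 is an arbitrary illustration value, not the hardware tile size the issue has in mind. The input is padded with zeros, laid out as a matrix, and multiplied by an all-ones vector to get per-row partial sums; a second multiplication by an all-ones vector collapses those partial sums into the final scalar.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product: one dot product per row."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def reduce_via_matvec(values, width=4):
    """Sum `values` using two matrix-vector products with all-ones vectors.

    `width` is a hypothetical tile width chosen for illustration only.
    """
    if not values:
        return 0
    # Pad with zeros so the data fills a whole number of rows.
    padded = list(values) + [0] * (-len(values) % width)
    rows = [padded[i:i + width] for i in range(0, len(padded), width)]
    # Stage 1: (rows x width) times ones(width) gives one partial sum per row.
    partial = matvec(rows, [1] * width)
    # Stage 2: (1 x rows) times ones(rows) collapses the partial sums.
    return matvec([partial], [1] * len(partial))[0]


if __name__ == "__main__":
    print(reduce_via_matvec(list(range(10))))
```

On real hardware the two `matvec` calls would presumably map to matrix-multiplication instructions; here they only show the data layout and the two stages.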

It's hard to control the affinity of `std::thread` in conjunction with NUMA node-assignment, so we need to add a lower-level thread-pool implementation that directly uses the POSIX API.

enhancement
good first issue

The project currently provides multiple backends for x86 (SSE, AVX, and AVX-512) but none for Arm. NEON and SVE backends should be added.

good first issue

Introduces reduce_bench.py, a Python script to benchmark parallel reductions on NVIDIA GPUs using the cuda.cccl library, and updates the README with usage instructions and example output. This allows users to...

Now that [CCCL v3](https://github.com/NVIDIA/cccl/releases/tag/v3.0.0) can be used for [efficient parallel reductions in Python](https://developer.nvidia.com/blog/delivering-the-missing-building-blocks-for-nvidia-cuda-kernel-fusion-in-python), it would be great to create an additional benchmark file, `reduce_bench.py`, with Pythonic JIT-ed kernels for...

help wanted
good first issue

Thanks for sharing your benchmarks and [your thoughts on your blog](https://ashvardanian.com/posts/beyond-openmp-in-cpp-rust/). Another framework you may want to consider in Rust is [Paralight](https://docs.rs/paralight), which offers the best of both worlds: a...