ParallelReductionsBenchmark
Thrust, CUB, TBB, AVX2, CUDA, OpenCL, OpenMP, SYCL - all it takes to sum a lot of numbers fast!
On some architectures, we won't have access to CUDA. On others, OpenCL might be an issue. Furthermore, the compiler flags for non-GCC & non-NVCC builds have to be properly configured...
Array reductions can be represented as a two-stage pipeline built on top of matrix-vector multiplications, where the vector is made of all ones. Let's say our hardware supports fast 16...
It's hard to control the affinity of `std::thread` in conjunction with NUMA node-assignment, so we need to add a lower-level thread-pool implementation that directly uses the POSIX API.
The project currently provides multiple backends for x86: SSE, AVX, AVX-512; but none for Arm. NEON and SVE backends should be added.
Introduces reduce_bench.py, a Python script to benchmark parallel reductions on NVIDIA GPUs using the cuda.cccl library, and updates the README with usage instructions and example output. This allows users to...
Now that [CCCL v3](https://github.com/NVIDIA/cccl/releases/tag/v3.0.0) can be used for [efficient parallel reductions in Python](https://developer.nvidia.com/blog/delivering-the-missing-building-blocks-for-nvidia-cuda-kernel-fusion-in-python) it would be great to create an additional benchmark file - `reduce_bench.py` with Python-ic JIT-ed kernels for...
Thanks for sharing your benchmarks and [your thoughts on your blog](https://ashvardanian.com/posts/beyond-openmp-in-cpp-rust/). Another framework you may want to consider in Rust is [Paralight](https://docs.rs/paralight), which offers the best of both worlds: a...