sterrettm2
This patch adds support for descending kv-sort and ascending/descending kv-select and kv-partial_sort. For reference, some benchmarks comparing to PyTorch's scalar implementation are provided: With normally distributed float32: ``` Partial Sort...
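To make the semantics of a key-value partial sort concrete, here is a minimal scalar sketch in C++. It is not the patch's SIMD code or API: the name `kv_partial_sort_reference` and its signature are hypothetical, and it assumes a partial sort means "the first k pairs end up as the k smallest (or largest) keys in order, with each value following its key".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical scalar reference, not the library's API. After the call, the
// first k key/value pairs hold the k smallest keys in ascending order (or the
// k largest in descending order when `descending` is true). The order of the
// remaining pairs is unspecified.
template <typename K, typename V>
void kv_partial_sort_reference(std::vector<K>& keys, std::vector<V>& values,
                               std::size_t k, bool descending = false)
{
    assert(keys.size() == values.size());
    const std::size_t n = keys.size();
    k = std::min(k, n);

    // Sort an index permutation instead of the data, so that each value
    // travels with its key.
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    auto cmp = [&](std::size_t a, std::size_t b) {
        return descending ? keys[b] < keys[a] : keys[a] < keys[b];
    };
    std::partial_sort(idx.begin(),
                      idx.begin() + static_cast<std::ptrdiff_t>(k),
                      idx.end(), cmp);

    // Apply the permutation to both arrays.
    std::vector<K> sorted_keys(n);
    std::vector<V> sorted_values(n);
    for (std::size_t i = 0; i < n; ++i) {
        sorted_keys[i] = keys[idx[i]];
        sorted_values[i] = values[idx[i]];
    }
    keys = std::move(sorted_keys);
    values = std::move(sorted_values);
}
```

A scalar reference like this can serve as a correctness check for a vectorized version, since only the first k pairs need to match.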
This patch rewrites all of the single-vector sorting and bitonic merging to use swizzle ops and generic masks to reduce code duplication. It also centralizes all of this logic...
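For readers unfamiliar with bitonic merging, the following scalar C++ sketch shows the compare-exchange pattern of each merge stage. It is only an illustration: `bitonic_merge_scalar` is a hypothetical helper, and the patch itself performs the equivalent steps on SIMD registers with swizzles and masks rather than with a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Scalar sketch of a bitonic merge (hypothetical helper, not from the patch).
// Precondition: data.size() is a power of two and data is a bitonic sequence,
// e.g. an ascending run followed by a descending run.
// Postcondition: data is sorted in ascending order.
template <typename T>
void bitonic_merge_scalar(std::vector<T>& data)
{
    const std::size_t n = data.size();
    assert((n & (n - 1)) == 0); // size must be a power of two (or zero)

    // Each stage halves the distance between compared elements. In a SIMD
    // version, one stage would roughly correspond to one swizzle plus a
    // masked min/max.
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < n; ++i) {
            // Only the lower index of each pair performs the compare-exchange.
            if ((i & stride) == 0 && data[i + stride] < data[i]) {
                std::swap(data[i], data[i + stride]);
            }
        }
    }
}
```

Because the set of compared index pairs depends only on the stage and not on the data, the pattern can be expressed with a fixed permutation and mask per stage.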
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available. For contiguous data, this can be over a 10x speedup for...
Fixes the bug with nested OpenMP by adding #pragma omp taskwait. Changes the task_threshold when OpenMP is enabled but parallelization isn't chosen from 0 to the max value for arrsize_t;...
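As a rough illustration of the two ideas mentioned here, the C++ sketch below uses a task-based recursive sort. It is not the patch's code: `sort_range` and `sort_serial` are hypothetical names, `std::size_t` stands in for `arrsize_t`, and the partition is a plain Hoare scheme. The `taskwait` makes the spawning code wait for its child tasks before returning, and passing the maximum threshold means no range is ever large enough to spawn a task.

```cpp
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical task-based quicksort over arr[left..right] (inclusive).
// Ranges longer than task_threshold are sorted in child tasks.
template <typename T>
void sort_range(T* arr, std::ptrdiff_t left, std::ptrdiff_t right,
                std::size_t task_threshold)
{
    if (left >= right) return;

    // Hoare-style partition around the middle element.
    const T pivot = arr[left + (right - left) / 2];
    std::ptrdiff_t i = left, j = right;
    while (i <= j) {
        while (arr[i] < pivot) ++i;
        while (pivot < arr[j]) --j;
        if (i <= j) {
            std::swap(arr[i], arr[j]);
            ++i;
            --j;
        }
    }

    if (static_cast<std::size_t>(right - left) > task_threshold) {
        #pragma omp task firstprivate(arr, left, j, task_threshold)
        sort_range(arr, left, j, task_threshold);
        #pragma omp task firstprivate(arr, i, right, task_threshold)
        sort_range(arr, i, right, task_threshold);
        // Wait for both child tasks so the range is fully sorted on return.
        #pragma omp taskwait
    } else {
        sort_range(arr, left, j, task_threshold);
        sort_range(arr, i, right, task_threshold);
    }
}

// With the threshold at its maximum value, the task branch is never taken,
// so the sort runs serially even when compiled with OpenMP enabled.
template <typename T>
void sort_serial(std::vector<T>& v)
{
    if (v.empty()) return;
    sort_range(v.data(), 0, static_cast<std::ptrdiff_t>(v.size()) - 1,
               std::numeric_limits<std::size_t>::max());
}
```

When compiled without OpenMP support the pragmas are ignored and both branches behave identically, so the sketch still produces a sorted array.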
This is another patch demonstrating how the current NumPy SIMD code could be converted to Highway, similar to #25781. All tests pass on my local AVX512 and AVX2 machine. On...
**Summary** I have been wanting to provide a contribution to oneDAL, but I have not been able to figure out how to get incremental builds working under either Bazel or make. The...
This patch enables the non-avx512fp16 _Float16 sorting to be used by the dynamic dispatch logic and integrates it better into the static dispatch logic. It is vastly faster...
This patch tries to improve the type errors produced when building with the static functions. It should make it clearer when typing is the issue. Here is a...