
Feature request: within vector reductions

Open ojwoodford opened this issue 6 years ago • 6 comments

I could not find any mention of within-vector reductions, or equivalently of transposing blocks of vectors to make such reductions vectorizable.

If I am summing a long list of values (or doing a large dot product, or any other kind of reduction), I can efficiently reduce a block at a time into a SIMD vector. Then at the end I'll need to run the reduction (+, *, min, max, whatever) within the vector. Are there functions for this?
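For illustration, a minimal sketch of that pattern with the xsimd 7-era batch<T, N> API (the name block_sum and the alignment assumptions are just for this example):

#include <xsimd/xsimd.hpp>

// Sketch only: sum a long buffer by accumulating a full batch at a time,
// then reducing within the accumulator batch at the end.
// Assumes n is a multiple of the batch width and data is suitably aligned.
float block_sum(const float* data, std::size_t n)
{
    constexpr std::size_t W = xsimd::simd_traits<float>::size;
    using batch_type = xsimd::batch<float, W>;
    batch_type acc(0.f);
    for (std::size_t i = 0; i < n; i += W)
    {
        batch_type chunk;
        chunk.load_aligned(data + i);
        acc += chunk;            // block-wise (vertical) accumulation
    }
    return xsimd::hadd(acc);     // within-vector (horizontal) reduction
}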

Alternatively, if I am doing N identical reductions at a time, then I can transpose the blocks, then do the N reductions in parallel. The problem is that different types have different vector widths, so N might not be the same as the vector width. Ideally there would be a templated method which can take an array of N vectors (perhaps limited to powers of 2, >= 2), and do the reduction in parallel on them. Does this exist, or can you see value in implementing this?
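As a purely hypothetical interface (not something xsimd provides; reduce_each is a made-up name), the request could look like:

// Hypothetical, for illustration only: reduce each of N batches with op,
// in parallel, returning one scalar per input batch.
template <class T, std::size_t W, std::size_t N, class BinOp>
std::array<T, N> reduce_each(const std::array<xsimd::batch<T, W>, N>& rows, BinOp op);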

ojwoodford avatar Jun 28 '19 17:06 ojwoodford

So far we only have an implementation of the std::reduce interface (https://github.com/QuantStack/xsimd/blob/master/include/xsimd/stl/algorithms.hpp#L144)
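A minimal usage sketch of that algorithm, assuming the xsimd 7-era header layout and an aligned vector:

#include <vector>
#include <xsimd/xsimd.hpp>
#include <xsimd/stl/algorithms.hpp>

int main()
{
    // Aligned storage so the algorithm can use aligned loads internally.
    std::vector<float, xsimd::aligned_allocator<float, XSIMD_DEFAULT_ALIGNMENT>> v(1024, 1.f);

    // std::reduce-like interface: the vectorized body and the scalar
    // prologue/epilogue are handled inside the algorithm.
    float total = xsimd::reduce(v.data(), v.data() + v.size(), 0.f);
    (void)total;
}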

Furthermore, we implemented some of the ideas you mention in xtensor (https://github.com/QuantStack/xtensor), where we have a general framework for reducers. However, these reducers do not yet use explicit SIMD instructions (they rely on autovectorization instead). So this is definitely something we're interested in.

If you have concrete implementation ideas, we would be more than happy to review a PR in xsimd (or maybe even xtensor?).

wolfv avatar Jun 29 '19 06:06 wolfv

Thanks. The reduce algorithm implementation in xsimd is a useful baseline implementation.

A useful function would be:

// batch_type here stands for the architecture's batch of Scalar,
// e.g. xsimd::batch<Scalar, N> with N the instruction-set width.
template <typename Scalar>
std::array<Scalar, 4> batch_sum(batch_type a, batch_type b, batch_type c, batch_type d);

such that the first element of the output is the sum of elements in a, etc. You could have functions for 2, 4 and 8 inputs, and also equivalent batch_prod functions.

You could then have reducers which process 2, 4 or 8 rows at a time. In my case I'd use it to do a matrix multiply on 2x2 sub-blocks, efficiently summing up the dot products of the 4 row/column combinations in parallel at the end of each block computation.

I wouldn't know how to implement those methods in the various SIMD instruction sets, though.

ojwoodford avatar Jul 03 '19 00:07 ojwoodford

Actually, you already have the within-vector reductions I was asking about, at least for summing: hadd and haddp (https://github.com/QuantStack/xsimd/blob/dde445fe36a2bc218e170408a4c48dab40833e0f/include/xsimd/types/xsimd_sse_float.hpp#L637). However, I find there is a significant time penalty associated with keeping the batches in an array (required in order to pass haddp a pointer to that array) versus declaring them individually. This is compiling with clang on Mac.
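For reference, a small sketch of the two reductions mentioned here (xsimd 7-era API, shown for 4-wide float batches):

#include <xsimd/xsimd.hpp>

void horizontal_reductions()
{
    xsimd::batch<float, 4> a(1.f, 2.f, 3.f, 4.f);

    // hadd: reduce a single batch to a scalar sum (here 1 + 2 + 3 + 4 = 10).
    float s = xsimd::hadd(a);

    // haddp: takes a pointer to an array of 4 batches and returns one batch
    // whose i-th element is the horizontal sum of row[i].
    xsimd::batch<float, 4> row[4] = { a, a, a, a };
    xsimd::batch<float, 4> sums = xsimd::haddp(row);

    (void)s;
    (void)sums;
}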

ojwoodford avatar Jul 03 '19 06:07 ojwoodford

The thing is, the number of batches haddp accepts depends on your instruction set. If you want to write client code that is independent of it, you have to push the batches into a container; otherwise you need to rely on other techniques to avoid compilation errors:

template <size_t N>
batch<double, N> sum_buffer(const double* buf);

template <>
batch<double, 2> sum_buffer<2>(const double* buf)
{
    batch<double, 2> rows[2];
    rows[0].load_aligned(buf);
    rows[1].load_aligned(buf + 2);
    return haddp(rows);
}

template <>
batch<double, 4> sum_buffer<4>(const double* buf)
{
    batch<double, 4> rows[4];
    rows[0].load_aligned(buf);
    rows[1].load_aligned(buf + 4);
    rows[2].load_aligned(buf + 8);
    rows[3].load_aligned(buf + 12);
    return haddp(rows);
}

// As many specializations as supported batch sizes.
// buf is an aligned buffer of N doubles; we assume that
// N is a multiple of batch_size * batch_size.
batch<double, batch_size> res(0.);
for (size_t i = 0; i < N; i += batch_size * batch_size)
{
    res += sum_buffer<batch_size>(buf + i);
}

It's even worse if you want to write it for many value types: since partial specialization of function templates is illegal, you have to wrap the sum_buffer functions in a struct that you partially specialize.
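A minimal sketch of that work-around (sum_buffer_impl and apply are illustrative names, not xsimd API):

// Function templates cannot be partially specialized, so the per-size logic
// is delegated to a class template, which can be.
template <class T, std::size_t N>
struct sum_buffer_impl;

template <class T>
struct sum_buffer_impl<T, 2>
{
    static xsimd::batch<T, 2> apply(const T* buf)
    {
        xsimd::batch<T, 2> rows[2];
        rows[0].load_aligned(buf);
        rows[1].load_aligned(buf + 2);
        return xsimd::haddp(rows);
    }
};

// ... one partial specialization per supported batch size ...

template <class T, std::size_t N>
xsimd::batch<T, N> sum_buffer(const T* buf)
{
    return sum_buffer_impl<T, N>::apply(buf);
}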

JohanMabille avatar Jul 03 '19 08:07 JohanMabille

Yes, I see that. But for the algorithms I was envisaging, you'd know the number of batches, if not the batch size.

For now I'm just fixing on batch<ScalarType, 4>, and it works as desired.

ojwoodford avatar Jul 03 '19 17:07 ojwoodford

I still think there is a need for cross-batch additions for a fixed number of batches, regardless of the (instruction-set-specific) batch length. An example algorithm that requires this: say you are writing a matrix multiply function, and you want to compute the output matrix one 2x2 sub-block at a time. You can do this by accumulating the sums of row/column dot products into 4 batches, then reducing across the batches at the end. It would be nice to have an efficient way to do this reduction across 4 batches. Note that the number of batches here does not depend on the architecture of the CPU, but on the design of the algorithm. So it could be useful to have functions hadd2 and hadd4 (even hadd8), which reduce the respective number of batches.
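A sketch of that 2x2 sub-block accumulation (xsimd 7-era API; gemm_2x2_block is a made-up name, and B is assumed pre-transposed so its columns are contiguous):

#include <cstddef>
#include <xsimd/xsimd.hpp>

// Assumes k is a multiple of the batch width W.
template <std::size_t W>
void gemm_2x2_block(const float* a_row0, const float* a_row1,
                    const float* bt_col0, const float* bt_col1,
                    std::size_t k, float out[4])
{
    using batch_type = xsimd::batch<float, W>;
    batch_type acc00(0.f), acc01(0.f), acc10(0.f), acc11(0.f);
    for (std::size_t i = 0; i < k; i += W)
    {
        batch_type a0, a1, b0, b1;
        a0.load_unaligned(a_row0 + i);
        a1.load_unaligned(a_row1 + i);
        b0.load_unaligned(bt_col0 + i);
        b1.load_unaligned(bt_col1 + i);
        acc00 += a0 * b0;
        acc01 += a0 * b1;
        acc10 += a1 * b0;
        acc11 += a1 * b1;
    }
    // The reduction across a fixed number of batches (4 here) that the comment
    // asks for; done with four separate hadd calls in this sketch.
    out[0] = xsimd::hadd(acc00);
    out[1] = xsimd::hadd(acc01);
    out[2] = xsimd::hadd(acc10);
    out[3] = xsimd::hadd(acc11);
}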

ojwoodford avatar Nov 22 '19 17:11 ojwoodford

We now have an `xsimd::reduce` function that takes a functor and a batch and performs the actual reduction: https://xsimd.readthedocs.io/en/latest/api/reducer_index.html#_CPPv4I000E6reduce1TRR1FRK5batchI1T1AE

Specific reducers are available for `add`, `max` and `min`.
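A minimal usage sketch of this newer API (function names per the linked documentation; the data setup is just for illustration):

#include <cstddef>
#include <xsimd/xsimd.hpp>

int main()
{
    // Fill a batch from memory; with the default architecture the width is
    // whatever the target instruction set provides.
    alignas(64) float data[xsimd::batch<float>::size];
    for (std::size_t i = 0; i < xsimd::batch<float>::size; ++i)
        data[i] = float(i + 1);
    auto b = xsimd::batch<float>::load_aligned(data);

    // Generic within-batch reduction with a user functor operating on batches.
    float hmax = xsimd::reduce([](auto x, auto y) { return xsimd::max(x, y); }, b);

    // Dedicated reducers for the common cases.
    float hsum = xsimd::reduce_add(b);
    float hmin = xsimd::reduce_min(b);

    (void)hmax; (void)hsum; (void)hmin;
}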

serge-sans-paille avatar Mar 11 '23 20:03 serge-sans-paille