
Don't force loop unrolling in reductions

Open • KristofferC opened this pull request Sep 13 '18 • 11 comments

Instead of unrolling the whole loop in the reduction, we can just keep the loop and let LLVM do its job with loop unrolling (since the number of loop iterations is known).
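
For illustration, a minimal sketch of the idea (not the PR's actual implementation; loop_sum is just an illustrative name): keep a plain loop whose trip count is a compile-time constant and let LLVM decide how much to unroll and vectorize.

using StaticArrays

# Sketch only: a reduction written as a plain loop. For an SArray the value of
# length(a) is known from the type, so LLVM sees a constant trip count and can
# unroll/vectorize as it sees fit.
@inline function loop_sum(a::StaticArray)
    s = zero(eltype(a))
    @inbounds @simd for i in 1:length(a)
        s += a[i]
    end
    return s
end

loop_sum(rand(SMatrix{8,8}))  # agrees with sum up to floating-point reassociation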

Calling the inner _mapreduce function directly, we can see that for larger matrices this gives a good speedup:

f(s) = StaticArrays._mapreduce(identity, +, Colon(), NamedTuple(), StaticArrays.same_size(s), s)
s = rand(SMatrix{8,8})

After PR:

julia> @btime f($s)
  10.179 ns (0 allocations: 0 bytes)

Before PR:

julia> @btime f($s)
  30.229 ns (0 allocations: 0 bytes)

The reason for this is that the loop vectorizer has a much easier time with the loop than the SLP vectorizer has with the unrolled code.
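
One way to check this locally (just a suggestion, assuming the f and s defined above) is to inspect the generated IR:

using InteractiveUtils  # for @code_llvm outside the REPL

# With the loop kept intact, the loop vectorizer can emit wide vector
# instructions (e.g. on <4 x double>); with fully unrolled code, the SLP
# vectorizer has to reassemble the scalar adds itself.
@code_llvm f(s)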

Now, why am I calling the inner function directly? Well, for some reason, Julia refuses to inline this new function. Instead, we get:

julia> @code_typed sum(s)
CodeInfo(
181 1 ─ %1 = StaticArrays.:+::typeof(+)                                               │╻  #sum#255
    │   %2 = invoke #kw##mapreduce()($(QuoteNode((dims = Colon(),)))::NamedTuple{(:dims,),Tuple{Colon}}, StaticArrays.mapreduce::typeof(mapreduce), StaticArrays.identity::Function, %1::Function, _2::SArray{Tuple{8,8},Float64,2,64})::Float64
    └──      return %2                                                                │
) => Float64

and the result is not pretty:

julia> @btime sum($s)
  121.209 ns (3 allocations: 1.08 KiB)

But if we can fix that inlining problem, this is probably worth doing.

KristofferC • Sep 13 '18 17:09

Tests fail because reductions now use SIMD and the tests check ==.

KristofferC • Sep 13 '18 17:09

> Tests fail because reductions now use SIMD and the tests check ==.

Happy to switch to isapprox with an appropriate tolerance.
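
Something along these lines, presumably (the exact tolerance below is a placeholder, not a prescription):

using Test, StaticArrays

s = rand(SMatrix{8,8})
# SIMD reductions may reassociate floating-point additions, so compare against
# the Base result with a relative tolerance rather than exact equality.
@test sum(s) ≈ sum(Matrix(s)) rtol=sqrt(eps(Float64))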

andyferris • Sep 17 '18 23:09

Inlining is fixed by https://github.com/JuliaLang/julia/pull/29258 so now there is a reason to finish this up :)

KristofferC • Sep 18 '18 21:09

Updated.

Some benchmarks, run on https://github.com/JuliaLang/julia/pull/29258:

using BenchmarkTools
using StaticArrays
x = rand(MMatrix{8,8}); s = rand(SMatrix{8,8});

Before

julia> @btime map!(x -> x*2, $x, $s);
  11.056 ns (0 allocations: 0 bytes)

julia> @btime sum($s);
  29.363 ns (0 allocations: 0 bytes)

julia> @btime sum(abs2, $s);
  124.234 ns (3 allocations: 1.08 KiB)

julia> @btime mapreduce(x->x^2, +, $s; init=0.5);
  134.850 ns (4 allocations: 1.09 KiB)

After

julia> @btime map!(x -> x*2, $x, $s);
  12.783 ns (0 allocations: 0 bytes)

julia> @btime sum($s);
  12.906 ns (0 allocations: 0 bytes)

julia> @btime sum(abs2, $s);
  12.283 ns (0 allocations: 0 bytes)

julia> @btime mapreduce(x->x^2, +, $s; init=0.5);
  12.399 ns (0 allocations: 0 bytes)

KristofferC • Sep 18 '18 21:09

Arguably the old methods should be left and we should @static if based on the VERSION... Thoughts?
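
That would look roughly like the pattern below; the version bound is a placeholder (it would need to be whichever julia build first contains JuliaLang/julia#29258), and the file names are purely illustrative:

# Sketch of version-gating between the two implementations; v"1.1.0-DEV.0" is a
# placeholder cutoff, and both file names are hypothetical.
@static if VERSION >= v"1.1.0-DEV.0"
    include("mapreduce_loop.jl")      # new loop-based reductions
else
    include("mapreduce_unrolled.jl")  # old fully unrolled reductions
end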

KristofferC • Sep 18 '18 21:09


Coverage decreased (-0.07%) to 94.923% when pulling 5eb628a59f13ae11e63a85327ba11c627ca066c6 on KristofferC:kc/loop into 81dd6cab51aff389b7a4f2dd0208a8c47f12b352 on JuliaArrays:master.

coveralls • Sep 18 '18 22:09

One interesting thing about this: the optimization relies on the assumption that StaticArrays have a dense (or at least SIMD-fetchable) layout. I wonder how this works on things like the computed Rotations matrices? (Not that you'd ever want to do any of these operations on those!) I really wish we had traits about layout...

Do you have benchmarks for small vectors and matrices? It's important (to me at least) that we don't slow down any sums or norms or dot products of 2-vectors or 3-vectors, or operations involving 2x2 or 3x3 matrices.

andyferris • Sep 20 '18 03:09

> Do you have benchmarks for small vectors and matrices? It's important (to me at least) that we don't slow down any sums or norms or dot products of 2-vectors or 3-vectors, or operations involving 2x2 or 3x3 matrices.

I was also wondering about this.

tkoolen • Sep 20 '18 21:09

using BenchmarkTools
using StaticArrays
using LinearAlgebra
for siz in (1,2,3,4,8)
    println("size = $siz x $siz")
    # Refs to avoid inlining into benchmark loop
    s = Ref(rand(SMatrix{siz, siz}))
    @btime sum(abs2, $s[]);
    @btime sum($s[]);
    @btime mapreduce(x->x^2, +, $s[]; init=0.5);
    @btime dot($s[], $s[])
    @btime norm($s[], 1)
    @btime norm($s[], 2)
    @btime norm($s[], 5)
    @btime det($s[])
end
PR | Master
size = 1 x 1 | size = 1 x 1
1.596 ns | 1.503 ns
1.503 ns | 1.503 ns
1.792 ns | 1.909 ns
1.596 ns | 1.504 ns
1.503 ns | 1.503 ns
4.155 ns | 4.155 ns
59.980 ns | 60.205 ns
1.503 ns | 1.596 ns
size = 2 x 2 | size = 2 x 2
1.707 ns | 1.795 ns
1.712 ns | 1.504 ns
2.037 ns | 1.803 ns
1.803 ns | 2.089 ns
2.047 ns | 2.087 ns
4.715 ns | 4.157 ns
125.824 ns | 124.740 ns
1.503 ns | 1.597 ns
size = 3 x 3 | size = 3 x 3
3.709 ns | 2.373 ns
2.375 ns | 2.693 ns
2.034 ns | 2.539 ns
2.710 ns | 2.979 ns
2.980 ns | 3.073 ns
3.055 ns | 2.988 ns
4.716 ns | 4.163 ns
233.326 ns | 231.020 ns
2.391 ns | 2.393 ns
size = 4 x 4 | size = 4 x 4
2.712 ns | 5.417 ns
2.044 ns | 4.458 ns
3.038 ns | 6.229 ns
6.893 ns | 6.899 ns
2.709 ns | 5.494 ns
4.718 ns | 6.965 ns
385.665 ns | 382.798 ns
14.572 ns | 14.579 ns
size = 8 x 8 | size = 8 x 8
41.797 ns | 132.093 ns (3 allocations: 1.08 KiB)
29.766 ns | 29.393 ns
14.865 ns | 144.587 ns (4 allocations: 1.09 KiB)
47.807 ns | 73.454 ns
30.111 ns | 33.949 ns
48.478 ns | 37.048 ns
1.651 μs | 1.421 μs
351.538 ns | 355.686 ns


Doesn't seem to be an improvement in all cases; will look at it more closely.

KristofferC • Sep 20 '18 21:09

Removed the map! loop since it seems LLVM decided not to unroll it, even though it would perhaps have been advantageous to do so.

KristofferC • Sep 21 '18 18:09

> Removed the map! loop since it seems LLVM decided not to unroll it, even though it would perhaps have been advantageous to do so.

Could you try un-reverting that change and see if it is better now? Perhaps use @simd ivdep to indicate non-aliasing (via the ivdep)?
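
For reference, the kind of loop being suggested might look like the sketch below (map2! is just an illustrative name; ivdep asserts that the destination and source do not alias):

using StaticArrays

# Sketch: a map! kernel with @simd ivdep promising LLVM there is no aliasing
# between the destination and the source.
function map2!(f, out::MArray, a::StaticArray)
    @inbounds @simd ivdep for i in 1:length(a)
        out[i] = f(a[i])
    end
    return out
end

x = rand(MMatrix{8,8}); s = rand(SMatrix{8,8})
map2!(y -> 2y, x, s)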

I would like to see this get merged.

chriselrod • Mar 25 '23 21:03