Performance of mixed-type (matrix-) multiplication
Hi,
I noticed that mixed-type multiplication (e.g. *(::SMatrix, ::Matrix)) is considerably slower than the equivalent Matrix-Matrix multiplication. I searched the issues but didn't find one that mentions this.
For example:
julia> using StaticArrays, BenchmarkTools
julia> SA = @SMatrix randn(3,3);
julia> A = Matrix(SA);
julia> B = randn(3, 1_000);
julia> @btime $SA * $B;
14.455 μs (5 allocations: 54.23 KiB)
julia> @btime $A * $B;
2.121 μs (2 allocations: 23.48 KiB)
This is because *(::SMatrix, ::Matrix) doesn't hit BLAS and instead goes via LinearAlgebra.generic_matmul!().
This case doesn't seem to be covered by the benchmarks (unless I missed it), so I'm wondering whether it would be in scope for this package to address it.
If not, I think it should at least be mentioned somewhere.
Note that converting to a standard Array and then doing a Matrix-Matrix multiplication is much faster than the SMatrix-Matrix multiplication (because it hits BLAS), and it allocates less:
julia> @btime Matrix($SA) * $B;
2.062 μs (3 allocations: 23.61 KiB)
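Until this is addressed in the package, the conversion above can be wrapped in a small helper. This is only a minimal sketch: `dense_mul` is a hypothetical name, not part of StaticArrays.jl, and it assumes the right-hand side is a plain `Matrix`.

```julia
using StaticArrays, LinearAlgebra

# Hypothetical helper (not StaticArrays.jl API): convert the static operand
# to a plain Matrix first, so the product is an ordinary Matrix-Matrix product.
dense_mul(SA::StaticMatrix, B::Matrix) = Matrix(SA) * B

SA = @SMatrix randn(3, 3)
B = randn(3, 1_000)
C = dense_mul(SA, B)   # plain Matrix of size (3, 1000)
```

The result is a plain `Matrix`, same as for `Matrix(SA) * B` in the timing above.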
Yes, I think dispatching mixed-type multiplication to BLAS would be a good idea and fits the scope of StaticArrays.jl. I can review a PR that adds that. Mixed calls could even return a HybridArray (from HybridArrays.jl), but I think that would be out of scope for StaticArrays.jl.
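For illustration, the dispatch could be shaped roughly like this. It is a sketch under assumptions, not the actual StaticArrays.jl implementation: `mixed_mul` is a hypothetical name (it deliberately does not overload `*`), and it simply assumes that BLAS-compatible element types should take the dense path while everything else keeps the current behaviour.

```julia
using StaticArrays, LinearAlgebra
using LinearAlgebra: BlasFloat

# Hypothetical function, not StaticArrays.jl API.
# BLAS-compatible element types (Float32/64, ComplexF32/64): convert the
# static operand to a Matrix and use the dense Matrix-Matrix product.
mixed_mul(SA::StaticMatrix{M,K,T}, B::StridedMatrix{T}) where {M,K,T<:BlasFloat} =
    Matrix(SA) * B

# Any other element type: fall back to the existing mixed-type product.
mixed_mul(SA::StaticMatrix, B::AbstractMatrix) = SA * B

SA = @SMatrix randn(3, 3)
B = randn(3, 1_000)
C = mixed_mul(SA, B)        # Float64, takes the first method

SI = @SMatrix [1 2; 3 4]
BI = [1 2 3; 4 5 6]
CI = mixed_mul(SI, BI)      # Int, takes the fallback method
```

A real PR would presumably add methods to `*` (and probably `mul!`) inside the package instead, and would need to decide what array type to return.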