Extremely slow BLAS muladd when α,β are Int
Hi, perhaps I do not understand something about how ArrayLayouts is supposed to work and be used, but its BLAS-style matrix multiplication (both `default_blasmul!` and `tiled_blasmul!`) is very slow compared to 5-argument `LinearAlgebra.mul!`.
Below is a reproducible example.

```julia
using LinearAlgebra
using BenchmarkTools  # for @btime
import ArrayLayouts: default_blasmul!, tiled_blasmul!, tile_size

A, B, C = (rand(1000, 1000) for _ in 1:3)

# mul! is fast
mul!(copy(C), A, B, 1, 1) ≈ A*B + C  # true
@btime mul!(copy(C), A, B, 1, 1);
# 6.448 ms (4 allocations: 7.63 MiB)

# default_blasmul! is super slow!!
default_blasmul!(1, A, B, 1, copy(C)) ≈ A*B + C  # true
@btime default_blasmul!(1, A, B, 1, copy(C));
# 852.139 ms (3 allocations: 7.63 MiB)

# tiled_blasmul! is also very slow!
ts = tile_size(eltype(A), eltype(B), eltype(C))
tiled_blasmul!(ts, 1, A, B, 1, copy(C)) ≈ A*B + C  # true
@btime tiled_blasmul!(ts, 1, A, B, 1, copy(C));
# 402.555 ms (12 allocations: 7.66 MiB)
```
This of course also affects operations that use the `MulAdd` wrapper. What am I doing wrong? Or is it really a bug?
Version info

```
Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 9 5900HX with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
Environment:
  JULIA_PROJECT = @.
  JULIA_PKG_PRESERVE_TIERED_INSTALLED = true
  JULIA_REVISE = manual
```
Pkg status

```
  [dce04be8] ArgCheck v2.4.0
  [7d9fca2a] Arpack v0.5.4
  [4c555306] ArrayLayouts v1.11.0
  [86223c79] Graphs v1.12.0
⌃ [0b1a1467] KrylovKit v0.9.3
⌃ [5078a376] LazyArrays v2.4.0
  [2ab3a3ac] LogExpFunctions v0.3.29
  [5a0628fe] Supposition v0.3.5
  [37e2e46d] LinearAlgebra v1.11.0
  [2f01184e] SparseArrays v1.11.0
  [f489334b] StyledStrings v1.11.0
  [8dfed614] Test v1.11.0
```
Note that `mul!` uses native BLAS, whereas `default_blasmul!` and `tiled_blasmul!` are Julia code that was copied from LinearAlgebra.
Here is an indication of what's going on:

```julia
julia> @time mul!(C, A, B, 1, 1); # calls native BLAS
  0.021381 seconds (1 allocation: 32 bytes)

julia> @time muladd!(1.0, A, B, 1.0, C); # also calls native BLAS
  0.021018 seconds

julia> @time ArrayLayouts.tiled_blasmul!(ts, 1, A, B, 1, C); # slower Julia code
  0.307682 seconds (9 allocations: 30.609 KiB)

julia> @time muladd!(1, A, B, 1, C); # should call native BLAS but calls tiled_blasmul! because of the scalar types 🤦♂️
  0.315244 seconds (9 allocations: 30.609 KiB)
```
I see, thanks for the explanation! So, let me check whether I understand: this happens because I used α and β whose type differs from the eltype of the matrices, or is not Float32 or Float64, which are what BLAS supports?
Could you specify what the conditions are for `muladd!`, `Mul`, and `MulAdd` to dispatch correctly to BLAS?
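In the meantime, here is a sketch of a workaround consistent with the timings above (my own illustration, not an official recommendation): converting α and β to the matrices' element type before the call keeps `muladd!` on the native BLAS path.

```julia
using LinearAlgebra
using ArrayLayouts

A, B, C = (rand(1000, 1000) for _ in 1:3)

# Int scalars fall back to the slow generic Julia kernels:
# muladd!(1, A, B, 1, C)

# Converting the scalars to eltype(C) (here Float64) stays on BLAS:
α, β = convert(eltype(C), 1), convert(eltype(C), 1)
muladd!(α, A, B, β, C)
```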
I think we should promote the scalars to float in such cases. This is what LinearAlgebra does, and this will ensure that we hit the BLAS paths more frequently.
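A minimal sketch of what that promotion could look like (the wrapper name `promoted_muladd!` is hypothetical and purely illustrative; the actual fix would live inside the `MulAdd` machinery):

```julia
using LinearAlgebra
using ArrayLayouts

# Hypothetical wrapper: promote the scalars to floating point together
# with the element type, so that Int α/β still reach the BLAS path
# for Float64 matrices.
function promoted_muladd!(α, A, B, β, C)
    T = promote_type(eltype(C), typeof(float(α)), typeof(float(β)))
    return muladd!(convert(T, α), A, B, convert(T, β), C)
end

A, B, C = (rand(1000, 1000) for _ in 1:3)
promoted_muladd!(1, A, B, 1, C)  # scalars arrive as Float64
```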
I think this is a good idea - the current behavior is very unexpected and confusing, especially as it is not consistent with what LinearAlgebra does.
It should be an easy fix.