
Extremely slow BLAS muladd when α,β are Int

Open sztal opened this issue 1 year ago • 5 comments

Hi, perhaps I do not understand something about how ArrayLayouts is supposed to work and be used, but its BLAS-style matrix multiplication (both default_blasmul! and tiled_blasmul!) is very slow compared to 5-argument LinearAlgebra.mul!.

Below is a reproducible example.

using LinearAlgebra
using BenchmarkTools  # for @btime
import ArrayLayouts: default_blasmul!, tiled_blasmul!, tile_size

A, B, C = (rand(1000, 1000) for _ in 1:3)

# mul! is fast
mul!(copy(C), A, B, 1, 1) ≈ A*B + C  # true
@btime mul!(copy(C), A, B, 1, 1);
# 6.448 ms (4 allocations: 7.63 MiB)

# default_blasmul! is super slow!!
default_blasmul!(1, A, B, 1, copy(C)) ≈ A*B + C  # true
@btime default_blasmul!(1, A, B, 1, copy(C));
# 852.139 ms (3 allocations: 7.63 MiB)

# tiled_blasmul! is also very slow!
ts = tile_size(eltype(A), eltype(B), eltype(C))
tiled_blasmul!(ts, 1, A, B, 1, copy(C)) ≈ A*B + C  # true
@btime tiled_blasmul!(ts, 1, A, B, 1, copy(C));
# 402.555 ms (12 allocations: 7.66 MiB)

And this of course affects operations using the MulAdd wrapper. What am I doing wrong, or is it really a bug?

Version info

Julia Version 1.11.3
Commit d63adeda50d (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 9 5900HX with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
Environment:
  JULIA_PROJECT = @.
  JULIA_PKG_PRESERVE_TIERED_INSTALLED = true
  JULIA_REVISE = manual

PKG status

  [dce04be8] ArgCheck v2.4.0
  [7d9fca2a] Arpack v0.5.4
  [4c555306] ArrayLayouts v1.11.0
  [86223c79] Graphs v1.12.0
⌃ [0b1a1467] KrylovKit v0.9.3
⌃ [5078a376] LazyArrays v2.4.0
  [2ab3a3ac] LogExpFunctions v0.3.29
  [5a0628fe] Supposition v0.3.5
  [37e2e46d] LinearAlgebra v1.11.0
  [2f01184e] SparseArrays v1.11.0
  [f489334b] StyledStrings v1.11.0
  [8dfed614] Test v1.11.0

sztal avatar Feb 07 '25 17:02 sztal

Note that mul! uses native BLAS, whereas default_blasmul! and tiled_blasmul! are pure-Julia implementations that were copied from LinearAlgebra.

Here is an indication of what's going on:

julia> @time mul!(C, A, B, 1, 1); # calls native BLAS
  0.021381 seconds (1 allocation: 32 bytes)

julia> @time muladd!(1.0, A, B, 1.0, C); # also calls native BLAS
  0.021018 seconds

julia> @time ArrayLayouts.tiled_blasmul!(ts, 1, A, B, 1, C); # slower Julia code
  0.307682 seconds (9 allocations: 30.609 KiB)

julia> @time muladd!(1, A, B, 1, C); # should call native BLAS but dispatches to tiled_blasmul! because the scalars are Int 🤦‍♂️
  0.315244 seconds (9 allocations: 30.609 KiB)
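
The timings above suggest a simple workaround sketch, assuming the slow path is triggered purely by the Int scalars: pass α and β as floats matching the matrices' eltype, so the native-BLAS method of muladd! is selected.

```julia
using LinearAlgebra
using ArrayLayouts: muladd!

A, B, C = (rand(100, 100) for _ in 1:3)

# Float64 scalars (matching eltype(A)) keep muladd! on the native BLAS
# path; Int scalars would fall back to the slow pure-Julia tiled code.
muladd!(1.0, A, B, 1.0, copy(C)) ≈ A * B + C  # true
```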

dlfivefifty avatar Feb 08 '25 13:02 dlfivefifty

I see, thanks for the explanation! So, let me check whether I understand: this happens because I used α and β of a type other than the eltype of the matrices, i.e. not one of the Float32/Float64 types that BLAS supports?

Could you specify the conditions under which muladd!, Mul, and MulAdd dispatch correctly to BLAS?

sztal avatar Feb 08 '25 16:02 sztal

I think we should promote the scalars to float in such cases. This is what LinearAlgebra does, and this will ensure that we hit the BLAS paths more frequently.
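
A minimal sketch of what such a promotion could look like at the call site (the wrapper name muladd_promoted! is hypothetical, not part of ArrayLayouts; the eventual fix would presumably live inside the library's dispatch):

```julia
using LinearAlgebra
using ArrayLayouts: muladd!

# Hypothetical convenience wrapper: promote α and β to the common
# element type, so Float64 matrices with Int scalars still reach the
# native BLAS path instead of the slow pure-Julia fallback.
function muladd_promoted!(α, A, B, β, C)
    T = promote_type(typeof(α), typeof(β), eltype(A), eltype(B), eltype(C))
    muladd!(convert(T, α), A, B, convert(T, β), C)
end

A, B, C = (rand(100, 100) for _ in 1:3)
muladd_promoted!(1, A, B, 1, C)  # scalars become 1.0 → BLAS path
```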

jishnub avatar Feb 08 '25 17:02 jishnub

I think this is a good idea - the current behavior is very unexpected and confusing, especially as it is not consistent with what LinearAlgebra does.

sztal avatar Feb 08 '25 17:02 sztal

It should be an easy fix

dlfivefifty avatar Feb 10 '25 10:02 dlfivefifty