Thread overhead involving FFT/BLAS on dual sockets
Hi, sorry if the title is not crystal-clear. I ran into a problem when doing computations on a cluster:
using FFTW, BenchmarkTools, LinearAlgebra, Printf, Polyester, Random
println("Julia num threads: $(Threads.nthreads()), Total Sys CPUs: $(Sys.CPU_THREADS)")
println("FFT provider: $(FFTW.get_provider()), BLAS: $(BLAS.vendor())")
function ode_1(du, u, p, t)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end

function ode_2(du, u, p, t)
    v1, v2, plan, _, _ = p
    mul!(v1, plan, u)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    ldiv!(v2, plan, v1)
end

function ode_3(du, u, p, t)
    _, _, _, K, w = p
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    mul!(w, K, vec(u))
end
begin
    N = 64
    n = (2N-1) * N
    Random.seed!(42)
    u = rand(2N-1, N)
    du = similar(u)
    v₁ = rand(ComplexF64, N, N)
    v₂ = rand(2N-1, N)
    K = rand(n, n)
    w = zeros(n)
    FFTW.set_num_threads(Threads.nthreads()) # match Julia's thread count
    plan = plan_rfft(du, 1; flags=FFTW.PATIENT)
    p = (v₁, v₂, plan, K, w)
    BLAS.set_num_threads(Threads.nthreads()) # match Julia's thread count
end
println("ODE-1: only element-wise (EW) ops")
@btime ode_1($du, $u, $p, 1.0)
println("ODE-2: FFT + EW + FFT")
@btime ode_2($du, $u, $p, 1.0)
println("ODE-3: EW + BLAS")
@btime ode_3($du, $u, $p, 1.0)
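To be explicit about the setup: as far as I understand it, there are three independent threading layers in play here, and only the first one is fixed by -t. A quick sketch of how I think of (and size) each of them (nothing beyond the calls already used above, plus BLAS.get_num_threads):

using FFTW, LinearAlgebra

# 1. Julia's own threads: fixed at startup via -t / JULIA_NUM_THREADS;
#    Polyester's @batch schedules its iterations onto these.
@show Threads.nthreads()

# 2. The BLAS library (OpenBLAS here) keeps its own pool, independent of -t.
@show BLAS.get_num_threads()
BLAS.set_num_threads(Threads.nthreads())

# 3. The FFTW provider (MKL here) keeps yet another pool.
FFTW.set_num_threads(Threads.nthreads())

I assume this stacking is why the CPU usage during ODE-2/ODE-3 exceeds the -t value.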
Running on my local computer (2 * Intel(R) Xeon(R) Gold 6136, 2 * 12 CPUs, hyperthreading enabled), performance scales well even with oversubscription (both Polyester and MKL handle that; I do observe that CPU usage is always larger than the specified thread count during ODE-2):
Good Results
~/codes » julia16 --check-bounds=no -O3 -t 6 ex2.jl pshi@discover
Julia num threads: 6, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
114.905 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
172.101 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
173.103 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 12 ex2.jl pshi@discover
Julia num threads: 12, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
56.885 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
106.648 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
106.777 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 24 ex2.jl pshi@discover
Julia num threads: 24, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
29.294 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
77.235 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
77.275 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 48 ex2.jl pshi@discover
Julia num threads: 48, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
28.303 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
76.601 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
77.470 μs (2 allocations: 160 bytes)
However, running on a remote computer cluster (2 * Intel(R) Xeon(R) Gold 5220, 2 * 18 CPUs, hyperthreading disabled), performance deteriorates as soon as Julia is started with more threads than a single socket has cores:
Bad Results
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
42.415 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
96.472 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
96.324 μs (2 allocations: 160 bytes)
19 CPU
Julia num threads: 19, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
40.662 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
92.047 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
92.357 μs (2 allocations: 160 bytes)
20 CPU
Julia num threads: 20, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
39.156 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
143.665 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
148.203 μs (2 allocations: 160 bytes)
27 CPU
Julia num threads: 27, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
32.706 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
10.992 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
10.987 ms (2 allocations: 160 bytes) # oops!
36 CPU
Julia num threads: 36, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
25.268 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
12.047 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
13.059 ms (2 allocations: 160 bytes) # oops!
Replacing every @batch with @threads removes the overhead, but it is not as fast (which is why I switched to Polyester in the first place):
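For reference, the replacement looks like this for the first kernel (the other two are changed the same way):

function ode_1(du, u, p, t)
    Threads.@threads for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end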
Results with `@threads`
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
97.260 μs (91 allocations: 8.09 KiB)
ODE-2: FFT + EW + FFT
182.220 μs (93 allocations: 8.25 KiB)
ODE-3: EW + BLAS
181.990 μs (93 allocations: 8.25 KiB)
19 CPU
Julia num threads: 19, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
88.000 μs (96 allocations: 8.55 KiB)
ODE-2: FFT + EW + FFT
185.106 μs (98 allocations: 8.70 KiB)
ODE-3: EW + BLAS
190.014 μs (99 allocations: 8.73 KiB)
20 CPU
Julia num threads: 20, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
94.474 μs (101 allocations: 8.98 KiB)
ODE-2: FFT + EW + FFT
184.741 μs (103 allocations: 9.14 KiB)
ODE-3: EW + BLAS
190.534 μs (104 allocations: 9.17 KiB)
27 CPU
Julia num threads: 27, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
107.443 μs (136 allocations: 12.11 KiB)
ODE-2: FFT + EW + FFT
173.397 μs (138 allocations: 12.27 KiB)
ODE-3: EW + BLAS
169.724 μs (138 allocations: 12.27 KiB)
36 CPU
Julia num threads: 36, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
124.303 μs (181 allocations: 16.11 KiB)
ODE-2: FFT + EW + FFT
203.427 μs (183 allocations: 16.27 KiB)
ODE-3: EW + BLAS
209.808 μs (183 allocations: 16.27 KiB)
When I benchmark the FFT, the BLAS call, or the dummy element-wise operations separately on the remote cluster, there is no problem at all. But once they are mixed, things no longer behave as expected. With 18 CPUs the CPU usage is actually close to 3500%; with more than 18 CPUs it is capped at 3600%, and the overheads above show up. Could this be related to hyperthreading or to something else (like MKL)? Thank you!
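One thing I still want to rule out is thread placement across the two sockets. A minimal sketch of what I have in mind, using ThreadPinning.jl (assuming I am reading its API correctly; as far as I know this pins only Julia's own threads, not the MKL/OpenBLAS pools), run once before the benchmarks:

using ThreadPinning

pinthreads(:cores)   # pin Julia threads to distinct physical cores, in order
threadinfo()         # print which core each Julia thread ended up on

Starting Julia with JULIA_EXCLUSIVE=1 should, if I read the docs right, give a similarly compact affinity at startup, so comparing 18 pinned threads on one socket against 19+ spilling onto the second socket might show whether the socket boundary is what triggers the overhead.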