Thread overhead involving FFT/BLAS on dual sockets
Hi, sorry if the title is not crystal-clear. I ran into a problem when doing computations on a cluster:
using FFTW, BenchmarkTools, LinearAlgebra, Printf, Polyester, Random
println("Julia num threads: $(Threads.nthreads()), Total Sys CPUs: $(Sys.CPU_THREADS)")
println("FFT provider: $(FFTW.get_provider()), BLAS: $(BLAS.vendor())")
function ode_1(du, u, p, t)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end

function ode_2(du, u, p, t)
    v1, v2, plan, _, _ = p
    mul!(v1, plan, u)
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    ldiv!(v2, plan, v1)
end

function ode_3(du, u, p, t)
    _, _, _, K, w = p
    @batch for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
    mul!(w, K, vec(u))
end
begin
    N = 64
    n = (2N-1) * N
    Random.seed!(42)
    u = rand(2N-1, N)
    du = similar(u)
    v₁ = rand(ComplexF64, N, N)
    v₂ = rand(2N-1, N)
    K = rand(n, n)
    w = zeros(n)
    FFTW.set_num_threads(Threads.nthreads()) # match Julia's thread count
    plan = plan_rfft(du, 1; flags=FFTW.PATIENT)
    p = (v₁, v₂, plan, K, w)
    BLAS.set_num_threads(Threads.nthreads()) # match Julia's thread count
end
println("ODE-1: only element-wise (EW) ops")
@btime ode_1($du, $u, $p, 1.0)
println("ODE-2: FFT + EW + FFT")
@btime ode_2($du, $u, $p, 1.0)
println("ODE-3: EW + BLAS")
@btime ode_3($du, $u, $p, 1.0)
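To be explicit about the setup: as far as I understand it, there are three independent threading layers in play here, and only the first one is fixed by -t. A quick sketch of how I think of (and size) each of them (nothing beyond the calls already used above, plus BLAS.get_num_threads):

using FFTW, LinearAlgebra

# 1. Julia's own threads: fixed at startup via -t / JULIA_NUM_THREADS;
#    Polyester's @batch schedules its iterations onto these.
@show Threads.nthreads()

# 2. The BLAS library (OpenBLAS here) keeps its own pool, independent of -t.
@show BLAS.get_num_threads()
BLAS.set_num_threads(Threads.nthreads())

# 3. The FFTW provider (MKL here) keeps yet another pool.
FFTW.set_num_threads(Threads.nthreads())

I assume this stacking is why the CPU usage during ODE-2/ODE-3 exceeds the -t value.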
Running on my local computer (2 * Intel(R) Xeon(R) Gold 6136, 2 * 12 CPUs, hyperthreading enabled), performance scales well even with oversubscription (both Polyester and MKL handle that; I do observe that CPU usage is always larger than the specified thread count during ODE-2):
Good Results
~/codes » julia16 --check-bounds=no -O3 -t 6 ex2.jl pshi@discover
Julia num threads: 6, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
114.905 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
172.101 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
173.103 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 12 ex2.jl pshi@discover
Julia num threads: 12, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
56.885 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
106.648 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
106.777 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 24 ex2.jl pshi@discover
Julia num threads: 24, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
29.294 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
77.235 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
77.275 μs (2 allocations: 160 bytes)
----------------------------------------------------------------------------------------------------
~/codes » julia16 --check-bounds=no -O3 -t 48 ex2.jl pshi@discover
Julia num threads: 48, Total Sys CPUs: 48
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
28.303 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
76.601 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
77.470 μs (2 allocations: 160 bytes)
However, running on a remote computer cluster (2 * Intel(R) Xeon(R) Gold 5220, 2 * 18 CPUs, hyperthreading disabled), performance deteriorates as soon as Julia is started with more threads than a single socket has cores:
Bad Results
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
42.415 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
96.472 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
96.324 μs (2 allocations: 160 bytes)
19 CPU
Julia num threads: 19, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
40.662 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
92.047 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
92.357 μs (2 allocations: 160 bytes)
20 CPU
Julia num threads: 20, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
39.156 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
143.665 μs (2 allocations: 160 bytes)
ODE-3: EW + BLAS
148.203 μs (2 allocations: 160 bytes)
27 CPU
Julia num threads: 27, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
32.706 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
10.992 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
10.987 ms (2 allocations: 160 bytes) # oops!
36 CPU
Julia num threads: 36, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
25.268 μs (0 allocations: 0 bytes)
ODE-2: FFT + EW + FFT
12.047 ms (2 allocations: 160 bytes) # oops!
ODE-3: EW + BLAS
13.059 ms (2 allocations: 160 bytes) # oops!
Replacing every @batch with @threads removes the overhead, but it is not as fast (which is why I switched to Polyester in the first place):
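For reference, the replacement looks like this for the first kernel (the other two are changed the same way):

function ode_1(du, u, p, t)
    Threads.@threads for i ∈ eachindex(u)
        du[i] = sin(cos(tan(exp(log(u[i] + 1)))))
    end
end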
Results with `@threads`
18 CPU
Julia num threads: 18, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
97.260 μs (91 allocations: 8.09 KiB)
ODE-2: FFT + EW + FFT
182.220 μs (93 allocations: 8.25 KiB)
ODE-3: EW + BLAS
181.990 μs (93 allocations: 8.25 KiB)
19 CPU
Julia num threads: 19, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
88.000 μs (96 allocations: 8.55 KiB)
ODE-2: FFT + EW + FFT
185.106 μs (98 allocations: 8.70 KiB)
ODE-3: EW + BLAS
190.014 μs (99 allocations: 8.73 KiB)
20 CPU
Julia num threads: 20, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
94.474 μs (101 allocations: 8.98 KiB)
ODE-2: FFT + EW + FFT
184.741 μs (103 allocations: 9.14 KiB)
ODE-3: EW + BLAS
190.534 μs (104 allocations: 9.17 KiB)
27 CPU
Julia num threads: 27, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
107.443 μs (136 allocations: 12.11 KiB)
ODE-2: FFT + EW + FFT
173.397 μs (138 allocations: 12.27 KiB)
ODE-3: EW + BLAS
169.724 μs (138 allocations: 12.27 KiB)
36 CPU
Julia num threads: 36, Total Sys CPUs: 36
FFT provider: mkl, BLAS: openblas64
ODE-1: only element-wise (EW) ops
124.303 μs (181 allocations: 16.11 KiB)
ODE-2: FFT + EW + FFT
203.427 μs (183 allocations: 16.27 KiB)
ODE-3: EW + BLAS
209.808 μs (183 allocations: 16.27 KiB)
When I benchmark the FFT, the BLAS call, or the dummy element-wise operations separately on the remote cluster, there is no problem at all. But once they are mixed, things no longer behave as expected. With 18 CPUs the CPU usage is actually close to 3500%; with more than 18 CPUs it is capped at 3600%, and the overheads above show up. Could this be related to hyperthreading or to something else (like MKL)? Thank you!
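One thing I still want to rule out is thread placement across the two sockets. A minimal sketch of what I have in mind, using ThreadPinning.jl (assuming I am reading its API correctly; as far as I know this pins only Julia's own threads, not the MKL/OpenBLAS pools), run once before the benchmarks:

using ThreadPinning

pinthreads(:cores)   # pin Julia threads to distinct physical cores, in order
threadinfo()         # print which core each Julia thread ended up on

Starting Julia with JULIA_EXCLUSIVE=1 should, if I read the docs right, give a similarly compact affinity at startup, so comparing 18 pinned threads on one socket against 19+ spilling onto the second socket might show whether the socket boundary is what triggers the overhead.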