tjoli
In fact, it happens while a single matrix is being processed. I noticed the same behavior with 0.2.20, which is why I upgraded. I infer the single-threaded behavior from looking at output...
I checked matrices of sizes 1800 and 3800. The problem appears with the LAPACK routine ZHEEV: if I request all eigenvectors (so ZHEEV('V'....)), it slows down to one thread while...
My "standard" code uses no LAPACK at all. It just fills matrices, a dumb test case. It does not test any LAPACK/BLAS functionality.
Calling LAPACK DSTEQR leads to the same problem when I want all eigenvectors of a real matrix: it turns single-threaded after a while. Tested on a matrix of order 5600.
So, `perf record` without any `NUM_THREADS` settings:

    30.43% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zlasr_
    19.90% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zhemv_U
    17.16% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_kernel_l
    16.65% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_incopy
    8.06% lapack.exe...
Now with `OPENBLAS_NUM_THREADS=1` and `OMP_NUM_THREADS=1`:

    81.34% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zlasr_
    7.77% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_kernel_r
    5.46% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zhemv_U
    3.19% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_kernel_l
    0.38% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_incopy...
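To line up the two ZHEEV profiles symbol by symbol, something like the following could help. It is only a sketch: it assumes each perf entry has the `percent command shared-object [.] symbol` layout seen above, the helper name `parse_perf` is made up for illustration, and only the complete entries quoted above are included (the truncated ones are left out).

```python
import re

# One perf report entry: "<pct>% <command> <shared object> [.] <symbol>"
# ([k] marks kernel symbols). This layout is assumed from the output above.
PERF_LINE = re.compile(r"(\d+\.\d+)%\s+(\S+)\s+(\S+)\s+\[[.k]\]\s+(\S+)")


def parse_perf(text):
    """Hypothetical helper: return {symbol: percent} for each entry in text."""
    return {m.group(4): float(m.group(1)) for m in PERF_LINE.finditer(text)}


# Complete entries copied from the run without thread settings.
default_run = parse_perf("""
30.43% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zlasr_
19.90% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zhemv_U
17.16% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_kernel_l
16.65% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_incopy
""")

# Complete entries copied from the run with both thread variables set to 1.
single_run = parse_perf("""
81.34% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zlasr_
7.77% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_kernel_r
5.46% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zhemv_U
3.19% lapack.exe libopenblas_nehalemp-r0.3.3.so [.] zgemm_kernel_l
""")

# Print both shares per symbol; "-" marks a symbol missing from one listing.
for symbol in sorted(set(default_run) | set(single_run)):
    a = default_run.get(symbol, "-")
    b = single_run.get(symbol, "-")
    print(f"{symbol:16} default={a!s:>6} single={b!s:>6}")
```

The shares are fractions of the samples in each run, so they show how the mix of routines changes, not how long each routine took.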
Similar behavior for DSTEQR. First without any settings:

    28.86% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dlasr_
    22.70% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dgemm_kernel
    18.92% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dgemm_incopy
    17.56% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dsymv_kernel_4x4
    7.62% su2lapack.exe...
Now DSTEQR with `OMP_NUM_THREADS=1` and `OPENBLAS_NUM_THREADS=1`:

    80.43% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dlasr_
    8.16% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dgemm_kernel
    5.84% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dsymv_kernel_4x4
    0.77% su2lapack.exe libopenblas_nehalemp-r0.3.3.so [.] dlartg_
    0.66% su2lapack.exe su2lapack.exe [.]...
Timings for DSTEQR:

- OMP and OpenBLAS threads = 1: 35 s -> DLASR 80%
- OMP and OpenBLAS threads = 6: 56 s -> DLASR 57%
- OMP and OpenBLAS threads = 12: 114 s -> DLASR 30%
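A quick back-of-the-envelope check on these DSTEQR numbers, as a sketch only. The slowdown figures follow directly from the timings. The per-run DLASR estimate relies on a loud assumption: that the perf sample share can be applied to wall-clock time, which is not strictly true when several threads are running. The function names are made up for illustration.

```python
# (threads, wall-clock seconds, DLASR share of perf samples), as reported above.
runs = [(1, 35.0, 0.80), (6, 56.0, 0.57), (12, 114.0, 0.30)]


def slowdown(runs):
    """Wall-clock time of each run relative to the first (single-threaded) run."""
    base = runs[0][1]
    return {threads: seconds / base for threads, seconds, _ in runs}


def dlasr_seconds(runs):
    """Rough estimate of time in DLASR.

    ASSUMPTION: the perf share is treated as a share of wall-clock time.
    perf counts samples across all threads, so this is only indicative.
    """
    return {threads: seconds * share for threads, seconds, share in runs}


for threads, factor in slowdown(runs).items():
    estimate = dlasr_seconds(runs)[threads]
    print(f"threads={threads:2d}  slowdown={factor:.2f}x  DLASR~{estimate:.1f}s")
```

Under that assumption the DLASR estimate stays in a similar range in all three runs (roughly 28 to 34 s) while the total time more than triples, which would suggest that the extra time is spent outside DLASR itself.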
With `OMP_NUM_THREADS=12`:

    48.95% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] zlasr_
    23.26% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] zhemv_U
    10.75% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] zgemm_kernel_r
    6.75% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so [.] zgemm_kernel_l
    5.52% lapack.exe [kernel.vmlinux] [k] entry_SYSCALL_64_fastpath
    1.74% lapack.exe libopenblas_nehalemp-r0.3.4.dev.so...