LU and eigen routines slower than MKL
Hi, I'm running some performance comparisons between OpenBLAS and MKL for LU and eigen routines. I see that the OpenBLAS tests with, for example, dgetrf and dsyevd are about 3 times slower than MKL. These are multi-threaded tests run on a Skylake machine.
I wonder if you have any benchmark results vs. MKL and, if so, what they look like?
Thanks.
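For reference, a minimal sketch of how such a dgetrf timing can be set up through the LAPACKE interface (illustrative only: the size, random fill and timing method here stand in for the actual test harness; the same source can be linked against either OpenBLAS or MKL):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

/* Time one multi-threaded dgetrf call on an n x n matrix.
 * Link against libopenblas or MKL to compare wall-clock times. */
int main(void)
{
    const lapack_int n = 10000;                 /* illustrative size */
    double *a = malloc((size_t)n * n * sizeof(double));
    lapack_int *ipiv = malloc((size_t)n * sizeof(lapack_int));
    if (!a || !ipiv) return 1;

    /* Fill with pseudo-random data so the factorization does real work. */
    srand(42);
    for (size_t i = 0; i < (size_t)n * n; i++)
        a[i] = (double)rand() / RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    lapack_int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, a, n, ipiv);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("dgetrf n=%d info=%d time=%.3f s\n", (int)n, (int)info, secs);

    free(a);
    free(ipiv);
    return 0;
}
```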
Which Skylake? What size of inputs? Which OpenBLAS version? Any virtualisation? Is DGEMM slower too? (i.e. does the latest release improve/fix any of this?)
Skylake (Haswell refresh without AVX512) or SkylakeX (with AVX512)? DGEMM performance for the latter should be about on par with MKL if you use a very recent 0.3.x release or the current develop branch. At small problem sizes MKL will probably be faster, as OpenBLAS only has a single threshold for switching from using a single thread to using all available cores, while MKL seems to increase the thread count more gradually to match the workload. (Also, MKL may be using a more efficient LAPACK than the "netlib" reference implementation.)
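If you want to rule out that threshold when comparing small sizes, you can pin the thread count explicitly. A minimal sketch, assuming the OpenBLAS-specific openblas_set_num_threads()/openblas_get_num_threads() calls declared in OpenBLAS's cblas.h (exporting OPENBLAS_NUM_THREADS in the environment has the same effect):

```c
#include <stdio.h>
#include <cblas.h>   /* OpenBLAS's cblas.h declares the openblas_* extensions */

int main(void)
{
    /* Pin the thread count up front so the internal single- vs.
     * multi-threaded switch-over threshold cannot skew small-size timings. */
    openblas_set_num_threads(8);
    printf("BLAS limited to %d threads\n", openblas_get_num_threads());
    return 0;
}
```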
Thanks. It is SkylakeX with AVX-512. Tried input sizes from m=n=5000 to m=n=15000. OpenBLAS 0.3.7 - this should have the latest improvements? I can create a plot and profile DGEMM.
0.3.7 (from a year ago) had all parts of the initial AVX512 DGEMM implementation disabled as it turned out to be incorrect. AVX512 DGEMM reappeared in 0.3.8 and was further improved in 0.3.10, so ideally you should be trying that (or git develop)
Good to know, thanks! I'll try 0.3.10 and report back.
There is some data in PR #2646 (and test code linked in #2286).
Well, it turns out that I was using 0.3.10. But I have some more observations, as shown in the plots below. The dgetrf tests (left panel) were run with a 5000x5000 matrix, and the dgemm tests (right panel) were run with two 5000x5000 matrices. Plotted is the inverse of the elapsed time (in 1/s), to better show the scalability.
- For dgemm, OpenBLAS and MKL have about the same performance for any given number of threads;
- For dgetrf, OpenBLAS is only slightly (8%) slower in the single-thread case, but its scalability is much worse with increasing numbers of threads.

I can run a profiler on dgetrf and find out which part doesn't scale.
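Roughly, the sweep behind those plots can be reproduced with something like the following sketch (illustrative only: a fixed 5000x5000 matrix, a handful of thread counts, and the OpenBLAS-specific openblas_set_num_threads() call; an MKL build would set the thread count through MKL's own API or environment instead):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <cblas.h>      /* for openblas_set_num_threads() */
#include <lapacke.h>

static double now(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

int main(void)
{
    const lapack_int n = 5000;                  /* matches the plotted size */
    double *a0 = malloc((size_t)n * n * sizeof(double));
    double *a  = malloc((size_t)n * n * sizeof(double));
    lapack_int *ipiv = malloc((size_t)n * sizeof(lapack_int));
    if (!a0 || !a || !ipiv) return 1;

    srand(1);
    for (size_t i = 0; i < (size_t)n * n; i++)
        a0[i] = (double)rand() / RAND_MAX;

    const int threads[] = {1, 2, 4, 8, 16, 32};
    for (size_t k = 0; k < sizeof(threads) / sizeof(threads[0]); k++) {
        openblas_set_num_threads(threads[k]);
        /* Re-use the same input matrix for every thread count. */
        memcpy(a, a0, (size_t)n * n * sizeof(double));
        double t0 = now();
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, a, n, ipiv);
        double dt = now() - t0;
        printf("threads=%2d  dgetrf  %.3f s  (%.3f 1/s)\n",
               threads[k], dt, 1.0 / dt);
    }

    free(a0); free(a); free(ipiv);
    return 0;
}
```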
Interesting, thanks - GETRF is one of the few LAPACK functions that are reimplemented (lapack/getrf/getrf_parallel.c, already in the original GotoBLAS) rather than copied from the reference implementation. There were some fixes to my previous heavy-handed approach to making it thread-safe in February, perhaps there is more wrong with it. Also the DTRSM it calls is not optimized for SkylakeX (and neither is LASWP, another reimplemented function).
Yes, I noticed that OpenBLAS DGETRF is much faster than the netlib implementation, but as we see here it is still not as fast as MKL.