LU and eigen routines slower than MKL
Hi, I'm running some performance comparisons between OpenBLAS and MKL for LU and eigen routines. I see that the OpenBLAS tests with, for example, dgetrf and dsyevd are about 3 times slower than MKL. These are multi-threaded tests run on a Skylake machine.
I wonder if you have any benchmark results vs. MKL and, if so, what they look like?
Thanks.
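For reference, a minimal sketch of how such a dgetrf timing can be set up through the LAPACKE interface (illustrative only: the size, random fill and timing method here stand in for the actual test harness; the same source can be linked against either OpenBLAS or MKL):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

/* Time one multi-threaded dgetrf call on an n x n matrix.
 * Link against libopenblas or MKL to compare wall-clock times. */
int main(void)
{
    const lapack_int n = 10000;                 /* illustrative size */
    double *a = malloc((size_t)n * n * sizeof(double));
    lapack_int *ipiv = malloc((size_t)n * sizeof(lapack_int));
    if (!a || !ipiv) return 1;

    /* Fill with pseudo-random data so the factorization does real work. */
    srand(42);
    for (size_t i = 0; i < (size_t)n * n; i++)
        a[i] = (double)rand() / RAND_MAX;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    lapack_int info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, a, n, ipiv);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("dgetrf n=%d info=%d time=%.3f s\n", (int)n, (int)info, secs);

    free(a);
    free(ipiv);
    return 0;
}
```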
Which Skylake? What size of inputs? Which OpenBLAS version? Any virtualisation? Is DGEMM slower too? (i.e. does the latest release improve/fix any of this?)
Skylake (Haswell refresh without AVX512) or SkylakeX (with AVX512)? DGEMM performance for the latter should be about on par with MKL if you use a very recent 0.3.x release or the current develop branch. At small problem sizes MKL will probably be faster, as OpenBLAS only has a single threshold for switching from using a single thread to using all available cores, while MKL seems to increase the thread count more gradually to match the workload. (Also, MKL may be using a more efficient LAPACK than the "netlib" reference implementation.)
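If you want to rule out that threshold when comparing small sizes, you can pin the thread count explicitly. A minimal sketch, assuming the OpenBLAS-specific openblas_set_num_threads()/openblas_get_num_threads() calls declared in OpenBLAS's cblas.h (exporting OPENBLAS_NUM_THREADS in the environment has the same effect):

```c
#include <stdio.h>
#include <cblas.h>   /* OpenBLAS's cblas.h declares the openblas_* extensions */

int main(void)
{
    /* Pin the thread count up front so the internal single- vs.
     * multi-threaded switch-over threshold cannot skew small-size timings. */
    openblas_set_num_threads(8);
    printf("BLAS limited to %d threads\n", openblas_get_num_threads());
    return 0;
}
```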
Thanks. It is SkylakeX with AVX-512. Tried input sizes from m=n=5000 to m=n=15000. OpenBLAS 0.3.7 - this should have the latest improvements? I can create a plot and profile DGEMM.
0.3.7 (from a year ago) had all parts of the initial AVX512 DGEMM implementation disabled as it turned out to be incorrect. AVX512 DGEMM reappeared in 0.3.8 and was further improved in 0.3.10, so ideally you should be trying that (or git develop)
Good to know, thanks! I'll try 0.3.10 and report back.
There is some data in PR #2646 (and test code linked in #2286).
Well, it turns out that I was using 0.3.10. But I have some more observations, as shown in the plots below. The dgetrf tests (left panel) were run with a 5000x5000 matrix, and the dgemm tests (right panel) were run with two 5000x5000 matrices. Plotted is the inverse of the elapsed time (in 1/s), to better show the scalability.
- For dgemm, OpenBLAS and MKL have about the same performance for any given number of threads;
- For dgetrf, OpenBLAS is only slightly (8%) slower in the single-thread case, but its scalability is much worse with increasing numbers of threads.

I can run a profiler on dgetrf and find out which part doesn't scale.
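Roughly, the sweep behind those plots can be reproduced with something like the following sketch (illustrative only: a fixed 5000x5000 matrix, a handful of thread counts, and the OpenBLAS-specific openblas_set_num_threads() call; an MKL build would set the thread count through MKL's own API or environment instead):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <cblas.h>      /* for openblas_set_num_threads() */
#include <lapacke.h>

static double now(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

int main(void)
{
    const lapack_int n = 5000;                  /* matches the plotted size */
    double *a0 = malloc((size_t)n * n * sizeof(double));
    double *a  = malloc((size_t)n * n * sizeof(double));
    lapack_int *ipiv = malloc((size_t)n * sizeof(lapack_int));
    if (!a0 || !a || !ipiv) return 1;

    srand(1);
    for (size_t i = 0; i < (size_t)n * n; i++)
        a0[i] = (double)rand() / RAND_MAX;

    const int threads[] = {1, 2, 4, 8, 16, 32};
    for (size_t k = 0; k < sizeof(threads) / sizeof(threads[0]); k++) {
        openblas_set_num_threads(threads[k]);
        /* Re-use the same input matrix for every thread count. */
        memcpy(a, a0, (size_t)n * n * sizeof(double));
        double t0 = now();
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, a, n, ipiv);
        double dt = now() - t0;
        printf("threads=%2d  dgetrf  %.3f s  (%.3f 1/s)\n",
               threads[k], dt, 1.0 / dt);
    }

    free(a0); free(a); free(ipiv);
    return 0;
}
```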
Interesting, thanks - GETRF is one of the few LAPACK functions that are reimplemented (lapack/getrf/getrf_parallel.c, already in the original GotoBLAS) rather than copied from the reference implementation. There were some fixes to my previous heavy-handed approach to making it thread-safe in February, perhaps there is more wrong with it. Also the DTRSM it calls is not optimized for SkylakeX (and neither is LASWP, another reimplemented function).
Yes, I noticed that OpenBLAS DGETRF is much faster than the netlib implementation, but as we see here it is still not as fast as MKL.