Benchmark: OpenBLAS and Intel MKL vs ATLAS
Hi,
This is not a problem report; I'd like to share my benchmarks of LAPACK/BLAS libraries. Driven by my huge simulation model, I have been upgrading both my CPU and my math library. My conclusion: Intel MKL is the best, and OpenBLAS is worth trying.
| No. of surface patches | Memory (GB) | 3.16 GHz Core2 Duo, ATLAS (sec) | 3.0 GHz Core2 Quad, OpenBLAS (sec) |
|---|---|---|---|
| 6,319 | 2.5 | 360 | 135 |
| 9,968 | 6.0 | 1,380 | 510 |
| 13,992 | 11.8 | 3,600 | 1,360 |
The simulation model (a smooth-walled 3-section conical horn antenna) is built from surface patches (SP and SC cards). Total run-time is measured with gettimeofday() rather than sysconf(), so these are wall-clock times. Note that OpenBLAS speeds up by more than the ratio of CPU cores (Duo vs. Quad). As the flat profile below shows, about 90% of the computation is in zgemm_kernel_n, which is parallelized across the cores.
Flat profile:
```
Each sample counts as 0.01 seconds.
  %   cumulative    self                self     total
 time   seconds    seconds      calls  s/call   s/call  name
89.99    289.89     289.89                              zgemm_kernel_n
 3.25    300.37      10.48                              sched_yield
 1.62    305.60       5.23                              ztrsm_kernel_LT
 1.45    310.26       4.66                              inner_advanced_thread
 0.73    312.61       2.35  39929761     0.00     0.00  nec_context::hintg(double, double, double)
```
matrix_algebra.cpp is modified for OpenBLAS:

```cpp
extern "C"
{
#include </usr/lib/openblas-base/include/lapacke.h>
#include </usr/lib/openblas-base/include/cblas.h>
}

// LU-factor the interaction matrix (PA = LU, pivot indices in ip):
int info = LAPACKE_zgetrf((int) CblasColMajor, (lapack_int) n, (lapack_int) n,
                          (lapack_complex_double*) a_in.data(), (lapack_int) ndim,
                          (lapack_int*) ip.data());

// Solve against the stored factors:
int info = LAPACKE_zgetrs((int) CblasColMajor, (char) CblasNoTrans,
                          (lapack_int) n, (lapack_int) 1,
                          (const lapack_complex_double*) a.data(), (lapack_int) ndim,
                          (const lapack_int*) ip.data(),
                          (lapack_complex_double*) b.data(), (lapack_int) n);
```

(Note that `CblasColMajor` (102) happens to have the same value as `LAPACK_COL_MAJOR`, which is the constant LAPACKE actually expects for its first argument.)
With regard to the transposed-matrix case, zgetrs.c of OpenBLAS is also modified:

```c
if (trans_arg == 'o') trans = 0;
if (trans_arg == 'p') trans = 1;
if (trans_arg == 'q') trans = 2;
if (trans_arg == 'r') trans = 3;
```
This is a dirty solution; I'd appreciate it if someone could suggest a better one. OpenBLAS is superb, but I experienced a segmentation fault when memory usage exceeded 60 GB on an 8-core CPU. I confirmed that this seg-fault is NOT caused by NEC2++, but fixing the problem in OpenBLAS was beyond my capability, so I migrated to Intel MKL.
| No. of surface patches | Memory (GB) | 3.0 GHz Core2 Quad, OpenBLAS (sec) | 2.93 GHz Dual X5570 (8 cores), OpenBLAS (sec) | 2.93 GHz Dual X5570 (8 cores), Intel MKL (sec) |
|---|---|---|---|---|
| 6,319 | 2.5 | 135 | 68 | 67 |
| 9,968 | 6.0 | 510 | 253 | 247 |
| 13,992 | 11.8 | 1,360 | 669 | 663 |
| 19,096 | 21.9 | - | 1,671 | 1,663 |
| 24,957 | 37.3 | - | 3,760 | 3,659 |
| 31,641 | 59.9 | - | 7,633 | 7,417 |
| 39,117 | 91.5 | - | Seg-Fault | 14,004 |
matrix_algebra.cpp is modified for Intel MKL:

```cpp
#include </opt/intel/composer_xe_201.1.117/mkl/include/mkl_lapacke.h>
#include </opt/intel/composer_xe_201.1.117/mkl/include/mkl_cblas.h>

int info = LAPACKE_zgetrf(CblasColMajor, n, n, (MKL_Complex16*) a_in.data(),
                          ndim, (int*) ip.data());

int info = LAPACKE_zgetrs(CblasColMajor, 'N', n, 1,
                          (const MKL_Complex16*) a.data(), ndim,
                          (const int*) ip.data(), (MKL_Complex16*) b.data(), n);
```
Link options are:

```
-Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a $(MKLROOT)/lib/intel64/libmkl_core.a $(MKLROOT)/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -openmp -I$(MKLROOT)/include
```
The Intel Math Kernel Library Link Line Advisor suggests these options. I used slightly older versions of the resources:

• NEC2++: ver. 1.5.1
• OpenBLAS: ver. 2.5
• Intel MKL: ver. 11.1
• gcc: ver. 4.7.2
• icc: ver. 13.0.1
I hope this may help your serious number-crunching.
Best regards.
Yoshi Takeyasu
Hi Yoshi
This is very interesting information. I have been working on getting necpp to work with Eigen (eigen.tuxfamily.org); however, it has been difficult because Eigen aligns the rows and columns of matrices on 4-byte address boundaries. I will keep trying, as it will make an interesting comparison as well.
Kind Regards
Tim Molteno
You ran ATLAS on a dual-core machine and the others on a quad-core, so if you adjust the ATLAS numbers they are about the same. Plus, what's the cost of OpenBLAS? $0. MKL? $$$$. I'll stick with OpenBLAS.
I'd like to see you benchmark PLASMA, BLIS, and libFLAME; I think these will be faster than OpenBLAS, as they have been updated with current kernels. Throw in an OpenCL comparison if you can, and also try an OpenMPI build.
OpenBLAS doesn't have a lot of kernels to tune, so when it ran its generic x86_64 configure it probably didn't determine your cache size correctly, and it is bombing when malloc returns a null pointer. Probably.
http://gcdart.blogspot.jp/2013/06/fast-matrix-multiply-and-ml.html
This is a good reference for the discussion.
Just FYI: MKL is now FREE, free as in free beer, or a free couch on the side of the road used by a guy who looks like Homer Simpson, or free as in the US's ideology on speech. https://software.intel.com/en-us/articles/free_mkl

*Disclaimer: The words above are my own and do not reflect the opinions or ideals of Intel. This is not endorsed by any entity.*
@ytakeyasu Can you share the full compile args you used to link OpenBLAS and MKL? Thanks
Hi, As I reported in my first post, I used the Intel Math Kernel Library Link Line Advisor to find my link options. The parameters I input to the Advisor are:
• Intel(R) product: Intel(R) MKL 11.1
• OS: Linux
• Usage model of Intel(R) Xeon Phi(TM) Coprocessor: None
• Compiler: Intel(R) C/C++
• Architecture: Intel(R) 64
• Dynamic or static linking: Static
• Interface layer: LP64 (32-bit integer)
• Sequential or multi-threaded layer: Multi-threaded
• OpenMP library: Intel(R) (libiomp5)
Then I got the link options as follows:

```
-Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a $(MKLROOT)/lib/intel64/libmkl_core.a $(MKLROOT)/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -openmp -I$(MKLROOT)/include
```
Regards.
Yoshi Takeyasu
I hope this thread is still active. Did you install the libraries yourself from source, or did you use the stock ATLAS and OpenBLAS from a repository? ATLAS really has to be tuned to your system; the tuning can give factors of at least 2-3.