Benchmark: OpenBLAS and Intel MKL vs ATLAS
Hi,
This is not a problem report; I'd like to share my benchmarks of LAPACK/BLAS libraries. Driven by my huge simulation model, I have been upgrading both my CPU and my math library. My conclusion: Intel MKL is the best, and OpenBLAS is worth trying.
| No. of surface patches | Memory (GB) | 3.16 GHz Core2 Duo, ATLAS (sec) | 3.0 GHz Core2 Quad, OpenBLAS (sec) |
|---|---|---|---|
| 6,319 | 2.5 | 360 | 135 |
| 9,968 | 6.0 | 1,380 | 510 |
| 13,992 | 11.8 | 3,600 | 1,360 |
The simulation model (a smooth-walled 3-section conical horn antenna) is built from surface patches (SP and SC cards). Total run-time is measured with gettimeofday() rather than sysconf(), so these are wall-clock times. Note that OpenBLAS speeds up by more than the ratio of CPU cores (Duo vs. Quad). As the flat profile below shows, about 90% of the computation is in zgemm_kernel_n, which is parallelized across the cores.
Flat profile:
```
Each sample counts as 0.01 seconds.
  %   cumulative    self                self     total
 time   seconds    seconds      calls  s/call   s/call  name
89.99    289.89     289.89                              zgemm_kernel_n
 3.25    300.37      10.48                              sched_yield
 1.62    305.60       5.23                              ztrsm_kernel_LT
 1.45    310.26       4.66                              inner_advanced_thread
 0.73    312.61       2.35  39929761     0.00     0.00  nec_context::hintg(double, double, double)
```
matrix_algebra.cpp is modified for OpenBLAS:

```cpp
extern "C"
{
#include </usr/lib/openblas-base/include/lapacke.h>
#include </usr/lib/openblas-base/include/cblas.h>
}

// LU-factor the interaction matrix (PA = LU, pivot indices in ip):
int info = LAPACKE_zgetrf((int) CblasColMajor, (lapack_int) n, (lapack_int) n,
                          (lapack_complex_double*) a_in.data(), (lapack_int) ndim,
                          (lapack_int*) ip.data());

// Solve against the stored factors:
int info = LAPACKE_zgetrs((int) CblasColMajor, (char) CblasNoTrans,
                          (lapack_int) n, (lapack_int) 1,
                          (const lapack_complex_double*) a.data(), (lapack_int) ndim,
                          (const lapack_int*) ip.data(),
                          (lapack_complex_double*) b.data(), (lapack_int) n);
```

(Note that `CblasColMajor` (102) happens to have the same value as `LAPACK_COL_MAJOR`, which is the constant LAPACKE actually expects for its first argument.)
With regard to the transposed-matrix case, zgetrs.c of OpenBLAS is also modified:

```c
if (trans_arg == 'o') trans = 0;
if (trans_arg == 'p') trans = 1;
if (trans_arg == 'q') trans = 2;
if (trans_arg == 'r') trans = 3;
```
This is a dirty solution; I'd appreciate it if someone could suggest a better one. OpenBLAS is superb, but I experienced a segmentation fault when memory usage exceeded 60 GB on an 8-core CPU. I confirmed that this seg-fault is NOT caused by NEC2++, but fixing the problem in OpenBLAS was beyond my capability, so I migrated to Intel MKL.
| No. of surface patches | Memory (GB) | 3.0 GHz Core2 Quad, OpenBLAS (sec) | 2.93 GHz Dual X5570 (8 cores), OpenBLAS (sec) | 2.93 GHz Dual X5570 (8 cores), Intel MKL (sec) |
|---|---|---|---|---|
| 6,319 | 2.5 | 135 | 68 | 67 |
| 9,968 | 6.0 | 510 | 253 | 247 |
| 13,992 | 11.8 | 1,360 | 669 | 663 |
| 19,096 | 21.9 | - | 1,671 | 1,663 |
| 24,957 | 37.3 | - | 3,760 | 3,659 |
| 31,641 | 59.9 | - | 7,633 | 7,417 |
| 39,117 | 91.5 | - | Seg-Fault | 14,004 |
matrix_algebra.cpp is modified for Intel MKL:

```cpp
#include </opt/intel/composer_xe_201.1.117/mkl/include/mkl_lapacke.h>
#include </opt/intel/composer_xe_201.1.117/mkl/include/mkl_cblas.h>

int info = LAPACKE_zgetrf(CblasColMajor, n, n, (MKL_Complex16*) a_in.data(),
                          ndim, (int*) ip.data());

int info = LAPACKE_zgetrs(CblasColMajor, 'N', n, 1,
                          (const MKL_Complex16*) a.data(), ndim,
                          (const int*) ip.data(), (MKL_Complex16*) b.data(), n);
```
Link options are:

```
-Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a $(MKLROOT)/lib/intel64/libmkl_core.a $(MKLROOT)/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -openmp -I$(MKLROOT)/include
```
The Intel Math Kernel Library Link Line Advisor suggests these options. I used slightly older versions of the resources:

• NEC2++: ver. 1.5.1
• OpenBLAS: ver. 2.5
• Intel MKL: ver. 11.1
• gcc: ver. 4.7.2
• icc: ver. 13.0.1
I hope this may help your serious number-crunching.
Best regards.
Yoshi Takeyasu
Hi Yoshi
This is very interesting information. I have been working on getting necpp to work with Eigen (eigen.tuxfamily.org); however, it has been difficult because Eigen aligns the rows and columns of matrices on 4-byte address boundaries. I will keep trying, as it will make an interesting comparison as well.
Kind Regards
Tim Molteno
You ran ATLAS on a dual-core machine and the others on a quad-core, so if you adjust the ATLAS numbers they are about the same. Plus, what's the cost of OpenBLAS? $0. MKL? $$$$. I'll stick with OpenBLAS.
I'd like to see you benchmark PLASMA, BLIS, and libFLAME; I think these will be faster than OpenBLAS, as they have been updated with current kernels. Throw in an OpenCL comparison if you can, and also try an OpenMPI build.
OpenBLAS doesn't have a lot of kernels to tune, so when it ran its generic x86_64 configure it probably didn't determine your cache size correctly, and it is bombing when malloc returns a null pointer. Probably.
http://gcdart.blogspot.jp/2013/06/fast-matrix-multiply-and-ml.html
This is a good reference for the discussion.
Just FYI: MKL is now FREE, free as in free beer, or a free couch on the side of the road used by a guy who looks like Homer Simpson, or free as in the US's ideology on speech. https://software.intel.com/en-us/articles/free_mkl

*Disclaimer: The words above are my own and do not reflect the opinions or ideals of Intel. This is not endorsed by any entity.*
@ytakeyasu Can you share the full compile args you used to link OpenBLAS and MKL? Thanks
Hi, As I reported in my first post, I used the Intel Math Kernel Library Link Line Advisor to find my link options. The parameters I input to the Advisor are:
• Intel(R) product: Intel(R) MKL 11.1
• OS: Linux
• Usage model of Intel(R) Xeon Phi(TM) Coprocessor: None
• Compiler: Intel(R) C/C++
• Architecture: Intel(R) 64
• Dynamic or static linking: Static
• Interface layer: LP64 (32-bit integer)
• Sequential or multi-threaded layer: Multi-threaded
• OpenMP library: Intel(R) (libiomp5)
Then I got the link options as follows:

```
-Wl,--start-group $(MKLROOT)/lib/intel64/libmkl_intel_lp64.a $(MKLROOT)/lib/intel64/libmkl_core.a $(MKLROOT)/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -openmp -I$(MKLROOT)/include
```
Regards.
Yoshi Takeyasu
I hope this thread is still active. Did you install the libraries yourself from source, or did you use the stock ATLAS and OpenBLAS from a repository? ATLAS really has to be tuned to your system; the tuning can give factors of at least 2-3.