gemm_s8s8s32: latest version is 1.4x slower than v0.21 with Intel MKL on AVX2
Summary
https://github.com/intel/mkl-dnn/commit/274be8228a0dba6391c2769c37cd68a3bb730fbf added AVX2 optimizations for igemm kernels (as discussed in https://github.com/intel/mkl-dnn/issues/532). However, the execution appears to be 1.4x slower than using version v0.21 compiled with Intel MKL.
In our specific case, this is blocking an upgrade to a newer version of DNNL as Intel MKL offers better performance and a wider range of optimized instruction sets.
Version
The benchmark was run with the latest commit on master: d279b39d978b7d3d3f4f69a427f4cb91d754b9fe.
Environment
- CPU make and model
- Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
- OS version:
Linux minigpu 4.4.0-166-generic #195-Ubuntu SMP Tue Oct 1 09:35:25 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Compiler version: 5.5.0
- CMake version: 3.5.1
- CMake output log: (not relevant)
- git hash: d279b39d978b7d3d3f4f69a427f4cb91d754b9fe
Steps to reproduce
Here is the code that I used for benchmarking:
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <mkldnn.hpp>
#define BENCHMARK(fun_call, samples) \
  do { \
    std::cerr << "benchmarking " #fun_call << std::endl; \
    /* Warm-up runs. */ \
    for (size_t i = 0; i < 10; ++i) \
      fun_call; \
    std::chrono::high_resolution_clock::time_point t1 = \
        std::chrono::high_resolution_clock::now(); \
    for (size_t i = 0; i < static_cast<size_t>(samples); ++i) \
      fun_call; \
    std::chrono::high_resolution_clock::time_point t2 = \
        std::chrono::high_resolution_clock::now(); \
    auto duration = \
        std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(); \
    /* Report the average per-call time in milliseconds. */ \
    std::cerr << "avg " \
              << static_cast<double>(duration) / (samples * 1000) \
              << " ms" << std::endl; \
  } while (false)
template <typename T>
static T* create_buffer(int size, T value = T()) {
  T* buffer = static_cast<T*>(aligned_alloc(64, size * sizeof(T)));
  std::fill(buffer, buffer + size, value);
  return buffer;
}
int main() {
  const int m = 64;
  const int n = 2048;
  const int k = 512;
  const char* transa = "N";
  const char* transb = "T";
  const char* offsetc = "F";
  const int lda = *transa == 'T' ? m : k;
  const int ldb = *transb == 'T' ? k : n;
  const int ldc = n;
  const float alpha = 1;
  const float beta = 0;
  int8_t ao = 0;
  int8_t bo = 0;
  int32_t co = 0;
  auto* a = create_buffer<int8_t>(m * k, 1);
  auto* b = create_buffer<int8_t>(n * k, 2);
  auto* c = create_buffer<int32_t>(m * n);
#if MKLDNN_VERSION_MAJOR < 1
  // mkl-dnn 0.x assumes column major order so swap a and b accordingly.
  BENCHMARK(mkldnn_gemm_s8s8s32(transb, transa, offsetc,
                                &n, &m, &k,
                                &alpha,
                                b, &ldb, &bo,
                                a, &lda, &ao,
                                &beta,
                                c, &ldc, &co),
            50000);
#else
  BENCHMARK(mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc,
                                m, n, k,
                                alpha,
                                a, lda, ao,
                                b, ldb, bo,
                                beta,
                                c, ldc, &co),
            50000);
#endif
  free(a);
  free(b);
  free(c);
}
Compilation with v0.21.2
- cmake -DCMAKE_INSTALL_PREFIX=$PWD/../install/0.21.2 -DARCH_OPT_FLAGS="" -DMKLROOT=/opt/intel/mkl -DMKLDNN_USE_MKL=FULL:STATIC -DMKLDNN_THREADING=OMP:INTEL -DWITH_TEST=OFF -DWITH_EXAMPLE=OFF ..
- g++ -std=c++17 -O3 -I$PWD/install/0.21.2/include -L$PWD/install/0.21.2/lib -Wl,-rpath,$PWD/install/0.21.2/lib -L/opt/intel/lib/intel64/ -Wl,-rpath,/opt/intel/lib/intel64/ -o benchmark_gemm_s8s8 benchmark_gemm_s8s8.cc -lmkldnn -liomp5
Compilation with d279b39d978b7d3d3f4f69a427f4cb91d754b9fe
- cmake -DCMAKE_INSTALL_PREFIX=$PWD/../install/d279b39 -DDNNL_ARCH_OPT_FLAGS="" -DDNNL_BUILD_TESTS=OFF -DDNNL_BUILD_EXAMPLES=OFF ..
- g++ -std=c++17 -O3 -I$PWD/install/d279b39/include -L$PWD/install/d279b39/lib -Wl,-rpath,$PWD/install/d279b39/lib -o benchmark_gemm_s8s8 benchmark_gemm_s8s8.cc -lmkldnn
Execution
OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8
Observed behavior
Here are the observed results on my system:
- v0.21.2:
avg 0.217827 ms
- d279b39d978b7d3d3f4f69a427f4cb91d754b9fe:
avg 0.302498 ms
Expected behavior
The new version should ideally be at least as fast as the older one.
Do you think the performance of gemm_s8s8s32 on AVX2 could be improved in the future? If not, what are your recommendations for reaching the performance of v0.21 without Intel MKL?
Thanks for your time.
Reproduced for 274be82 but not for tip of master. Is there a particular reason you cannot use the latest revision?
I was able to reproduce it with d279b39. We already have a fix for it; it should be out soon. As a workaround, you can add -DCMAKE_BUILD_TYPE=Release to the cmake line. If that works for you, I will close this issue.
mkl-dnn was compiled with the default build type which is Release:
https://github.com/intel/mkl-dnn/blob/master/CMakeLists.txt#L50-L53
You are correct, that is the default build mode. This is related to a build issue.
Even though that's the default build mode, it seems the s8s8 gemm file (mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp) was not being optimized.
Relying on the default build mode:
g++ -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -fPIC -Wformat -Wformat-security -fstack-protector-strong -fopenmp -Wmissing-field-initializers -Wno-strict-overflow -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/include -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/build_bisect/include -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/common -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu/xbyak -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c /nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp
Using -DCMAKE_BUILD_TYPE=Release:
g++ -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -fPIC -Wformat -Wformat-security -fstack-protector-strong -fopenmp -Wmissing-field-initializers -Wno-strict-overflow -O3 -DNDEBUG -D_FORTIFY_SOURCE=2 -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/include -Ilrepos/mkl-dnn/build_bisect/include -Ilrepos/mkl-dnn/src -Ilrepos/mkl-dnn/src/common -Ilrepos/mkl-dnn/src/cpu -Ilrepos/mkl-dnn/src/cpu/xbyak -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c lrepos/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp
In other words, the build line doesn't contain -O3 -DNDEBUG -D_FORTIFY_SOURCE=2, which makes the function slow. This is fixed in the internal repository and will be pushed out as soon as possible.
Please use the workaround (-DCMAKE_BUILD_TYPE=Release) if possible.
Thanks for the details! I verified that the file mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp is compiled with optimization flags:
cd /home/klein/dev/mkl-dnn/build/src/cpu && /usr/bin/c++ -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -I/home/klein/dev/mkl-dnn/include -I/home/klein/dev/mkl-dnn/build/include -I/home/klein/dev/mkl-dnn/src -I/home/klein/dev/mkl-dnn/src/common -I/home/klein/dev/mkl-dnn/src/cpu -I/home/klein/dev/mkl-dnn/src/cpu/xbyak -std=c++11 -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -fPIC -Wformat -Wformat-security -fstack-protector-strong -fopenmp -Wmissing-field-initializers -Wno-strict-overflow -O3 -DNDEBUG -D_FORTIFY_SOURCE=2 -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c /home/klein/dev/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp
However, that does not seem to change the benchmark numbers reported above.
I was able to reproduce it with d279b39
@aaraujom What was the speed difference that you measured before and after the build fix?
The build without optimizations was approximately 3x slower. If both builds have optimizations, I don't see a significant difference on my system. In other words, I can't reproduce it.
Using v0.21.2:
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_old
benchmarking FUN(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.59345 ms
Using d279b39:
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_new
benchmarking FUN(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.610478 ms
You might want to try the following to root cause the issue:
- Set the affinity: OMP_PLACES=cores OMP_PROC_BIND=close for GNU OpenMP, or KMP_AFFINITY=compact,granularity=fine for Intel OpenMP.
- For best performance you can preload Intel OpenMP: LD_PRELOAD=/path/to/libiomp5.so ./your_executable.
This will help to get more stable results and eliminate the possibility of performance differences due to the threading library used.
I compiled MKL-DNN v0.21.2 with -DMKLDNN_THREADING=OMP:COMP to use the same OpenMP runtime and set OMP_PLACES=cores OMP_PROC_BIND=close during execution. The benchmark numbers are essentially unchanged:
$ ldd benchmark_gemm_s8s8_old
linux-vdso.so.1 => (0x00007fff79192000)
libmkldnn.so.0 => /home/klein/dev/mkl-dnn/install/0.21.2/lib/libmkldnn.so.0 (0x00007f4bf8ae7000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f4bf8704000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4bf833a000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4bf8136000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4bf7e2d000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f4bf7bf6000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f4bf79de000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4bf77c1000)
/lib64/ld-linux-x86-64.so.2 (0x00007f4bfb75d000)
$ OMP_PLACES=cores OMP_PROC_BIND=close OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_old
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.208248 ms
$ ldd benchmark_gemm_s8s8_new
linux-vdso.so.1 => (0x00007ffd41940000)
libdnnl.so.1 => /home/klein/dev/mkl-dnn/install/d279b39d/lib/libdnnl.so.1 (0x00007fb0660e6000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb065d03000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb065939000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb06571c000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb065518000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb06520f000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fb064fd8000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb064dc0000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb067634000)
$ OMP_PLACES=cores OMP_PROC_BIND=close OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_new
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.299706 ms
I guess this issue can be closed if it is not reproducible.
@guillaumekln I was able to reproduce it on an i7-6700K. There seem to be two issues: one with the build system and one with the s8s8 gemm. Thank you for your persistence, I will investigate.
$ make run
echo "Running benchmark for v0.21.2"
Running benchmark for v0.21.2
OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close \
./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.249923 ms
echo "Running benchmark for d279b39"
Running benchmark for d279b39
OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close \
./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.359445 ms
I did a quick investigation. It seems this is a real issue, but it is not so simple to fix since it depends on how memory is managed in DNNL.
As a workaround, instead of limiting to only 4 threads, use all 8 available threads.
While MKL performs best using the exact number of physical cores on your machine (4) rather than all the hyperthreads (8), for d279b39 using all hyperthreads gives the best performance, comparable to v0.21.2:
4 threads v0.21.2
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.248977 ms
8 threads v0.21.2
$ OMP_NUM_THREADS=8 ./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.448652 ms
4 threads d279b39
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.359389 ms
8 threads d279b39
$ OMP_NUM_THREADS=8 ./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.243672 ms
Thanks for the details. Do you expect this to change in the future, or should we simply assume this is how DNNL operates?
We certainly want to resolve this. We are just still bikeshedding the solution :) In DNNL we have scratchpads as well, so we are not sure whether we need a memory manager or a change to the GEMM API.
Upon further investigation, the root cause is in the block size. Reassigning to @aaraujom for analysis and fix.
I'm curious, just reading this, whether OMP_NUM_THREADS=4 for this test is thought to imply that thread assignment is one per core ... Does this problem disappear if hyper-threading is disabled?