gemm_s8s8s32: latest version is 1.4x slower than v0.21 with Intel MKL on AVX2
Summary
https://github.com/intel/mkl-dnn/commit/274be8228a0dba6391c2769c37cd68a3bb730fbf added AVX2 optimizations for igemm kernels (as discussed in https://github.com/intel/mkl-dnn/issues/532). However, the execution appears to be 1.4x slower than using version v0.21 compiled with Intel MKL.
In our specific case, this is blocking an upgrade to a newer version of DNNL as Intel MKL offers better performance and a wider range of optimized instruction sets.
Version
The benchmark was run with the latest commit on master: d279b39d978b7d3d3f4f69a427f4cb91d754b9fe.
Environment
- CPU make and model
- Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
- fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch invpcid_single intel_pt ssbd ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
- OS version:
Linux minigpu 4.4.0-166-generic #195-Ubuntu SMP Tue Oct 1 09:35:25 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
- Compiler version: 5.5.0
- CMake version: 3.5.1
- CMake output log: (not relevant)
- git hash: d279b39d978b7d3d3f4f69a427f4cb91d754b9fe
Steps to reproduce
Here is the code that I used for benchmarking:
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <mkldnn.hpp>
#define BENCHMARK(fun_call, samples) \
  do { \
    std::cerr << "benchmarking " #fun_call << std::endl; \
    /* Warm-up runs. */ \
    for (size_t i = 0; i < 10; ++i) \
      fun_call; \
    std::chrono::high_resolution_clock::time_point t1 = \
        std::chrono::high_resolution_clock::now(); \
    for (size_t i = 0; i < static_cast<size_t>(samples); ++i) \
      fun_call; \
    std::chrono::high_resolution_clock::time_point t2 = \
        std::chrono::high_resolution_clock::now(); \
    auto duration = \
        std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count(); \
    /* Report the average per-call time in milliseconds. */ \
    std::cerr << "avg " \
              << static_cast<double>(duration) / (samples * 1000) \
              << " ms" << std::endl; \
  } while (false)
template <typename T>
static T* create_buffer(int size, T value = T()) {
  T* buffer = static_cast<T*>(aligned_alloc(64, size * sizeof(T)));
  std::fill(buffer, buffer + size, value);
  return buffer;
}
int main() {
  const int m = 64;
  const int n = 2048;
  const int k = 512;
  const char* transa = "N";
  const char* transb = "T";
  const char* offsetc = "F";
  const int lda = *transa == 'T' ? m : k;
  const int ldb = *transb == 'T' ? k : n;
  const int ldc = n;
  const float alpha = 1;
  const float beta = 0;
  int8_t ao = 0;
  int8_t bo = 0;
  int32_t co = 0;
  auto* a = create_buffer<int8_t>(m * k, 1);
  auto* b = create_buffer<int8_t>(n * k, 2);
  auto* c = create_buffer<int32_t>(m * n);
#if MKLDNN_VERSION_MAJOR < 1
  // mkl-dnn 0.x assumes column major order so swap a and b accordingly.
  BENCHMARK(mkldnn_gemm_s8s8s32(transb, transa, offsetc,
                                &n, &m, &k,
                                &alpha,
                                b, &ldb, &bo,
                                a, &lda, &ao,
                                &beta,
                                c, &ldc, &co),
            50000);
#else
  BENCHMARK(mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc,
                                m, n, k,
                                alpha,
                                a, lda, ao,
                                b, ldb, bo,
                                beta,
                                c, ldc, &co),
            50000);
#endif
  free(a);
  free(b);
  free(c);
}
Compilation with v0.21.2
- cmake -DCMAKE_INSTALL_PREFIX=$PWD/../install/0.21.2 -DARCH_OPT_FLAGS="" -DMKLROOT=/opt/intel/mkl -DMKLDNN_USE_MKL=FULL:STATIC -DMKLDNN_THREADING=OMP:INTEL -DWITH_TEST=OFF -DWITH_EXAMPLE=OFF ..
- g++ -std=c++17 -O3 -I$PWD/install/0.21.2/include -L$PWD/install/0.21.2/lib -Wl,-rpath,$PWD/install/0.21.2/lib -L/opt/intel/lib/intel64/ -Wl,-rpath,/opt/intel/lib/intel64/ -o benchmark_gemm_s8s8 benchmark_gemm_s8s8.cc -lmkldnn -liomp5
Compilation with d279b39d978b7d3d3f4f69a427f4cb91d754b9fe
- cmake -DCMAKE_INSTALL_PREFIX=$PWD/../install/d279b39 -DDNNL_ARCH_OPT_FLAGS="" -DDNNL_BUILD_TESTS=OFF -DDNNL_BUILD_EXAMPLES=OFF ..
- g++ -std=c++17 -O3 -I$PWD/install/d279b39/include -L$PWD/install/d279b39/lib -Wl,-rpath,$PWD/install/d279b39/lib -o benchmark_gemm_s8s8 benchmark_gemm_s8s8.cc -lmkldnn
Execution
OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8
Observed behavior
Here are the observed results on my system:
- v0.21.2:
avg 0.217827 ms
- d279b39d978b7d3d3f4f69a427f4cb91d754b9fe:
avg 0.302498 ms
Expected behavior
The new version should ideally be at least as fast as the older one.
Do you think the performance of gemm_s8s8s32 on AVX2 could be improved in the future? If not, what are your recommendations for reaching the performance of v0.21 without Intel MKL?
Thanks for your time.
Reproduced for 274be82 but not for tip of master. Is there a particular reason you cannot use the latest revision?
I was able to reproduce it with d279b39. We already have a fix for it; it should be out soon. As a workaround, you can add -DCMAKE_BUILD_TYPE=Release to the cmake line. If that works for you, I will close this issue.
mkl-dnn was compiled with the default build type which is Release:
https://github.com/intel/mkl-dnn/blob/master/CMakeLists.txt#L50-L53
You are correct, that is the default build mode. This is related to a build issue.
Even though that's the default build mode, it seems the s8s8 gemm file (mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp) was not being optimized.
Relying on the default build mode:
g++ -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -fPIC -Wformat -Wformat-security -fstack-protector-strong -fopenmp -Wmissing-field-initializers -Wno-strict-overflow -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/include -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/build_bisect/include -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/common -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu/xbyak -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c /nfs/pdx/home/aaraujom/lrepos/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp
Using -DCMAKE_BUILD_TYPE=Release:
g++ -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -std=c++11 -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -fPIC -Wformat -Wformat-security -fstack-protector-strong -fopenmp -Wmissing-field-initializers -Wno-strict-overflow -O3 -DNDEBUG -D_FORTIFY_SOURCE=2 -I/nfs/pdx/home/aaraujom/lrepos/mkl-dnn/include -Ilrepos/mkl-dnn/build_bisect/include -Ilrepos/mkl-dnn/src -Ilrepos/mkl-dnn/src/common -Ilrepos/mkl-dnn/src/cpu -Ilrepos/mkl-dnn/src/cpu/xbyak -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c lrepos/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp
In other words, the build line doesn't contain -O3 -DNDEBUG -D_FORTIFY_SOURCE=2, which makes the function slow. This is fixed in the internal repository and will be pushed out as soon as possible.
Please use the workaround (-DCMAKE_BUILD_TYPE=Release) if possible.
Thanks for the details! I verified that the file mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp is compiled with optimization flags:
cd /home/klein/dev/mkl-dnn/build/src/cpu && /usr/bin/c++ -DDNNL_DLL -DDNNL_DLL_EXPORTS -DDNNL_ENABLE_MAX_CPU_ISA -D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS -I/home/klein/dev/mkl-dnn/include -I/home/klein/dev/mkl-dnn/build/include -I/home/klein/dev/mkl-dnn/src -I/home/klein/dev/mkl-dnn/src/common -I/home/klein/dev/mkl-dnn/src/cpu -I/home/klein/dev/mkl-dnn/src/cpu/xbyak -std=c++11 -fvisibility-inlines-hidden -Wall -Wno-unknown-pragmas -fvisibility=internal -fPIC -Wformat -Wformat-security -fstack-protector-strong -fopenmp -Wmissing-field-initializers -Wno-strict-overflow -O3 -DNDEBUG -D_FORTIFY_SOURCE=2 -o CMakeFiles/dnnl_cpu.dir/gemm/s8x8s32/simple_gemm_s8s8s32.cpp.o -c /home/klein/dev/mkl-dnn/src/cpu/gemm/s8x8s32/simple_gemm_s8s8s32.cpp
However, that does not seem to change the benchmark numbers reported above.
I was able to reproduce it with d279b39
@aaraujom What was the speed difference that you measured before and after the build fix?
The build without optimizations was approximately 3x slower. If both builds have optimizations, I don't see a significant difference on my system. In other words, I can't reproduce it.
Using v0.21.2:
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_old
benchmarking FUN(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.59345 ms
Using d279b39:
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_new
benchmarking FUN(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.610478 ms
You might want to try the following to root cause the issue:
- Set the affinity: OMP_PLACES=cores OMP_PROC_BIND=close for GNU OpenMP, or KMP_AFFINITY=compact,granularity=fine for Intel OpenMP.
- For best performance you can preload Intel OpenMP: LD_PRELOAD=/path/to/libiomp5.so ./your_executable.
This will help to get more stable results and eliminate the possibility of performance differences due to the threading library used.
I compiled MKL-DNN v0.21.2 with -DMKLDNN_THREADING=OMP:COMP to use the same OpenMP runtime and set OMP_PLACES=cores OMP_PROC_BIND=close during execution. The benchmark numbers are essentially unchanged:
$ ldd benchmark_gemm_s8s8_old
linux-vdso.so.1 => (0x00007fff79192000)
libmkldnn.so.0 => /home/klein/dev/mkl-dnn/install/0.21.2/lib/libmkldnn.so.0 (0x00007f4bf8ae7000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f4bf8704000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4bf833a000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4bf8136000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4bf7e2d000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007f4bf7bf6000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f4bf79de000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4bf77c1000)
/lib64/ld-linux-x86-64.so.2 (0x00007f4bfb75d000)
$ OMP_PLACES=cores OMP_PROC_BIND=close OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_old
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.208248 ms
$ ldd benchmark_gemm_s8s8_new
linux-vdso.so.1 => (0x00007ffd41940000)
libdnnl.so.1 => /home/klein/dev/mkl-dnn/install/d279b39d/lib/libdnnl.so.1 (0x00007fb0660e6000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fb065d03000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb065939000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb06571c000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb065518000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb06520f000)
libgomp.so.1 => /usr/lib/x86_64-linux-gnu/libgomp.so.1 (0x00007fb064fd8000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb064dc0000)
/lib64/ld-linux-x86-64.so.2 (0x00007fb067634000)
$ OMP_PLACES=cores OMP_PROC_BIND=close OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_new
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.299706 ms
I guess this issue can be closed if it is not reproducible.
@guillaumekln I was able to reproduce it on an i7-6700K. There seem to be two issues: one with the build system and one with the s8s8 gemm. Thank you for your persistence, I will investigate.
$ make run
echo "Running benchmark for v0.21.2"
Running benchmark for v0.21.2
OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close \
./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.249923 ms
echo "Running benchmark for d279b39"
Running benchmark for d279b39
OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close \
./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.359445 ms
I did a quick investigation. It seems this is a real issue, but it is not so simple to fix since it depends on how memory is managed in DNNL.
As a workaround, instead of limiting to only 4 threads, use all 8 available threads.
While MKL performs best using the exact number of physical cores on your machine (4) rather than all the hyperthreads (8), for d279b39 using all hyperthreads gives the best performance, comparable to v0.21.2:
4 threads v0.21.2
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.248977 ms
8 threads v0.21.2
$ OMP_NUM_THREADS=8 ./benchmark_gemm_s8s8_v0212
benchmarking mkldnn_gemm_s8s8s32(transb, transa, offsetc, &n, &m, &k, &alpha, b, &ldb, &bo, a, &lda, &ao, &beta, c, &ldc, &co)
avg 0.448652 ms
4 threads d279b39
$ OMP_NUM_THREADS=4 ./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.359389 ms
8 threads d279b39
$ OMP_NUM_THREADS=8 ./benchmark_gemm_s8s8_d279b39
benchmarking mkldnn_gemm_s8s8s32(*transa, *transb, *offsetc, m, n, k, alpha, a, lda, ao, b, ldb, bo, beta, c, ldc, &co)
avg 0.243672 ms
Thanks for the details. Do you expect this to change in the future, or should we simply assume this is how DNNL operates?
We certainly want to resolve this. We are just still bikeshedding the solution :) In DNNL we have scratchpads as well, so we are not sure whether we need a memory manager or a change to the GEMM API.
Upon further investigation, the root cause is in the block size. Reassigning to @aaraujom for analysis and fix.
I'm curious, just reading this, whether OMP_NUM_THREADS=4 for this test is thought to imply that thread assignment is one per core ... Does this problem disappear if hyper-threading is disabled?