DNNL vs FBGEMM u8s8s32 with small m
I noticed that DNNL u8s8s32 single-core performance is slower than FBGEMM's when m is small (m < 256), but faster when m is large.
E.g. for m=16, k=n=768 I get 0.188 ms vs 0.158 ms for DNNL and FBGEMM respectively, and for m=1024, k=n=768 I get 7.5 ms vs 8.4 ms.
Could this be caused by unfortunate choice of hyperparameters? Are there any plans to improve performance for such cases?
I'm using DNNL v1.2 on an AVX-512 machine.
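A minimal sketch of how the DNNL side can be invoked for these sizes via the dnnl_gemm_u8s8s32 C API (only the sizes come from the numbers above; the fill values, iteration count, and timing loop are placeholders, not the actual benchmark):

```cpp
// Sketch of a single-threaded u8s8s32 GEMM timing loop (illustrative only).
// Row-major C = A*B with u8 A, s8 B, s32 C, zero offsets.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

#include "dnnl.h"

int main() {
    const dnnl_dim_t m = 16, n = 768, k = 768;
    std::vector<uint8_t> a(m * k, 1);
    std::vector<int8_t> b(k * n, 1);
    std::vector<int32_t> c(m * n, 0);
    const int32_t co = 0; // offsetc = 'F': a single offset added to all of C

    const int iters = 1000;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        dnnl_gemm_u8s8s32('N', 'N', 'F', m, n, k, 1.f, a.data(), k, 0,
                b.data(), n, 0, 0.f, c.data(), n, &co);
    const auto t1 = std::chrono::steady_clock::now();
    std::printf("avg time: %.3f ms\n",
            std::chrono::duration<double, std::milli>(t1 - t0).count() / iters);
    return 0;
}
```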
@ppetrushkov,
Thank you for the report. We are continuously improving the library to deliver the best possible performance. You may want to check the head of the rls-v1.2 branch, which has a fix improving the performance of small GEMMs in single-threaded mode.
@aaraujom, could you please look at the case above?
Hi @ppetrushkov,
You may want to check the head of rls-v1.2
Or maybe just master.
Could you please also specify what system you use? I ran u8s8s32 gemm on SKX 8180 and got 0.128833 ms, so I want to make sure that there is no issue on the build / environment side.
$ OMP_NUM_THREADS=1 ./benchdnn.icc190.orig --matmul --mode=P --cfg=u8s8s32 m16n768k768 m1024n768k768
Output template: perf,%engine%,%name%,%prb%,%Gops%,%Gfreq%,%-time%,%-Gflops%,%0time%,%0Gflops%
perf,cpu,,--matmul --cfg=u8s8s32 m16n768k768,0.0188744,0,0.128662,146.697,0.129629,145.603
perf,cpu,,--matmul --cfg=u8s8s32 m1024n768k768,1.20796,0,6.00977,200.999,6.02748,200.409
tests:2 passed:0 skipped:0 mistrusted:0 unimplemented:0 failed:0 listed:0
total perf: min(ms):6.13843 avg(ms):6.15711
Hi @ppetrushkov - The difference seems small. Is it possible to provide a reproducer?
Hi @emfomenk I think I tried with a master version some time ago and it didn't change. I'll try again and report back.
This particular test was on Intel(R) Xeon(R) Gold 6138, but I observed similar behaviour on other systems as well. I'm building DNNL with cmake -DCMAKE_BUILD_TYPE=Release (OpenMP and gcc 7.5.0). Should I expect any difference from using TBB and/or Intel compiler?
@aaraujom although the absolute numbers are small, the slowdown ranges between 15% and 30% for slightly different sizes. I don't have a minimal working example right now; I'll get back to you in a day or so.
@aaraujom @emfomenk I prepared a minimal example: https://gist.github.com/ppetrushkov/694bc7ec0f7663c63e067e9ecfdc7d99 It can be compiled with:
g++ -Wall -Wextra -I ${DNNL_INCLUDE_DIR} -I ${FBGEMM_INCLUDE_DIR} -L ${DNNL_LIB_DIR} -L ${FBGEMM_LIB_DIR} -o minimal_fbgemm minimal_fbgemm.cpp -ldnnl -lfbgemm -lcpuinfo -lclog -lpthread -lasmjit -lrt
I'm using master branches for both DNNL and FBGEMM. For sizes e.g. m=16 k=768 n=3072 I get:
DNNL Average time = 773.714
FBGEMM Average time = 544.524
While for m=1024 k=768 n=3072 I get:
DNNL Average time = 31185.3
FBGEMM Average time = 33785.7
One notable difference between DNNL and FBGEMM is that FBGEMM pre-packs the weight matrix and pre-computes the column offsets only once, while I assume DNNL does something similar on each run. However, this doesn't explain why DNNL actually becomes faster at some point.
Hi @ppetrushkov - Thank you very much for the reproducer. dnnl_gemm_u8s8s32 doesn't pre-pack matrices, so we pay the cost of packing on each call. We have a pack API that can pre-pack matrices, but it is used internally only. For larger problem sizes the packed data is reused across much more computation, which explains the performance gains there.
I will check the 16 x 768 x 768 problem size with the internal pack API and see whether it can be improved vs FBGEMM.
Thanks again for the reproducer.
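A rough back-of-envelope illustrating the amortization point (my own sketch, not library code): packing B touches on the order of K*N bytes per call regardless of M, while the GEMM itself performs roughly 2*M*K*N integer ops, so the relative packing overhead shrinks roughly as 1/M:

```cpp
// Relative per-call packing overhead vs. GEMM work for the sizes in this
// thread (illustrative only; constants and units are deliberately crude).
#include <cstdio>

int main() {
    const double k = 768, n = 3072;
    for (double m : {16.0, 1024.0}) {
        const double pack_traffic = k * n;        // s8 B panel repacked each call
        const double gemm_ops = 2.0 * m * k * n;  // multiply-adds counted as 2 ops
        std::printf("m=%6.0f: pack/compute ratio ~ %.5f\n", m, pack_traffic / gemm_ops);
    }
    return 0;
}
```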
@emfomenk, @aaraujom, do we expose the pack API via the matmul primitive?
Not yet. Same holds for Inner Product.
Closing as stale. The current oneDNN implementation provides weight pre-packing capability via the matmul primitive.
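For reference, a rough sketch of how weight pre-packing can be used through the matmul primitive (assuming a recent oneDNN with the v3.x-style C++ API; this is an illustration, not the exact code path discussed above). The weights descriptor uses format_tag::any so the library picks its preferred packed layout, the reorder is performed once, and the packed weights are reused across executions:

```cpp
// Sketch: pre-pack int8 weights once and reuse them across matmul executions.
#include <unordered_map>
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    const memory::dim M = 16, K = 768, N = 768;
    auto src_md = memory::desc({M, K}, memory::data_type::u8, memory::format_tag::ab);
    auto wei_any_md = memory::desc({K, N}, memory::data_type::s8, memory::format_tag::any);
    auto dst_md = memory::desc({M, N}, memory::data_type::s32, memory::format_tag::ab);

    // format_tag::any lets the implementation choose a packed weights layout.
    matmul::primitive_desc pd(eng, src_md, wei_any_md, dst_md);

    // Plain row-major weights as the application would hold them.
    auto wei_user_md = memory::desc({K, N}, memory::data_type::s8, memory::format_tag::ab);
    memory wei_user(wei_user_md, eng);
    memory wei_packed(pd.weights_desc(), eng);
    reorder(wei_user, wei_packed).execute(s, wei_user, wei_packed); // pack once

    memory src(src_md, eng), dst(dst_md, eng);
    matmul mm(pd);
    for (int i = 0; i < 100; ++i) // packed weights reused on every call
        mm.execute(s, {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei_packed},
                       {DNNL_ARG_DST, dst}});
    s.wait();
    return 0;
}
```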