Weight Decompression and BF16 performance comparison
Executed benchdnn (main branch) for MatMul weight decompression (src:bf16, wei:s4) and compared it against BF16 MatMul (src:bf16, wei:bf16) with different weight tags (ab, any) for some M, N, K dimensions, and observed the following behavior.
The performance of the weight-decompressed (s4) MatMul with tag::ab is:
- lower than BF16 MatMul (with tag::any)
- better than BF16 MatMul (with tag::ab)
Logs with different tags (tag::any and tag::ab):
- Weight decompression with tag::any is not performing well. Noticed that it executes with the reference implementation (ref:any):
onednn_verbose,v1,primitive,exec,cpu,matmul,ref:any,undef,src:bf16:a:blocked:ab::f0 wei:s4:a:blocked:ab::f0 dst:bf16:a:blocked:ab::f0,attr-fpmath:bf16:true attr-scales:wei:2:f32,,4x4096:4096x4096,1058.59
- Weight decompression with tag::ab:
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_bf16,undef,src:bf16:a:blocked:ab::f0 wei:s4::blocked:ab::f0 dst:bf16:a:blocked:ab::f0,attr-fpmath:bf16:true attr-scales:wei:2:f32,,4x4096:4096x4096,0.0419922
- BF16 MatMul (with tag::any). Noticed the blocked weight format (BA16a32b2a):
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_bf16,undef,src:bf16:a:blocked:ab::f0 wei:bf16:a:blocked:BA16a32b2a::f0 dst:bf16:a:blocked:ab::f0,attr-fpmath:bf16:true attr-scales:wei:2:f32,,4x4096:4096x4096,0.0129395
- BF16 MatMul (with tag::ab):
onednn_verbose,v1,primitive,exec,cpu,matmul,brg_matmul:avx512_core_bf16,undef,src:bf16:a:blocked:ab::f0 wei:bf16::blocked:ab::f0 dst:bf16:a:blocked:ab::f0,attr-fpmath:bf16:true attr-scales:wei:2:f32,,4x4096:4096x4096,0.0639648
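For reference, the verbose lines above come from oneDNN's verbose tracing; assuming the standard knob, they can be reproduced by setting the ONEDNN_VERBOSE environment variable around the benchdnn runs, e.g.:
ONEDNN_VERBOSE=1 ./benchdnn --matmul --fix-times-per-prb=100 --mode=p --dt=bf16:s4:bf16 --wtag=any --attr-fpmath=bf16:true --attr-scales=wei:per_oc:f32 4x4096:4096x4096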
Sample commands:
- Weight decompression:
./benchdnn --matmul --fix-times-per-prb=100 --mode=p --dt=bf16:s4:bf16 --wtag=ab --attr-fpmath=bf16:true --attr-scales=wei:per_oc:f32 4x4096:4096x4096
- BF16:
./benchdnn --matmul --fix-times-per-prb=100 --mode=p --dt=bf16:bf16:bf16 --wtag=any --attr-fpmath=bf16:true --attr-scales=wei:per_oc:f32 4x4096:4096x4096
Questions:
- Is there a way to execute weight decompression with tag::any that would be equivalent to BF16 (tag::any)?
- Are there other arguments that need to be added for weight decompression?
Hi @AyuBaiswar! I could replicate your findings. Please note that int4 weight decompression currently supports only plain and transposed layouts. Could you give us more context for these tests?
In addition to @raistefintel's comments:
tag::any support for weights is lacking because there are no blocked weight formats for int4. In practice, the application would already have a format specified for its int4 weights, either ab or ba, and that format must be forced for now. We could make tag::any initialize the weights to one of those formats, but it wouldn't add much for the application; it might only trigger an extra data conversion from one format to the other (which the library doesn't support for int4 either). So forcing a format is the right way to go for now; see the comparison sketch below.
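As a minimal benchdnn sketch of forcing a format (assuming benchdnn's usual comma-separated option lists), both supported int4 weight layouts can be compared in a single run:
./benchdnn --matmul --fix-times-per-prb=100 --mode=p --dt=bf16:s4:bf16 --wtag=ab,ba --attr-fpmath=bf16:true --attr-scales=wei:per_oc:f32 4x4096:4096x4096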
As for the perf data itself, I'd say the size you picked is too small to observe any benefit in the configuration you are benchmarking.
To simulate a mode where the weights come from memory, use the --cold-cache=wei knob in benchdnn; that will expose the difference in memory traffic.
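For example, reusing the problem from this issue (only --cold-cache=wei is added; everything else is unchanged):
./benchdnn --matmul --fix-times-per-prb=100 --mode=p --cold-cache=wei --dt=bf16:s4:bf16 --wtag=ab --attr-fpmath=bf16:true --attr-scales=wei:per_oc:f32 4x4096:4096x4096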
If you don't want to use the knob, increase the N dimension, e.g. by 10x, so that the weights are guaranteed not to fit in the socket's combined L2 + L3 caches.
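For example, scaling N from 4096 to 40960 makes the bf16 weights about 4096 x 40960 x 2 bytes, roughly 320 MB, which should exceed the combined L2 + L3 of a typical socket (the 10x factor follows the suggestion above):
./benchdnn --matmul --fix-times-per-prb=100 --mode=p --dt=bf16:s4:bf16 --wtag=ab --attr-fpmath=bf16:true --attr-scales=wei:per_oc:f32 4x4096:4096x40960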
I see avx512_core_bf16 in the log. Is it an SPR system?
Closing as stale. Please feel free to reopen it with additional info.