
cpu: ppc64: add gemm and reorder kernels

Open Tiwari-Avanish opened this issue 9 months ago • 11 comments

Description

Implemented a reorder from fp32 to u8 for matmul. Implemented a prepacking routine for the input and output that also supports u8 to s8 conversion, depending on the input data type. Implemented a gemm driver that runs the gemm kernel in parallel, based on the block sizes of matrices A and B. Improved the performance of the gemm kernel.
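For readers unfamiliar with u8 to s8 conversion in an integer gemm, here is a minimal Python sketch of one common approach: shift each u8 value by 128 so that it fits in s8, then add a compensation term so the dot product is unchanged. This only illustrates the general idea under that assumption; it is not taken from this PR, and the function names `u8_to_s8` and `dot_u8s8_via_s8s8` are hypothetical.

```python
def u8_to_s8(values):
    """Shift unsigned 8-bit values (0..255) into the signed range (-128..127)."""
    return [v - 128 for v in values]


def dot_u8s8_via_s8s8(a_u8, b_s8):
    """Compute sum(a * b) for u8 `a` and s8 `b` using only s8 x s8 products.

    Assumed identity: sum((a - 128) * b) + 128 * sum(b) == sum(a * b).
    """
    a_s8 = u8_to_s8(a_u8)
    shifted = sum(x * y for x, y in zip(a_s8, b_s8))
    compensation = 128 * sum(b_s8)  # undoes the -128 shift applied to `a`
    return shifted + compensation


# The shifted computation matches the direct u8 x s8 dot product.
a = [0, 255, 128]
b = [-3, 7, 5]
print(dot_u8s8_via_s8s8(a, b) == sum(x * y for x, y in zip(a, b)))
```

The compensation term depends only on `b`, so in a real kernel it could be computed once per column during packing rather than inside the inner loop.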

Tiwari-Avanish avatar Apr 24 '25 03:04 Tiwari-Avanish

Hi @spalicki, this took a while because I was seeing test case failures in test_gemm_u8s8s32 related to alpha and beta when they were floating point. That is now fixed, and I have added the changes to this PR.

As you asked, I have collected performance data with oneDNN benchdnn.

./matmul_perf_cpp:

With the changes: 2.20503 TOp/s Example passed on CPU.


Without these changes: 1.5468 TOp/s Example passed on CPU


Other benchdnn cases I have run:

With the PR changes:

./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --wtag=any 8192x8192:8192x8192

0:PASSED (3828 ms) __REPRO: --matmul --dt=u8:s8:u8 8192x8192:8192x8192
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 3.83s; create_pd: 0.00s (0%); create_prim: 0.00s (0%); fill: 1.72s (45%); execute: 0.36s (9%); compute_ref: 1.36s (35%); compare: 0.38s (10%);



./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --wtag=any 4096x4096:4096x4096

total: 0.75s; create_pd: 0.00s (0%); create_prim: 0.00s (0%); fill: 0.39s (52%); execute: 0.05s (7%); compute_ref: 0.21s (27%); compare: 0.09s (12%);

Without the PR changes:

./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --wtag=any 8192x8192:8192x8192

0:PASSED (3901 ms) __REPRO: --matmul --dt=u8:s8:u8 8192x8192:8192x8192
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 3.90s; create_pd: 0.00s (0%); create_prim: 0.00s (0%); fill: 1.56s (40%); execute: 0.69s (18%); compute_ref: 1.36s (35%); compare: 0.28s (7%);


In both cases, the matmul execute time is lower with my changes than without them.


./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --wtag=any 4096x4096:4096x4096

0:PASSED (750 ms) __REPRO: --matmul --dt=u8:s8:u8 4096x4096:4096x4096
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.75s; create_pd: 0.00s (0%); create_prim: 0.00s (0%); fill: 0.37s (49%); execute: 0.11s (14%); compute_ref: 0.20s (27%); compare: 0.07s (9%);

Comparing the execute time of the gemm kernel in both cases, it is roughly 1.9x (8192x8192) to 2.2x (4096x4096) faster than the earlier oneDNN:

| Gemm output size | Execute time with my changes | Execute time without my changes |
|---|---|---|
| 8192x8192 | 0.36s | 0.69s |
| 4096x4096 | 0.05s | 0.11s |
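As a quick check of the ratios, the following Python sketch recomputes the speedups from the numbers quoted in this comment. The timings and throughputs are copied from the logs above; the `speedup` helper and the dictionary layout are just illustrative. Note that the execute times are rounded to two decimals in the benchdnn summary, so the ratios are approximate.

```python
# Execute times in seconds, copied from the benchdnn summaries quoted above.
execute_s = {
    "8192x8192": {"with_pr": 0.36, "without_pr": 0.69},
    "4096x4096": {"with_pr": 0.05, "without_pr": 0.11},
}

# Throughput in TOp/s reported by matmul_perf_cpp above.
throughput_tops = {"with_pr": 2.20503, "without_pr": 1.5468}


def speedup(baseline, improved):
    """Ratio of a baseline value to an improved value (for times: old / new)."""
    return baseline / improved


for shape, t in execute_s.items():
    print(f"{shape}: {speedup(t['without_pr'], t['with_pr']):.2f}x faster execute")

# For throughput, higher is better, so the ratio is new / old.
ratio = throughput_tops["with_pr"] / throughput_tops["without_pr"]
print(f"matmul_perf_cpp: {ratio:.2f}x throughput")
```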

I have also run this through PyTorch and vLLM, and it gives a performance boost there as well.

Please let me know if anything more is required.

Tiwari-Avanish avatar May 18 '25 18:05 Tiwari-Avanish

@Tiwari-Avanish Those 2 cases are not very representative of overall DL performance.

We have prepared some batch files with cases extracted directly from some models that would be a better way of measuring performance difference. If you take a look at the benchdnn/inputs directory you will see a lot of different files for most drivers, e.g. for matmul you should see:

spalicki@localhost:~/workspace/oneDNN/build$ ls ./tests/benchdnn/inputs/matmul/ | grep perf_matmul
perf_matmul_inference_batched
perf_matmul_inference_lb
perf_matmul_training

You can run them as e.g. ./tests/benchdnn/benchdnn --matmul --dt=u8:s8:u8 --mode=P --batch=./tests/benchdnn/inputs/matmul/perf_matmul_training. The --mode=P option is used for measuring performance (i.e. it skips correctness validation and such). Make sure that you use programs like numactl and parallel to run low-batch parallel workloads and get accurate data. It can be done in a script similar to the following (LB usually means large batch, to be run on a full socket; SB is small batch, to be run on a few cores):

#!/bin/bash
LB=("perf_matmul_inference_lb" "perf_matmul_training" "custom batch file for full socket testing")
SB=("harness_matmul_bert_inf_sb_int8" "custom batch file for limited core testing")

for CASE in "${LB[@]}"
do
    echo $CASE
    # Run each case on full socket 0
    OMP_PROC_BIND=spread OMP_PLACES=threads numactl --membind=0 --cpunodebind=0 benchdnn --mode=P --matmul --batch=inputs/matmul/${CASE} &> "${CASE}.csv"
done

for CASE in "${SB[@]}"
do
    echo $CASE
    # Run each case in parallel on a subset of available cores defined in cores_sb.txt file
    cat cores_sb.txt | parallel --colsep ' ' -j7 KMP_HW_SUBSET=1T OMP_PROC_BIND=close OMP_PLACES=threads numactl --membind={1} --physcpubind={2} ./tests/benchdnn/benchdnn --mode=P --matmul --batch=./tests/benchdnn/inputs/matmul/${CASE} "&>" ${CASE}_{2}.csv
done

where cores_sb.txt is a file with a list of sockets and cores on those sockets to be used by each instance:

0 0-3
0 4-7
0 8-11
0 12-15
0 16-19
0 20-23
0 24-27
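If you prefer to generate this file rather than type it by hand, a small Python sketch like the one below produces the same lines. The function name `core_ranges` and its parameters are hypothetical; adjust the socket number, core counts and number of instances to match your machine.

```python
def core_ranges(socket, first_core, cores_per_instance, instances):
    """Return one 'socket first-last' line per benchdnn instance."""
    lines = []
    for i in range(instances):
        start = first_core + i * cores_per_instance
        end = start + cores_per_instance - 1
        lines.append(f"{socket} {start}-{end}")
    return lines


# Seven instances of four cores each on socket 0, matching the listing above.
print("\n".join(core_ranges(socket=0, first_core=0, cores_per_instance=4, instances=7)))
```

Redirect the output to cores_sb.txt to use it with the script above.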

The script will give you CSV files that you can load into a spreadsheet or pandas and then easily compare with each other. This script can be compressed to several lines if using tools like parallel, but it is easier to explain what I mean this way. More on the subject here: dev_guide_performance_settings.
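A comparison of two such result files could be sketched in plain Python as follows. This assumes each result line starts with `perf,` and is comma-separated, and that you know which column holds the problem description and which holds the time you care about. The column indices, the helper names `parse_perf` and `compare`, and the sample strings are all placeholders, so check them against your own output before relying on them.

```python
def parse_perf(text, key_col, time_col):
    """Map a problem key to a time, using only lines that start with 'perf,'."""
    results = {}
    for line in text.splitlines():
        if not line.startswith("perf,"):
            continue  # skip summaries, warnings and other non-result lines
        fields = line.split(",")
        results[fields[key_col]] = float(fields[time_col])
    return results


def compare(baseline, candidate):
    """Return {key: baseline_time / candidate_time} for keys present in both."""
    return {k: baseline[k] / candidate[k] for k in baseline if k in candidate}


# Made-up sample data, only to show the shape of the comparison.
before = "perf,cpu,impl,case_a,2.0\nsome other line\nperf,cpu,impl,case_b,3.0"
after = "perf,cpu,impl,case_a,1.0\nperf,cpu,impl,case_b,1.5"

ratios = compare(parse_perf(before, 3, 4), parse_perf(after, 3, 4))
for case, ratio in ratios.items():
    print(f"{case}: {ratio:.2f}x")
```

A ratio above 1 means the candidate run was faster than the baseline for that case.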

Also, if you add -DDNNL_TEST_SET=NIGHTLY to your cmake command it will perform more thorough functional/correctness tests when running ctest, e.g. before pushing any code you can check it with:

cmake .. <flags you usually put in> -DDNNL_TEST_SET=NIGHTLY
cmake --build . --parallel
ctest

spalicki avatar May 19 '25 22:05 spalicki

Thanks @spalicki for giving me the steps to collect the perf data. This is my first time working with oneDNN, so it is taking some time. I will collect this data; I just need to figure out numactl on Power.

Tiwari-Avanish avatar May 20 '25 09:05 Tiwari-Avanish

Hi @spalicki,

I have run oneDNN as you described above. With the -DDNNL_TEST_SET=NIGHTLY argument, 6-7 test cases were failing, but they are now fixed and I have pushed those changes as well.

Tests Log:

Output logs of ctest
        Start   1: cpu-bnorm-u8-via-binary-postops-cpp
  1/335 Test   #1: cpu-bnorm-u8-via-binary-postops-cpp .....................   Passed    0.02 sec
        Start   2: cpu-cnn-inference-f32-c
  2/335 Test   #2: cpu-cnn-inference-f32-c .................................   Passed    0.03 sec
        Start   3: cpu-cnn-inference-f32-cpp
  3/335 Test   #3: cpu-cnn-inference-f32-cpp ...............................   Passed    1.31 sec
        Start   4: cpu-cnn-inference-int8-cpp
  4/335 Test   #4: cpu-cnn-inference-int8-cpp ..............................   Passed    0.06 sec
        Start   5: cpu-cnn-training-bf16-cpp
  5/335 Test   #5: cpu-cnn-training-bf16-cpp ...............................   Passed    0.05 sec
        Start   6: cpu-cnn-training-f32-cpp
  6/335 Test   #6: cpu-cnn-training-f32-cpp ................................   Passed    0.23 sec
        Start   7: cpu-cnn-training-f32-c
  7/335 Test   #7: cpu-cnn-training-f32-c ..................................   Passed    0.36 sec
        Start   8: cpu-matmul-coo-cpp
  8/335 Test   #8: cpu-matmul-coo-cpp ......................................   Passed    0.01 sec
        Start   9: cpu-matmul-csr-cpp
  9/335 Test   #9: cpu-matmul-csr-cpp ......................................   Passed    0.01 sec
        Start  10: cpu-matmul-weights-compression-cpp
 10/335 Test  #10: cpu-matmul-weights-compression-cpp ......................   Passed    0.04 sec
        Start  11: cpu-rnn-inference-f32-cpp
 11/335 Test  #11: cpu-rnn-inference-f32-cpp ...............................   Passed    0.33 sec
        Start  12: cpu-rnn-inference-int8-cpp
 12/335 Test  #12: cpu-rnn-inference-int8-cpp ..............................   Passed    0.01 sec
        Start  13: cpu-getting-started-cpp
 13/335 Test  #13: cpu-getting-started-cpp .................................   Passed    0.01 sec
        Start  14: cpu-graph-getting-started-cpp
 14/335 Test  #14: cpu-graph-getting-started-cpp ...........................   Passed    0.05 sec
        Start  15: cpu-graph-inference-int8-cpp
 15/335 Test  #15: cpu-graph-inference-int8-cpp ............................   Passed    0.04 sec
        Start  16: cpu-graph-single-op-partition-cpp
 16/335 Test  #16: cpu-graph-single-op-partition-cpp .......................   Passed    0.01 sec
        Start  17: cpu-graph-gated-mlp-cpp
 17/335 Test  #17: cpu-graph-gated-mlp-cpp .................................   Passed    1.60 sec
        Start  18: cpu-graph-gated-mlp-int4-cpp
 18/335 Test  #18: cpu-graph-gated-mlp-int4-cpp ............................   Passed    3.18 sec
        Start  19: cpu-graph-gated-mlp-wei-combined-cpp
 19/335 Test  #19: cpu-graph-gated-mlp-wei-combined-cpp ....................   Passed    1.40 sec
        Start  20: cpu-graph-gqa-cpp
 20/335 Test  #20: cpu-graph-gqa-cpp .......................................   Passed    3.35 sec
        Start  21: cpu-graph-mqa-cpp
 21/335 Test  #21: cpu-graph-mqa-cpp .......................................   Passed    3.39 sec
        Start  22: cpu-graph-sdpa-cpp
 22/335 Test  #22: cpu-graph-sdpa-cpp ......................................   Passed    6.58 sec
        Start  23: cpu-graph-sdpa-stacked-qkv-cpp
 23/335 Test  #23: cpu-graph-sdpa-stacked-qkv-cpp ..........................   Passed    3.81 sec
        Start  24: cpu-matmul-perf-cpp
 24/335 Test  #24: cpu-matmul-perf-cpp .....................................   Passed    7.56 sec
        Start  25: cpu-memory-format-propagation-cpp
 25/335 Test  #25: cpu-memory-format-propagation-cpp .......................   Passed    0.01 sec
        Start  26: cpu-performance-profiling-cpp
 26/335 Test  #26: cpu-performance-profiling-cpp ...........................   Passed    0.32 sec
        Start  27: cpu-primitives-augru-cpp
 27/335 Test  #27: cpu-primitives-augru-cpp ................................   Passed    0.05 sec
        Start  28: cpu-primitives-batch-normalization-cpp
 28/335 Test  #28: cpu-primitives-batch-normalization-cpp ..................   Passed    0.02 sec
        Start  29: cpu-primitives-binary-cpp
 29/335 Test  #29: cpu-primitives-binary-cpp ...............................   Passed    0.01 sec
        Start  30: cpu-primitives-concat-cpp
 30/335 Test  #30: cpu-primitives-concat-cpp ...............................   Passed    0.01 sec
        Start  31: cpu-primitives-convolution-cpp
 31/335 Test  #31: cpu-primitives-convolution-cpp ..........................   Passed    0.01 sec
        Start  32: cpu-primitives-deconvolution-cpp
 32/335 Test  #32: cpu-primitives-deconvolution-cpp ........................   Passed    0.02 sec
        Start  33: cpu-primitives-eltwise-cpp
 33/335 Test  #33: cpu-primitives-eltwise-cpp ..............................   Passed    0.02 sec
        Start  34: cpu-primitives-group-normalization-cpp
 34/335 Test  #34: cpu-primitives-group-normalization-cpp ..................   Passed    0.31 sec
        Start  35: cpu-primitives-inner-product-cpp
 35/335 Test  #35: cpu-primitives-inner-product-cpp ........................   Passed    0.21 sec
        Start  36: cpu-primitives-layer-normalization-cpp
 36/335 Test  #36: cpu-primitives-layer-normalization-cpp ..................   Passed    0.01 sec
        Start  37: cpu-primitives-lbr-gru-cpp
 37/335 Test  #37: cpu-primitives-lbr-gru-cpp ..............................   Passed    0.02 sec
        Start  38: cpu-primitives-lrn-cpp
 38/335 Test  #38: cpu-primitives-lrn-cpp ..................................   Passed    0.02 sec
        Start  39: cpu-primitives-lstm-cpp
 39/335 Test  #39: cpu-primitives-lstm-cpp .................................   Passed    0.08 sec
        Start  40: cpu-primitives-matmul-cpp
 40/335 Test  #40: cpu-primitives-matmul-cpp ...............................   Passed    0.02 sec
        Start  41: cpu-primitives-pooling-cpp
 41/335 Test  #41: cpu-primitives-pooling-cpp ..............................   Passed    0.01 sec
        Start  42: cpu-primitives-prelu-cpp
 42/335 Test  #42: cpu-primitives-prelu-cpp ................................   Passed    0.02 sec
        Start  43: cpu-primitives-reduction-cpp
 43/335 Test  #43: cpu-primitives-reduction-cpp ............................   Passed    0.02 sec
        Start  44: cpu-primitives-reorder-cpp
 44/335 Test  #44: cpu-primitives-reorder-cpp ..............................   Passed    0.02 sec
        Start  45: cpu-primitives-resampling-cpp
 45/335 Test  #45: cpu-primitives-resampling-cpp ...........................   Passed    0.02 sec
        Start  46: cpu-primitives-shuffle-cpp
 46/335 Test  #46: cpu-primitives-shuffle-cpp ..............................   Passed    0.15 sec
        Start  47: cpu-primitives-softmax-cpp
 47/335 Test  #47: cpu-primitives-softmax-cpp ..............................   Passed    0.01 sec
        Start  48: cpu-primitives-sum-cpp
 48/335 Test  #48: cpu-primitives-sum-cpp ..................................   Passed    0.02 sec
        Start  49: cpu-primitives-vanilla-rnn-cpp
 49/335 Test  #49: cpu-primitives-vanilla-rnn-cpp ..........................   Passed    0.02 sec
        Start  50: cpu-rnn-training-f32-cpp
 50/335 Test  #50: cpu-rnn-training-f32-cpp ................................   Passed    0.24 sec
        Start  51: cpu-tutorials-matmul-matmul-quantization-cpp
 51/335 Test  #51: cpu-tutorials-matmul-matmul-quantization-cpp ............   Passed    0.01 sec
        Start  52: cpu-tutorials-matmul-sgemm-and-matmul-cpp
 52/335 Test  #52: cpu-tutorials-matmul-sgemm-and-matmul-cpp ...............   Passed    0.01 sec
        Start  53: cpu-tutorials-matmul-inference-int8-matmul-cpp
 53/335 Test  #53: cpu-tutorials-matmul-inference-int8-matmul-cpp ..........   Passed    0.02 sec
        Start  54: cpu-tutorials-matmul-weights-decompression-matmul-cpp
 54/335 Test  #54: cpu-tutorials-matmul-weights-decompression-matmul-cpp ...   Passed    0.03 sec
        Start  55: api-c
 55/335 Test  #55: api-c ...................................................   Passed    0.22 sec
        Start  56: test_c_symbols-c
 56/335 Test  #56: test_c_symbols-c ........................................   Passed    0.01 sec
        Start  57: test_batch_normalization
 57/335 Test  #57: test_batch_normalization ................................   Passed    0.26 sec
        Start  58: test_binary
 58/335 Test  #58: test_binary .............................................   Passed    0.67 sec
        Start  59: test_concat
 59/335 Test  #59: test_concat .............................................   Passed    0.28 sec
        Start  60: test_concurrency
 60/335 Test  #60: test_concurrency ........................................   Passed    0.04 sec
        Start  61: test_convolution_backward_data_f32
 61/335 Test  #61: test_convolution_backward_data_f32 ......................   Passed    2.40 sec
        Start  62: test_convolution_backward_weights_f32
 62/335 Test  #62: test_convolution_backward_weights_f32 ...................   Passed    5.45 sec
        Start  63: test_convolution_eltwise_forward_f32
 63/335 Test  #63: test_convolution_eltwise_forward_f32 ....................   Passed    2.79 sec
        Start  64: test_convolution_eltwise_forward_x8s8f32s32
 64/335 Test  #64: test_convolution_eltwise_forward_x8s8f32s32 .............   Passed    0.88 sec
        Start  65: test_convolution_forward_f32
 65/335 Test  #65: test_convolution_forward_f32 ............................   Passed    2.24 sec
        Start  66: test_convolution_forward_u8s8fp
 66/335 Test  #66: test_convolution_forward_u8s8fp .........................   Passed    0.09 sec
        Start  67: test_convolution_forward_u8s8s32
 67/335 Test  #67: test_convolution_forward_u8s8s32 ........................   Passed    0.09 sec
        Start  68: test_cross_engine_reorder
 68/335 Test  #68: test_cross_engine_reorder ...............................   Passed    0.01 sec
        Start  69: test_deconvolution
 69/335 Test  #69: test_deconvolution ......................................   Passed    0.32 sec
        Start  70: test_eltwise
 70/335 Test  #70: test_eltwise ............................................   Passed    0.38 sec
        Start  71: test_group_normalization
 71/335 Test  #71: test_group_normalization ................................   Passed    0.32 sec
        Start  72: test_iface_attr
 72/335 Test  #72: test_iface_attr .........................................   Passed    0.02 sec
        Start  73: test_iface_attr_quantization
 73/335 Test  #73: test_iface_attr_quantization ............................   Passed    0.01 sec
        Start  74: test_iface_binary_bcast
 74/335 Test  #74: test_iface_binary_bcast .................................   Passed    0.01 sec
        Start  75: test_iface_handle
 75/335 Test  #75: test_iface_handle .......................................   Passed    0.01 sec
        Start  76: test_iface_pd
 76/335 Test  #76: test_iface_pd ...........................................   Passed    0.01 sec
        Start  77: test_iface_pd_iter
 77/335 Test  #77: test_iface_pd_iter ......................................   Passed    0.01 sec
        Start  78: test_iface_primitive_cache
 78/335 Test  #78: test_iface_primitive_cache ..............................   Passed    0.01 sec
        Start  79: test_iface_runtime_dims
 79/335 Test  #79: test_iface_runtime_dims .................................   Passed    0.01 sec
        Start  80: test_iface_sparse
 80/335 Test  #80: test_iface_sparse .......................................   Passed    0.01 sec
        Start  81: test_iface_weights_format
 81/335 Test  #81: test_iface_weights_format ...............................   Passed    0.01 sec
        Start  82: test_iface_wino_convolution
 82/335 Test  #82: test_iface_wino_convolution .............................   Passed    0.01 sec
        Start  83: test_inner_product_backward_data
 83/335 Test  #83: test_inner_product_backward_data ........................   Passed    0.21 sec
        Start  84: test_inner_product_backward_weights
 84/335 Test  #84: test_inner_product_backward_weights .....................   Passed    0.33 sec
        Start  85: test_inner_product_forward
 85/335 Test  #85: test_inner_product_forward ..............................   Passed    0.40 sec
        Start  86: test_layer_normalization
 86/335 Test  #86: test_layer_normalization ................................   Passed    0.26 sec
        Start  87: test_lrn
 87/335 Test  #87: test_lrn ................................................   Passed    0.82 sec
        Start  88: test_matmul
 88/335 Test  #88: test_matmul .............................................   Passed    0.05 sec
        Start  89: test_persistent_cache_api
 89/335 Test  #89: test_persistent_cache_api ...............................   Passed    0.01 sec
        Start  90: test_pooling_backward
 90/335 Test  #90: test_pooling_backward ...................................   Passed    2.90 sec
        Start  91: test_pooling_forward
 91/335 Test  #91: test_pooling_forward ....................................   Passed    4.42 sec
        Start  92: test_prelu
 92/335 Test  #92: test_prelu ..............................................   Passed    0.12 sec
        Start  93: test_primitive_cache_mt
 93/335 Test  #93: test_primitive_cache_mt .................................   Passed    0.01 sec
        Start  94: test_reduction
 94/335 Test  #94: test_reduction ..........................................   Passed    0.05 sec
        Start  95: test_reorder
 95/335 Test  #95: test_reorder ............................................   Passed    1.04 sec
        Start  96: test_resampling
 96/335 Test  #96: test_resampling .........................................   Passed    0.19 sec
        Start  97: test_rnn_forward
 97/335 Test  #97: test_rnn_forward ........................................   Passed    0.34 sec
        Start  98: test_shuffle
 98/335 Test  #98: test_shuffle ............................................   Passed    0.13 sec
        Start  99: test_softmax
 99/335 Test  #99: test_softmax ............................................   Passed    0.15 sec
        Start 100: test_sum
100/335 Test #100: test_sum ................................................   Passed    0.68 sec
        Start 101: test_convolution_format_any
101/335 Test #101: test_convolution_format_any .............................   Passed    0.01 sec
        Start 102: test_gemm_bf16bf16bf16
102/335 Test #102: test_gemm_bf16bf16bf16 ..................................   Passed    0.01 sec
        Start 103: test_gemm_bf16bf16f32
103/335 Test #103: test_gemm_bf16bf16f32 ...................................   Passed    0.01 sec
        Start 104: test_gemm_f16
104/335 Test #104: test_gemm_f16 ...........................................   Passed    0.01 sec
        Start 105: test_gemm_f16f16f32
105/335 Test #105: test_gemm_f16f16f32 .....................................   Passed    0.01 sec
        Start 106: test_gemm_f32
106/335 Test #106: test_gemm_f32 ...........................................   Passed    0.85 sec
        Start 107: test_gemm_s8s8s32
107/335 Test #107: test_gemm_s8s8s32 .......................................   Passed    0.40 sec
        Start 108: test_gemm_s8u8s32
108/335 Test #108: test_gemm_s8u8s32 .......................................   Passed    0.01 sec
        Start 109: test_gemm_u8s8s32
109/335 Test #109: test_gemm_u8s8s32 .......................................   Passed    0.35 sec
        Start 110: test_gemm_u8u8s32
110/335 Test #110: test_gemm_u8u8s32 .......................................   Passed    0.02 sec
        Start 111: test_global_scratchpad
111/335 Test #111: test_global_scratchpad ..................................   Passed    0.01 sec
        Start 112: test_ip_formats
112/335 Test #112: test_ip_formats .........................................   Passed    1.69 sec
        Start 113: test_reorder_formats
113/335 Test #113: test_reorder_formats ....................................   Passed    1.70 sec
        Start 114: test_api
114/335 Test #114: test_api ................................................   Passed    0.09 sec
        Start 115: test_internals_env_vars_dnnl
115/335 Test #115: test_internals_env_vars_dnnl ............................   Passed    0.01 sec
        Start 116: test_internals_env_vars_onednn
116/335 Test #116: test_internals_env_vars_onednn ..........................   Passed    0.01 sec
        Start 117: test_internals
117/335 Test #117: test_internals ..........................................   Passed    0.04 sec
        Start 118: test_regression
118/335 Test #118: test_regression .........................................   Passed    0.01 sec
        Start 119: test_graph_c_api_add_op_cpu
119/335 Test #119: test_graph_c_api_add_op_cpu .............................   Passed    0.01 sec
        Start 120: test_graph_c_api_constant_cache_cpu
120/335 Test #120: test_graph_c_api_constant_cache_cpu .....................   Passed    0.01 sec
        Start 121: test_graph_c_api_filter_cpu
121/335 Test #121: test_graph_c_api_filter_cpu .............................   Passed    0.01 sec
        Start 122: test_graph_c_api_graph_cpu
122/335 Test #122: test_graph_c_api_graph_cpu ..............................   Passed    0.01 sec
        Start 123: test_graph_c_api_logical_tensor_cpu
123/335 Test #123: test_graph_c_api_logical_tensor_cpu .....................   Passed    0.01 sec
        Start 124: test_graph_c_api_op_cpu
124/335 Test #124: test_graph_c_api_op_cpu .................................   Passed    0.01 sec
        Start 125: test_graph_cpp_api_constant_cache_cpu
125/335 Test #125: test_graph_cpp_api_constant_cache_cpu ...................   Passed    0.01 sec
        Start 126: test_graph_cpp_api_engine_cpu
126/335 Test #126: test_graph_cpp_api_engine_cpu ...........................   Passed    0.02 sec
        Start 127: test_graph_cpp_api_graph_cpu
127/335 Test #127: test_graph_cpp_api_graph_cpu ............................   Passed    0.01 sec
        Start 128: test_graph_cpp_api_logical_tensor_cpu
128/335 Test #128: test_graph_cpp_api_logical_tensor_cpu ...................   Passed    0.01 sec
        Start 129: test_graph_cpp_api_op_cpu
129/335 Test #129: test_graph_cpp_api_op_cpu ...............................   Passed    0.01 sec
        Start 130: test_graph_cpp_api_tensor_cpu
130/335 Test #130: test_graph_cpp_api_tensor_cpu ...........................   Passed    0.01 sec
        Start 131: test_graph_c_api_compile_cpu
131/335 Test #131: test_graph_c_api_compile_cpu ............................   Passed    0.01 sec
        Start 132: test_graph_c_api_compile_parametrized_cpu
132/335 Test #132: test_graph_c_api_compile_parametrized_cpu ...............   Passed    0.01 sec
        Start 133: test_graph_cpp_api_compile_cpu
133/335 Test #133: test_graph_cpp_api_compile_cpu ..........................   Passed    0.01 sec
        Start 134: test_graph_cpp_api_partition_cpu
134/335 Test #134: test_graph_cpp_api_partition_cpu ........................   Passed    0.01 sec
        Start 135: test_graph_unit_interface_allocator_cpu
135/335 Test #135: test_graph_unit_interface_allocator_cpu .................   Passed    0.01 sec
        Start 136: test_graph_unit_interface_compiled_partition_cpu
136/335 Test #136: test_graph_unit_interface_compiled_partition_cpu ........   Passed    0.01 sec
        Start 137: test_graph_unit_interface_partition_hashing_cpu
137/335 Test #137: test_graph_unit_interface_partition_hashing_cpu .........   Passed    0.01 sec
        Start 138: test_graph_unit_interface_tensor_cpu
138/335 Test #138: test_graph_unit_interface_tensor_cpu ....................   Passed    0.01 sec
        Start 139: test_graph_unit_interface_backend_cpu
139/335 Test #139: test_graph_unit_interface_backend_cpu ...................   Passed    0.01 sec
        Start 140: test_graph_unit_interface_graph_cpu
140/335 Test #140: test_graph_unit_interface_graph_cpu .....................   Passed    0.01 sec
        Start 141: test_graph_unit_interface_logical_tensor_cpu
141/335 Test #141: test_graph_unit_interface_logical_tensor_cpu ............   Passed    0.01 sec
        Start 142: test_graph_unit_interface_op_cpu
142/335 Test #142: test_graph_unit_interface_op_cpu ........................   Passed    0.01 sec
        Start 143: test_graph_unit_interface_op_def_constraint_cpu
143/335 Test #143: test_graph_unit_interface_op_def_constraint_cpu .........   Passed    0.01 sec
        Start 144: test_graph_unit_interface_op_schema_cpu
144/335 Test #144: test_graph_unit_interface_op_schema_cpu .................   Passed    0.01 sec
        Start 145: test_graph_unit_interface_shape_infer_cpu
145/335 Test #145: test_graph_unit_interface_shape_infer_cpu ...............   Passed    0.01 sec
        Start 146: test_graph_unit_interface_value_cpu
146/335 Test #146: test_graph_unit_interface_value_cpu .....................   Passed    0.01 sec
        Start 147: test_graph_unit_fake_cpu
147/335 Test #147: test_graph_unit_fake_cpu ................................   Passed    0.01 sec
        Start 148: test_graph_unit_dnnl_dnnl_infer_shape_cpu
148/335 Test #148: test_graph_unit_dnnl_dnnl_infer_shape_cpu ...............   Passed    0.01 sec
        Start 149: test_graph_unit_dnnl_dnnl_utils_cpu
149/335 Test #149: test_graph_unit_dnnl_dnnl_utils_cpu .....................   Passed    0.01 sec
        Start 150: test_graph_unit_dnnl_fusion_info_cpu
150/335 Test #150: test_graph_unit_dnnl_fusion_info_cpu ....................   Passed    0.01 sec
        Start 151: test_graph_unit_dnnl_graph_cpu
151/335 Test #151: test_graph_unit_dnnl_graph_cpu ..........................   Passed    0.01 sec
        Start 152: test_graph_unit_dnnl_insert_ops_cpu
152/335 Test #152: test_graph_unit_dnnl_insert_ops_cpu .....................   Passed    0.01 sec
        Start 153: test_graph_unit_dnnl_internal_attrs_cpu
153/335 Test #153: test_graph_unit_dnnl_internal_attrs_cpu .................   Passed    0.01 sec
        Start 154: test_graph_unit_dnnl_layout_id_cpu
154/335 Test #154: test_graph_unit_dnnl_layout_id_cpu ......................   Passed    0.01 sec
        Start 155: test_graph_unit_dnnl_layout_propagator_cpu
155/335 Test #155: test_graph_unit_dnnl_layout_propagator_cpu ..............   Passed    0.01 sec
        Start 156: test_graph_unit_dnnl_logical_tensor_cpu
156/335 Test #156: test_graph_unit_dnnl_logical_tensor_cpu .................   Passed    0.01 sec
        Start 157: test_graph_unit_dnnl_memory_planning_cpu
157/335 Test #157: test_graph_unit_dnnl_memory_planning_cpu ................   Passed    0.01 sec
        Start 158: test_graph_unit_dnnl_op_schema_cpu
158/335 Test #158: test_graph_unit_dnnl_op_schema_cpu ......................   Passed    0.01 sec
        Start 159: test_graph_unit_dnnl_partition_cpu
159/335 Test #159: test_graph_unit_dnnl_partition_cpu ......................   Passed    0.01 sec
        Start 160: test_graph_unit_dnnl_thread_local_cache_cpu
160/335 Test #160: test_graph_unit_dnnl_thread_local_cache_cpu .............   Passed    0.01 sec
        Start 161: test_graph_unit_dnnl_batch_norm_cpu
161/335 Test #161: test_graph_unit_dnnl_batch_norm_cpu .....................   Passed    0.13 sec
        Start 162: test_graph_unit_dnnl_binary_op_cpu
162/335 Test #162: test_graph_unit_dnnl_binary_op_cpu ......................   Passed    0.18 sec
        Start 163: test_graph_unit_dnnl_bmm_cpu
163/335 Test #163: test_graph_unit_dnnl_bmm_cpu ............................   Passed    0.11 sec
        Start 164: test_graph_unit_dnnl_common_cpu
164/335 Test #164: test_graph_unit_dnnl_common_cpu .........................   Passed    0.01 sec
        Start 165: test_graph_unit_dnnl_compiled_partition_cpu
165/335 Test #165: test_graph_unit_dnnl_compiled_partition_cpu .............   Passed    0.02 sec
        Start 166: test_graph_unit_dnnl_concat_cpu
166/335 Test #166: test_graph_unit_dnnl_concat_cpu .........................   Passed    0.10 sec
        Start 167: test_graph_unit_dnnl_constant_cache_cpu
167/335 Test #167: test_graph_unit_dnnl_constant_cache_cpu .................   Passed    0.01 sec
        Start 168: test_graph_unit_dnnl_convolution_cpu
168/335 Test #168: test_graph_unit_dnnl_convolution_cpu ....................   Passed    1.60 sec
        Start 169: test_graph_unit_dnnl_convtranspose_cpu
169/335 Test #169: test_graph_unit_dnnl_convtranspose_cpu ..................   Passed    1.41 sec
        Start 170: test_graph_unit_dnnl_dequantize_cpu
170/335 Test #170: test_graph_unit_dnnl_dequantize_cpu .....................   Passed    0.03 sec
        Start 171: test_graph_unit_dnnl_eltwise_cpu
171/335 Test #171: test_graph_unit_dnnl_eltwise_cpu ........................   Passed    0.11 sec
        Start 172: test_graph_unit_dnnl_group_norm_cpu
172/335 Test #172: test_graph_unit_dnnl_group_norm_cpu .....................   Passed    0.02 sec
        Start 173: test_graph_unit_dnnl_interpolate_cpu
173/335 Test #173: test_graph_unit_dnnl_interpolate_cpu ....................   Passed    0.05 sec
        Start 174: test_graph_unit_dnnl_large_partition_cpu
174/335 Test #174: test_graph_unit_dnnl_large_partition_cpu ................   Passed    0.36 sec
        Start 175: test_graph_unit_dnnl_layer_norm_cpu
175/335 Test #175: test_graph_unit_dnnl_layer_norm_cpu .....................   Passed    0.03 sec
        Start 176: test_graph_unit_dnnl_matmul_cpu
176/335 Test #176: test_graph_unit_dnnl_matmul_cpu .........................   Passed    3.65 sec
        Start 177: test_graph_unit_dnnl_mqa_decomp_cpu
177/335 Test #177: test_graph_unit_dnnl_mqa_decomp_cpu .....................   Passed   42.63 sec
        Start 178: test_graph_unit_dnnl_op_executable_cpu
178/335 Test #178: test_graph_unit_dnnl_op_executable_cpu ..................   Passed    0.01 sec
        Start 179: test_graph_unit_dnnl_pass_cpu
179/335 Test #179: test_graph_unit_dnnl_pass_cpu ...........................   Passed    0.05 sec
        Start 180: test_graph_unit_dnnl_pool_cpu
180/335 Test #180: test_graph_unit_dnnl_pool_cpu ...........................   Passed    0.28 sec
        Start 181: test_graph_unit_dnnl_prelu_cpu
181/335 Test #181: test_graph_unit_dnnl_prelu_cpu ..........................   Passed    0.04 sec
        Start 182: test_graph_unit_dnnl_quantize_cpu
182/335 Test #182: test_graph_unit_dnnl_quantize_cpu .......................   Passed    0.03 sec
        Start 183: test_graph_unit_dnnl_reduce_cpu
183/335 Test #183: test_graph_unit_dnnl_reduce_cpu .........................   Passed    0.21 sec
        Start 184: test_graph_unit_dnnl_reorder_cpu
184/335 Test #184: test_graph_unit_dnnl_reorder_cpu ........................   Passed    0.02 sec
        Start 185: test_graph_unit_dnnl_scratchpad_cpu
185/335 Test #185: test_graph_unit_dnnl_scratchpad_cpu .....................   Passed    0.01 sec
        Start 186: test_graph_unit_dnnl_sdp_decomp_cpu
186/335 Test #186: test_graph_unit_dnnl_sdp_decomp_cpu .....................   Passed   15.21 sec
        Start 187: test_graph_unit_dnnl_select_cpu
187/335 Test #187: test_graph_unit_dnnl_select_cpu .........................   Passed    0.02 sec
        Start 188: test_graph_unit_dnnl_softmax_cpu
188/335 Test #188: test_graph_unit_dnnl_softmax_cpu ........................   Passed    0.02 sec
        Start 189: test_graph_unit_dnnl_subgraph_pass_cpu
189/335 Test #189: test_graph_unit_dnnl_subgraph_pass_cpu ..................   Passed    0.11 sec
        Start 190: test_graph_unit_dnnl_typecast_cpu
190/335 Test #190: test_graph_unit_dnnl_typecast_cpu .......................   Passed    0.02 sec
        Start 191: test_graph_unit_utils_allocator_cpu
191/335 Test #191: test_graph_unit_utils_allocator_cpu .....................   Passed    0.01 sec
        Start 192: test_graph_unit_utils_attribute_value_cpu
192/335 Test #192: test_graph_unit_utils_attribute_value_cpu ...............   Passed    0.01 sec
        Start 193: test_graph_unit_utils_debug_cpu
193/335 Test #193: test_graph_unit_utils_debug_cpu .........................   Passed    0.01 sec
        Start 194: test_graph_unit_utils_json_cpu
194/335 Test #194: test_graph_unit_utils_json_cpu ..........................   Passed    0.01 sec
        Start 195: test_graph_unit_utils_pattern_matcher_cpu
195/335 Test #195: test_graph_unit_utils_pattern_matcher_cpu ...............   Passed    0.01 sec
        Start 196: test_graph_unit_utils_utils_cpu
196/335 Test #196: test_graph_unit_utils_utils_cpu .........................   Passed    0.01 sec
        Start 197: test_benchdnn_modeC_binary_all_cpu
197/335 Test #197: test_benchdnn_modeC_binary_all_cpu ......................   Passed    8.11 sec
        Start 198: test_benchdnn_modeC_binary_bfloat16_cpu
198/335 Test #198: test_benchdnn_modeC_binary_bfloat16_cpu .................   Passed    0.14 sec
        Start 199: test_benchdnn_modeC_binary_float16_cpu
199/335 Test #199: test_benchdnn_modeC_binary_float16_cpu ..................   Passed    0.10 sec
        Start 200: test_benchdnn_modeC_bnorm_all_blocked_cpu
200/335 Test #200: test_benchdnn_modeC_bnorm_all_blocked_cpu ...............   Passed    0.10 sec
        Start 201: test_benchdnn_modeC_bnorm_all_plain_cpu
201/335 Test #201: test_benchdnn_modeC_bnorm_all_plain_cpu .................   Passed   57.50 sec
        Start 202: test_benchdnn_modeC_bnorm_bfloat16_blocked_cpu
202/335 Test #202: test_benchdnn_modeC_bnorm_bfloat16_blocked_cpu ..........   Passed    0.05 sec
        Start 203: test_benchdnn_modeC_bnorm_bfloat16_plain_cpu
203/335 Test #203: test_benchdnn_modeC_bnorm_bfloat16_plain_cpu ............   Passed    0.05 sec
        Start 204: test_benchdnn_modeC_bnorm_float16_plain_cpu
204/335 Test #204: test_benchdnn_modeC_bnorm_float16_plain_cpu .............   Passed    0.05 sec
        Start 205: test_benchdnn_modeC_bnorm_regressions_cpu
205/335 Test #205: test_benchdnn_modeC_bnorm_regressions_cpu ...............   Passed   43.51 sec
        Start 206: test_benchdnn_modeC_bnorm_regressions_large_cpu
206/335 Test #206: test_benchdnn_modeC_bnorm_regressions_large_cpu .........   Passed  110.65 sec
        Start 207: test_benchdnn_modeC_brgemm_bf16_cpu
207/335 Test #207: test_benchdnn_modeC_brgemm_bf16_cpu .....................   Passed    0.01 sec
        Start 208: test_benchdnn_modeC_brgemm_f16_cpu
208/335 Test #208: test_benchdnn_modeC_brgemm_f16_cpu ......................   Passed    0.01 sec
        Start 209: test_benchdnn_modeC_brgemm_f32_cpu
209/335 Test #209: test_benchdnn_modeC_brgemm_f32_cpu ......................   Passed    0.01 sec
        Start 210: test_benchdnn_modeC_brgemm_f8_cpu
210/335 Test #210: test_benchdnn_modeC_brgemm_f8_cpu .......................   Passed    0.01 sec
        Start 211: test_benchdnn_modeC_brgemm_int8_cpu
211/335 Test #211: test_benchdnn_modeC_brgemm_int8_cpu .....................   Passed    0.01 sec
        Start 212: test_benchdnn_modeC_brgemm_regression_cpu
212/335 Test #212: test_benchdnn_modeC_brgemm_regression_cpu ...............   Passed    0.01 sec
        Start 213: test_benchdnn_modeC_concat_all_cpu
213/335 Test #213: test_benchdnn_modeC_concat_all_cpu ......................   Passed    2.05 sec
        Start 214: test_benchdnn_modeC_concat_bfloat16_cpu
214/335 Test #214: test_benchdnn_modeC_concat_bfloat16_cpu .................   Passed    0.22 sec
        Start 215: test_benchdnn_modeC_concat_float16_cpu
215/335 Test #215: test_benchdnn_modeC_concat_float16_cpu ..................   Passed    0.25 sec
        Start 216: test_benchdnn_modeC_conv_3d_cpu
216/335 Test #216: test_benchdnn_modeC_conv_3d_cpu .........................   Passed    0.40 sec
        Start 217: test_benchdnn_modeC_conv_3d_f32_plain_cpu
217/335 Test #217: test_benchdnn_modeC_conv_3d_f32_plain_cpu ...............   Passed    0.54 sec
        Start 218: test_benchdnn_modeC_conv_all_topologies_cpu
218/335 Test #218: test_benchdnn_modeC_conv_all_topologies_cpu .............   Passed    1.45 sec
        Start 219: test_benchdnn_modeC_conv_all_topologies_f32_plain_cpu
219/335 Test #219: test_benchdnn_modeC_conv_all_topologies_f32_plain_cpu ...   Passed    2.84 sec
        Start 220: test_benchdnn_modeC_conv_attrs_cpu
220/335 Test #220: test_benchdnn_modeC_conv_attrs_cpu ......................   Passed   40.76 sec
        Start 221: test_benchdnn_modeC_conv_attrs_f32_plain_cpu
221/335 Test #221: test_benchdnn_modeC_conv_attrs_f32_plain_cpu ............   Passed    0.29 sec
        Start 222: test_benchdnn_modeC_conv_bfloat16_cpu
222/335 Test #222: test_benchdnn_modeC_conv_bfloat16_cpu ...................   Passed    1.23 sec
        Start 223: test_benchdnn_modeC_conv_bfloat16_nxc_cpu
223/335 Test #223: test_benchdnn_modeC_conv_bfloat16_nxc_cpu ...............   Passed    1.60 sec
        Start 224: test_benchdnn_modeC_conv_bfloat16_ymm_cpu
224/335 Test #224: test_benchdnn_modeC_conv_bfloat16_ymm_cpu ...............   Passed    0.33 sec
        Start 225: test_benchdnn_modeC_conv_depthwise_cpu
225/335 Test #225: test_benchdnn_modeC_conv_depthwise_cpu ..................   Passed    3.81 sec
        Start 226: test_benchdnn_modeC_conv_dilated_cpu
226/335 Test #226: test_benchdnn_modeC_conv_dilated_cpu ....................   Passed    9.00 sec
        Start 227: test_benchdnn_modeC_conv_dilated_f32_plain_cpu
227/335 Test #227: test_benchdnn_modeC_conv_dilated_f32_plain_cpu ..........   Passed    2.87 sec
        Start 228: test_benchdnn_modeC_conv_dt_cpu
228/335 Test #228: test_benchdnn_modeC_conv_dt_cpu .........................   Passed  294.57 sec
        Start 229: test_benchdnn_modeC_conv_dt_plain_cpu
229/335 Test #229: test_benchdnn_modeC_conv_dt_plain_cpu ...................   Passed    7.16 sec
        Start 230: test_benchdnn_modeC_conv_float16_nxc_cpu
230/335 Test #230: test_benchdnn_modeC_conv_float16_nxc_cpu ................   Passed    1.57 sec
        Start 231: test_benchdnn_modeC_conv_fp4_cpu
231/335 Test #231: test_benchdnn_modeC_conv_fp4_cpu ........................   Passed    0.02 sec
        Start 232: test_benchdnn_modeC_conv_fp8_nxc_cpu
232/335 Test #232: test_benchdnn_modeC_conv_fp8_nxc_cpu ....................   Passed    2.64 sec
        Start 233: test_benchdnn_modeC_conv_function_cpu
233/335 Test #233: test_benchdnn_modeC_conv_function_cpu ...................   Passed    8.01 sec
        Start 234: test_benchdnn_modeC_conv_gemm_bfloat16_cpu
234/335 Test #234: test_benchdnn_modeC_conv_gemm_bfloat16_cpu ..............   Passed    0.17 sec
        Start 235: test_benchdnn_modeC_conv_gemm_bfloat16_nxc_cpu
235/335 Test #235: test_benchdnn_modeC_conv_gemm_bfloat16_nxc_cpu ..........   Passed    0.17 sec
        Start 236: test_benchdnn_modeC_conv_gemm_dt_cpu
236/335 Test #236: test_benchdnn_modeC_conv_gemm_dt_cpu ....................   Passed   42.18 sec
        Start 237: test_benchdnn_modeC_conv_gemm_dt_nxc_cpu
237/335 Test #237: test_benchdnn_modeC_conv_gemm_dt_nxc_cpu ................   Passed  378.95 sec
        Start 238: test_benchdnn_modeC_conv_gemm_int8_cpu
238/335 Test #238: test_benchdnn_modeC_conv_gemm_int8_cpu ..................   Passed    0.24 sec
        Start 239: test_benchdnn_modeC_conv_int8_cpu
239/335 Test #239: test_benchdnn_modeC_conv_int8_cpu .......................   Passed  628.93 sec
        Start 240: test_benchdnn_modeC_conv_regression_cpu
240/335 Test #240: test_benchdnn_modeC_conv_regression_cpu .................   Passed    8.84 sec
        Start 241: test_benchdnn_modeC_conv_wino_f32_cpu
241/335 Test #241: test_benchdnn_modeC_conv_wino_f32_cpu ...................   Passed    0.67 sec
        Start 242: test_benchdnn_modeC_deconv_all_cpu
242/335 Test #242: test_benchdnn_modeC_deconv_all_cpu ......................   Passed    1.18 sec
        Start 243: test_benchdnn_modeC_deconv_all_f32_nxc_cpu
243/335 Test #243: test_benchdnn_modeC_deconv_all_f32_nxc_cpu ..............   Passed    0.19 sec
        Start 244: test_benchdnn_modeC_deconv_bfloat16_cpu
244/335 Test #244: test_benchdnn_modeC_deconv_bfloat16_cpu .................   Passed    0.27 sec
        Start 245: test_benchdnn_modeC_deconv_bfloat16_nxc_cpu
245/335 Test #245: test_benchdnn_modeC_deconv_bfloat16_nxc_cpu .............   Passed    0.32 sec
        Start 246: test_benchdnn_modeC_deconv_bfloat16_ymm_cpu
246/335 Test #246: test_benchdnn_modeC_deconv_bfloat16_ymm_cpu .............   Passed    0.27 sec
        Start 247: test_benchdnn_modeC_deconv_float16_nxc_cpu
247/335 Test #247: test_benchdnn_modeC_deconv_float16_nxc_cpu ..............   Passed    0.30 sec
        Start 248: test_benchdnn_modeC_deconv_fp8_nxc_cpu
248/335 Test #248: test_benchdnn_modeC_deconv_fp8_nxc_cpu ..................   Passed    0.52 sec
        Start 249: test_benchdnn_modeC_deconv_int8_cpu
249/335 Test #249: test_benchdnn_modeC_deconv_int8_cpu .....................   Passed    0.48 sec
        Start 250: test_benchdnn_modeC_eltwise_all_cpu
250/335 Test #250: test_benchdnn_modeC_eltwise_all_cpu .....................   Passed    1.74 sec
        Start 251: test_benchdnn_modeC_eltwise_bfloat16_cpu
251/335 Test #251: test_benchdnn_modeC_eltwise_bfloat16_cpu ................   Passed    0.37 sec
        Start 252: test_benchdnn_modeC_eltwise_float16_cpu
252/335 Test #252: test_benchdnn_modeC_eltwise_float16_cpu .................   Passed    0.16 sec
        Start 253: test_benchdnn_modeC_eltwise_float8_cpu
253/335 Test #253: test_benchdnn_modeC_eltwise_float8_cpu ..................   Passed    0.30 sec
        Start 254: test_benchdnn_modeC_gnorm_all_cpu
254/335 Test #254: test_benchdnn_modeC_gnorm_all_cpu .......................   Passed   64.30 sec
        Start 255: test_benchdnn_modeC_graph_bf16_cpu
255/335 Test #255: test_benchdnn_modeC_graph_bf16_cpu ......................   Passed    0.23 sec
        Start 256: test_benchdnn_modeC_graph_f16_cpu
256/335 Test #256: test_benchdnn_modeC_graph_f16_cpu .......................   Passed    0.22 sec
        Start 257: test_benchdnn_modeC_graph_f32_cpu
257/335 Test #257: test_benchdnn_modeC_graph_f32_cpu .......................   Passed  199.10 sec
        Start 258: test_benchdnn_modeC_graph_f8_cpu
258/335 Test #258: test_benchdnn_modeC_graph_f8_cpu ........................   Passed    1.18 sec
        Start 259: test_benchdnn_modeC_graph_fusions_cpu
259/335 Test #259: test_benchdnn_modeC_graph_fusions_cpu ...................   Passed   59.04 sec
        Start 260: test_benchdnn_modeC_graph_int8_cpu
260/335 Test #260: test_benchdnn_modeC_graph_int8_cpu ......................   Passed   14.91 sec
        Start 261: test_benchdnn_modeC_ip_acl_cpu
261/335 Test #261: test_benchdnn_modeC_ip_acl_cpu ..........................   Passed    0.08 sec
        Start 262: test_benchdnn_modeC_ip_all_cpu
262/335 Test #262: test_benchdnn_modeC_ip_all_cpu ..........................   Passed   91.62 sec
        Start 263: test_benchdnn_modeC_ip_bf32_bfloat16_cpu
263/335 Test #263: test_benchdnn_modeC_ip_bf32_bfloat16_cpu ................   Passed    0.16 sec
        Start 264: test_benchdnn_modeC_ip_bfloat16_cpu
264/335 Test #264: test_benchdnn_modeC_ip_bfloat16_cpu .....................   Passed    0.13 sec
        Start 265: test_benchdnn_modeC_ip_bfloat16_ymm_cpu
265/335 Test #265: test_benchdnn_modeC_ip_bfloat16_ymm_cpu .................   Passed    0.13 sec
        Start 266: test_benchdnn_modeC_ip_float16_cpu
266/335 Test #266: test_benchdnn_modeC_ip_float16_cpu ......................   Passed    0.13 sec
        Start 267: test_benchdnn_modeC_ip_fp8_cpu
267/335 Test #267: test_benchdnn_modeC_ip_fp8_cpu ..........................   Passed    0.11 sec
        Start 268: test_benchdnn_modeC_ip_int8_cpu
268/335 Test #268: test_benchdnn_modeC_ip_int8_cpu .........................   Passed   73.15 sec
        Start 269: test_benchdnn_modeC_lnorm_all_cpu
269/335 Test #269: test_benchdnn_modeC_lnorm_all_cpu .......................   Passed  105.01 sec
        Start 270: test_benchdnn_modeC_lnorm_bfloat16_cpu
270/335 Test #270: test_benchdnn_modeC_lnorm_bfloat16_cpu ..................   Passed    0.08 sec
        Start 271: test_benchdnn_modeC_lnorm_float16_cpu
271/335 Test #271: test_benchdnn_modeC_lnorm_float16_cpu ...................   Passed    0.07 sec
        Start 272: test_benchdnn_modeC_lnorm_int8_cpu
272/335 Test #272: test_benchdnn_modeC_lnorm_int8_cpu ......................   Passed   90.44 sec
        Start 273: test_benchdnn_modeC_lrn_all_cpu
273/335 Test #273: test_benchdnn_modeC_lrn_all_cpu .........................   Passed   11.89 sec
        Start 274: test_benchdnn_modeC_lrn_bfloat16_cpu
274/335 Test #274: test_benchdnn_modeC_lrn_bfloat16_cpu ....................   Passed    0.03 sec
        Start 275: test_benchdnn_modeC_lrn_float16_cpu
275/335 Test #275: test_benchdnn_modeC_lrn_float16_cpu .....................   Passed    0.02 sec
        Start 276: test_benchdnn_modeC_matmul_all_cpu
276/335 Test #276: test_benchdnn_modeC_matmul_all_cpu ......................   Passed  213.88 sec
        Start 277: test_benchdnn_modeC_matmul_bf32_bf16_cpu
277/335 Test #277: test_benchdnn_modeC_matmul_bf32_bf16_cpu ................   Passed    0.24 sec
        Start 278: test_benchdnn_modeC_matmul_bfloat16_cpu
278/335 Test #278: test_benchdnn_modeC_matmul_bfloat16_cpu .................   Passed   21.44 sec
        Start 279: test_benchdnn_modeC_matmul_bfloat16_ymm_cpu
279/335 Test #279: test_benchdnn_modeC_matmul_bfloat16_ymm_cpu .............   Passed   21.90 sec
        Start 280: test_benchdnn_modeC_matmul_float16_cpu
280/335 Test #280: test_benchdnn_modeC_matmul_float16_cpu ..................   Passed    4.89 sec
        Start 281: test_benchdnn_modeC_matmul_fp4_cpu
281/335 Test #281: test_benchdnn_modeC_matmul_fp4_cpu ......................   Passed    0.04 sec
        Start 282: test_benchdnn_modeC_matmul_fp8_cpu
282/335 Test #282: test_benchdnn_modeC_matmul_fp8_cpu ......................   Passed    0.51 sec
        Start 283: test_benchdnn_modeC_matmul_int8_cpu
283/335 Test #283: test_benchdnn_modeC_matmul_int8_cpu .....................   Passed   72.42 sec
        Start 284: test_benchdnn_modeC_matmul_multidims_cpu
284/335 Test #284: test_benchdnn_modeC_matmul_multidims_cpu ................   Passed  149.46 sec
        Start 285: test_benchdnn_modeC_matmul_sparse_cpu
285/335 Test #285: test_benchdnn_modeC_matmul_sparse_cpu ...................   Passed  242.91 sec
        Start 286: test_benchdnn_modeC_pool_all_cpu
286/335 Test #286: test_benchdnn_modeC_pool_all_cpu ........................   Passed   67.40 sec
        Start 287: test_benchdnn_modeC_pool_bfloat16_cpu
287/335 Test #287: test_benchdnn_modeC_pool_bfloat16_cpu ...................   Passed    0.22 sec
        Start 288: test_benchdnn_modeC_pool_float16_cpu
288/335 Test #288: test_benchdnn_modeC_pool_float16_cpu ....................   Passed    0.17 sec
        Start 289: test_benchdnn_modeC_pool_fp8_cpu
289/335 Test #289: test_benchdnn_modeC_pool_fp8_cpu ........................   Passed    0.29 sec
        Start 290: test_benchdnn_modeC_prelu_all_cpu
290/335 Test #290: test_benchdnn_modeC_prelu_all_cpu .......................   Passed  285.45 sec
        Start 291: test_benchdnn_modeC_prelu_bfloat16_cpu
291/335 Test #291: test_benchdnn_modeC_prelu_bfloat16_cpu ..................   Passed    0.07 sec
        Start 292: test_benchdnn_modeC_prelu_float16_cpu
292/335 Test #292: test_benchdnn_modeC_prelu_float16_cpu ...................   Passed    0.07 sec
        Start 293: test_benchdnn_modeC_reduction_all_cpu
293/335 Test #293: test_benchdnn_modeC_reduction_all_cpu ...................   Passed   15.41 sec
        Start 294: test_benchdnn_modeC_reduction_bfloat16_cpu
294/335 Test #294: test_benchdnn_modeC_reduction_bfloat16_cpu ..............   Passed    0.11 sec
        Start 295: test_benchdnn_modeC_reduction_float16_cpu
295/335 Test #295: test_benchdnn_modeC_reduction_float16_cpu ...............   Passed    0.11 sec
        Start 296: test_benchdnn_modeC_reorder_all_cpu
296/335 Test #296: test_benchdnn_modeC_reorder_all_cpu .....................   Passed   17.59 sec
        Start 297: test_benchdnn_modeC_reorder_bfloat16_cpu
297/335 Test #297: test_benchdnn_modeC_reorder_bfloat16_cpu ................   Passed    0.29 sec
        Start 298: test_benchdnn_modeC_reorder_float16_cpu
298/335 Test #298: test_benchdnn_modeC_reorder_float16_cpu .................   Passed    0.48 sec
        Start 299: test_benchdnn_modeC_reorder_float8_cpu
299/335 Test #299: test_benchdnn_modeC_reorder_float8_cpu ..................   Passed    0.24 sec
        Start 300: test_benchdnn_modeC_reorder_fp4_cpu
300/335 Test #300: test_benchdnn_modeC_reorder_fp4_cpu .....................   Passed    0.25 sec
        Start 301: test_benchdnn_modeC_reorder_int4_cpu
301/335 Test #301: test_benchdnn_modeC_reorder_int4_cpu ....................   Passed    0.35 sec
        Start 302: test_benchdnn_modeC_resampling_all_cpu
302/335 Test #302: test_benchdnn_modeC_resampling_all_cpu ..................   Passed   11.00 sec
        Start 303: test_benchdnn_modeC_resampling_bfloat16_cpu
303/335 Test #303: test_benchdnn_modeC_resampling_bfloat16_cpu .............   Passed    0.03 sec
        Start 304: test_benchdnn_modeC_resampling_float16_cpu
304/335 Test #304: test_benchdnn_modeC_resampling_float16_cpu ..............   Passed    0.02 sec
        Start 305: test_benchdnn_modeC_augru_all_cpu
305/335 Test #305: test_benchdnn_modeC_augru_all_cpu .......................   Passed    0.92 sec
        Start 306: test_benchdnn_modeC_augru_bf32_bfloat16_cpu
306/335 Test #306: test_benchdnn_modeC_augru_bf32_bfloat16_cpu .............   Passed    0.32 sec
        Start 307: test_benchdnn_modeC_augru_bfloat16_cpu
307/335 Test #307: test_benchdnn_modeC_augru_bfloat16_cpu ..................   Passed    0.02 sec
        Start 308: test_benchdnn_modeC_augru_float16_cpu
308/335 Test #308: test_benchdnn_modeC_augru_float16_cpu ...................   Passed    0.02 sec
        Start 309: test_benchdnn_modeC_gru_all_cpu
309/335 Test #309: test_benchdnn_modeC_gru_all_cpu .........................   Passed   91.48 sec
        Start 310: test_benchdnn_modeC_gru_bf32_bfloat16_cpu
310/335 Test #310: test_benchdnn_modeC_gru_bf32_bfloat16_cpu ...............   Passed   13.67 sec
        Start 311: test_benchdnn_modeC_gru_bfloat16_cpu
311/335 Test #311: test_benchdnn_modeC_gru_bfloat16_cpu ....................   Passed    0.09 sec
        Start 312: test_benchdnn_modeC_gru_float16_cpu
312/335 Test #312: test_benchdnn_modeC_gru_float16_cpu .....................   Passed    0.08 sec
        Start 313: test_benchdnn_modeC_gru_int8_cpu
313/335 Test #313: test_benchdnn_modeC_gru_int8_cpu ........................   Passed    0.03 sec
        Start 314: test_benchdnn_modeC_lstm_bf32_bfloat16_cpu
314/335 Test #314: test_benchdnn_modeC_lstm_bf32_bfloat16_cpu ..............   Passed   62.20 sec
        Start 315: test_benchdnn_modeC_lstm_bfloat16_cpu
315/335 Test #315: test_benchdnn_modeC_lstm_bfloat16_cpu ...................   Passed    0.31 sec
        Start 316: test_benchdnn_modeC_lstm_bfloat16_ymm_cpu
316/335 Test #316: test_benchdnn_modeC_lstm_bfloat16_ymm_cpu ...............   Passed    0.32 sec
        Start 317: test_benchdnn_modeC_lstm_f32_cpu
317/335 Test #317: test_benchdnn_modeC_lstm_f32_cpu ........................   Passed  327.29 sec
        Start 318: test_benchdnn_modeC_lstm_float16_cpu
318/335 Test #318: test_benchdnn_modeC_lstm_float16_cpu ....................   Passed    0.32 sec
        Start 319: test_benchdnn_modeC_lstm_int8_cpu
319/335 Test #319: test_benchdnn_modeC_lstm_int8_cpu .......................   Passed    0.30 sec
        Start 320: test_benchdnn_modeC_rnn_all_cpu
320/335 Test #320: test_benchdnn_modeC_rnn_all_cpu .........................   Passed  140.04 sec
        Start 321: test_benchdnn_modeC_rnn_bf32_bfloat16_cpu
321/335 Test #321: test_benchdnn_modeC_rnn_bf32_bfloat16_cpu ...............   Passed   20.60 sec
        Start 322: test_benchdnn_modeC_rnn_bfloat16_cpu
322/335 Test #322: test_benchdnn_modeC_rnn_bfloat16_cpu ....................   Passed    0.10 sec
        Start 323: test_benchdnn_modeC_rnn_float16_cpu
323/335 Test #323: test_benchdnn_modeC_rnn_float16_cpu .....................   Passed    0.10 sec
        Start 324: test_benchdnn_modeC_self_f32_cpu
324/335 Test #324: test_benchdnn_modeC_self_f32_cpu ........................   Passed    0.06 sec
        Start 325: test_benchdnn_modeC_shuffle_all_cpu
325/335 Test #325: test_benchdnn_modeC_shuffle_all_cpu .....................   Passed    9.49 sec
        Start 326: test_benchdnn_modeC_shuffle_bfloat16_cpu
326/335 Test #326: test_benchdnn_modeC_shuffle_bfloat16_cpu ................   Passed    0.02 sec
        Start 327: test_benchdnn_modeC_shuffle_float16_cpu
327/335 Test #327: test_benchdnn_modeC_shuffle_float16_cpu .................   Passed    0.02 sec
        Start 328: test_benchdnn_modeC_softmax_acl_cpu
328/335 Test #328: test_benchdnn_modeC_softmax_acl_cpu .....................   Passed    0.03 sec
        Start 329: test_benchdnn_modeC_softmax_all_cpu
329/335 Test #329: test_benchdnn_modeC_softmax_all_cpu .....................   Passed  352.44 sec
        Start 330: test_benchdnn_modeC_softmax_bfloat16_cpu
330/335 Test #330: test_benchdnn_modeC_softmax_bfloat16_cpu ................   Passed    0.23 sec
        Start 331: test_benchdnn_modeC_softmax_float16_cpu
331/335 Test #331: test_benchdnn_modeC_softmax_float16_cpu .................   Passed    0.22 sec
        Start 332: test_benchdnn_modeC_sum_all_cpu
332/335 Test #332: test_benchdnn_modeC_sum_all_cpu .........................   Passed    2.69 sec
        Start 333: test_benchdnn_modeC_sum_bfloat16_cpu
333/335 Test #333: test_benchdnn_modeC_sum_bfloat16_cpu ....................   Passed    0.23 sec
        Start 334: test_benchdnn_modeC_sum_float16_cpu
334/335 Test #334: test_benchdnn_modeC_sum_float16_cpu .....................   Passed    0.23 sec
        Start 335: noexcept-cpp
335/335 Test #335: noexcept-cpp ............................................   Passed    0.01 sec

100% tests passed, 0 tests failed out of 335

Total Test time (real) = 4667.56 sec


I have run the perf* test cases, and these are the results.

perf_matmul_inference_batched

Performance Summary (GFlops Improvement)
Average Improvement: 30.9%

perf_matmul_training

Average Improvement: 581.07%
Median Improvement: 44.02%

perf_matmul_inference_lb

Average Improvement: 3.42x speedup (242% improvement) across all valid benchmarks
Median Improvement: 1.55x speedup (55% improvement)

Below are some GFLOPs figures measured without and with the oneDNN changes:

Perf Matmul Inference LB

Performance Report Of matmul large batch
| Name | Original GFLOPs | Modified GFLOPs | Improvement (x) | Improvement (%) |
|---|---|---|---|---|
| GNMT:0*1 | 164.214 | 384.374 | 2.34x | 134.07% |
| GNMT:1*1 | 265.652 | 894.157 | 3.37x | 236.62% |
| WnD-512:0*1 | 772.359 | 1346.7 | 1.74x | 74.36% |
| WnD-512:1*1 | 784.183 | 1187.7 | 1.51x | 51.46% |
| WnD-512:2*1 | 453.494 | 760.867 | 1.68x | 67.78% |
| resnet:ip1*1 | 90.1254 | 1724 | 19.13x | 1812.98% |
| resnet_sparse:ip1*1 | 110.76 | 1480.19 | 13.36x | 1236.62% |
| googlenet_v1:ip1*1 | 92.1656 | 1412.88 | 15.33x | 1433.02% |
| googlenet_v1:ip2*1 | 261.521 | 1333.1 | 5.10x | 409.75% |
| inceptionv3:ip1*1 | 139.086 | 1977.31 | 14.22x | 1321.61% |
| VGG16:ip1*1 | 52.6008 | 4053.91 | 77.07x | 7606.19% |
| VGG16:ip2*1 | 65.5996 | 2542.89 | 38.76x | 3776.38% |
| VGG16:ip3*1 | 47.3232 | 394.87 | 8.34x | 734.36% |
| VGG16:ip4*1 | 121.101 | 683.832 | 5.65x | 464.76% |
| NCF:0*1 | 661.168 | 717.718 | 1.09x | 8.55% |
| NCF:1*1 | 558.028 | 642.415 | 1.15x | 15.12% |
| NCF:2*1 | 256.874 | 283.381 | 1.10x | 10.32% |
| NCF:3*1 | 18.2597 | 11.2339 | 0.62x | -38.48% |
| Alexnet:ip1*1 | 320.218 | 3628.01 | 11.33x | 1033.08% |
| Alexnet:ip2*1 | 351.288 | 3128.06 | 8.91x | 790.68% |
| Alexnet:ip3*1 | 443.303 | 3080.25 | 6.95x | 594.84% |
| masknet:ip1*1 | 522.195 | 2922.14 | 5.60x | 459.59% |
| masknet:ip2*1 | 1056.25 | 1817.74 | 1.72x | 72.11% |
| masknet:ip3*1 | 1539.48 | 1495.31 | 0.97x | -2.87% |
| masknet:ip4*1 | 420.456 | 1130.4 | 2.69x | 168.85% |
| RNN-T:Encoder_cell1_Input*2 | 499.792 | 741.557 | 1.48x | 48.37% |
| RNN-T:Encoder_cell1_Hidden*11 | 357.773 | 2010.39 | 5.62x | 462.03% |
| RNN-T:Encoder_cell3_Input*1 | 335.125 | 2653.63 | 7.92x | 692.06% |
| RNN-T:Prediction_Input*12 | 859.962 | 891.03 | 1.04x | 3.61% |
| RNN-T:JointNet_Linear1*3 | 1223.47 | 1658.04 | 1.36x | 35.52% |
| RNN-T:JointNet_Linear2*3 | 122.171 | 395.666 | 3.24x | 223.86% |
| DLRM:0*1 | 43.4487 | 43.4487 | 1.00x | 0.00% |
| DLRM:1*2 | 1048.49 | 1223.28 | 1.17x | 16.68% |
| DLRM:2*1 | 557.495 | 641.958 | 1.15x | 15.15% |
| DLRM:3*1 | 1221.58 | 1232.88 | 1.01x | 0.92% |
| DLRM:4*1 | 1308.72 | 2032.19 | 1.55x | 55.30% |
| DLRM:5*1 | 1795.56 | 1938.74 | 1.08x | 7.97% |
| DLRM:7*1 | 30.5793 | 21.7662 | 0.71x | -28.82% |
| BERT:MM_5*24 | 1024.66 | 919.494 | 0.90x | -10.26% |
| Transformer_lt:Encoder_MM_5*6 | 131.099 | 123.769 | 0.94x | -5.59% |
| Transformer_lt:Decoder_MM_5*240 | 84.6122 | 88.1706 | 1.04x | 4.21% |
| Transformer_lt:Decoder_MM_yy20*240 | 20.7003 | 22.0387 | 1.06x | 6.46% |
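
For readers who want to check the figures, here is a minimal sketch of how the speedup and percentage columns in the report above can be derived, and why the average and median improvements differ so much. It assumes the improvement is defined as modified/original - 1, which the thread does not state explicitly. The helper names `speedup` and `improvement_pct` are illustrative, not part of oneDNN or benchdnn, and the five rows are only a sample copied from the report.

```python
from statistics import mean, median

# (name, original GFLOPs, modified GFLOPs): a sample of rows from the report.
rows = [
    ("GNMT:0*1", 164.214, 384.374),
    ("WnD-512:0*1", 772.359, 1346.7),
    ("VGG16:ip2*1", 65.5996, 2542.89),
    ("NCF:3*1", 18.2597, 11.2339),
    ("DLRM:0*1", 43.4487, 43.4487),
]


def speedup(original, modified):
    """Ratio of modified to original throughput (assumed definition)."""
    return modified / original


def improvement_pct(original, modified):
    """Relative change in percent; negative values are regressions."""
    return (modified / original - 1.0) * 100.0


speedups = [speedup(o, m) for _, o, m in rows]
for (name, o, m), s in zip(rows, speedups):
    print(f"{name:14s} {s:6.2f}x {improvement_pct(o, m):9.2f}%")

# A few very large speedups pull the mean far above the median,
# so the median is the more representative "typical" figure.
mean_speedup = mean(speedups)
median_speedup = median(speedups)
print(f"mean speedup:   {mean_speedup:.2f}x")
print(f"median speedup: {median_speedup:.2f}x")
```

Because the distribution is dominated by a handful of inner-product shapes with very large gains, the median is probably the safer number to quote when summarizing the whole suite.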

I have run the other tests too, and they also show a performance benefit. perf_matmul_training

Details report of Perf Matmul Training
| Operation Name | Original GFLOPs | New GFLOPs | Improvement (%) |
|---|---|---|---|
| GNMT_train:FWD,0*1 | 255.796 | 577.93 | 125.93% |
| GNMT_train:FWD,1*1 | 213.356 | 1181.14 | 453.59% |
| GNMT_train:BWD_D,0*1 | 315.642 | 590.895 | 87.21% |
| GNMT_train:BWD_D,1*1 | 273.425 | 1191.61 | 335.81% |
| GNMT_train:BWD_W,0*1 | 292.157 | 335.077 | 14.69% |
| GNMT_train:BWD_W,1*1 | 384.289 | 411.557 | 7.10% |
| WnD-40_train:FWD,0*1 | 436.866 | 758.85 | 73.70% |
| WnD-40_train:FWD,1*1 | 406.974 | 532.134 | 30.75% |
| WnD-40_train:FWD,2*1 | 229.095 | 251.684 | 9.86% |
| WnD-40_train:BWD_D,0*1 | 272.789 | 812.602 | 197.88% |
| WnD-40_train:BWD_D,1*1 | 418.013 | 599.863 | 43.50% |
| WnD-40_train:BWD_D,2*1 | 225.569 | 226.982 | 0.63% |
| WnD-40_train:BWD_W,0*1 | 134.92 | 129.459 | -4.05% |
| WnD-40_train:BWD_W,1*1 | 126.821 | 126.808 | -0.01% |
| WnD-40_train:BWD_W,2*1 | 93.1004 | 101.648 | 9.18% |
| resnet_train:FWD,ip1*1 | 87.9965 | 1881.27 | 2037.99% |
| resnet_sparse_train:FWD,ip1*1 | 111.89 | 1637.81 | 1363.88% |
| resnet_train:BWD_D,ip1*1 | 107.769 | 1279.43 | 1087.17% |
| resnet_sparse_train:BWD_D,ip1*1 | 126.008 | 1170.93 | 829.25% |
| resnet_train:BWD_W,ip1*1 | 382.386 | 388.569 | 1.62% |
| resnet_sparse_train:BWD_W,ip1*1 | 225.692 | 225.923 | 0.10% |
| googlenet_v1_train:FWD,ip1*1 | 90.5023 | 1493.63 | 1550.38% |
| googlenet_v1_train:FWD,ip2*1 | 155.082 | 1451.61 | 836.13% |
| inceptionv3_train:FWD,ip1*1 | 135.547 | 2019.27 | 1389.70% |
| googlenet_v1_train:BWD_D,ip1*1 | 109.674 | 1331.91 | 1114.51% |
| googlenet_v1_train:BWD_D,ip2*1 | 220.256 | 1174.7 | 433.25% |
| inceptionv3_train:BWD_D,ip1*1 | 146.799 | 1407.31 | 858.80% |
| googlenet_v1_train:BWD_W,ip1*1 | 426.755 | 421.418 | -1.25% |
| googlenet_v1_train:BWD_W,ip2*1 | 395.216 | 424.881 | 7.50% |
| inceptionv3_train:BWD_W,ip1*1 | 709.11 | 716.597 | 1.06% |
| VGG16_train:FWD,ip1*1 | 52.7683 | 4059.26 | 7592.56% |
| VGG16_train:FWD,ip2*1 | 67.6152 | 2540.64 | 3656.93% |
| VGG16_train:FWD,ip3*1 | 48.1058 | 458.098 | 852.28% |
| VGG16_train:FWD,ip4*1 | 122.528 | 713.05 | 482.10% |
| VGG16_train:BWD_D,ip1*1 | 78.8694 | 3049.83 | 3766.55% |
| VGG16_train:BWD_D,ip2*1 | 80.3912 | 2715.68 | 3277.72% |
| VGG16_train:BWD_D,ip3*1 | 144.421 | 242.304 | 67.78% |
| VGG16_train:BWD_D,ip4*1 | 262.371 | 790.497 | 201.29% |
| VGG16_train:BWD_W,ip1*1 | 197.683 | 215.704 | 9.12% |
| VGG16_train:BWD_W,ip2*1 | 218.549 | 223.991 | 2.49% |
| VGG16_train:BWD_W,ip3*1 | 187.881 | 197.001 | 4.85% |
| VGG16_train:BWD_W,ip4*1 | 221.383 | 222.324 | 0.42% |
| NCF_train:FWD,0*1 | 659.18 | 747.211 | 13.35% |
| NCF_train:FWD,1*1 | 556.694 | 684.544 | 22.97% |
| NCF_train:FWD,2*1 | 256.06 | 330.387 | 29.03% |
| NCF_train:FWD,3*1 | 18.1384 | 19.466 | 7.32% |
| NCF_train:BWD_D,0*1 | 704.312 | 750.246 | 6.52% |
| NCF_train:BWD_D,1*1 | 388.405 | 401.794 | 3.45% |
| NCF_train:BWD_D,2*1 | 188.175 | 191.035 | 1.52% |
| NCF_train:BWD_D,3*1 | 3.19066 | 3.15545 | -1.10% |
| NCF_train:BWD_W,0*1 | 293.183 | 906.591 | 209.22% |
| NCF_train:BWD_W,1*1 | 170.72 | 648.933 | 280.11% |
| NCF_train:BWD_W,2*1 | 49.5902 | 444.603 | 796.56% |
| NCF_train:BWD_W,3*1 | 3.79355 | 5.4827 | 44.53% |

Please let me know if you need anything else. I ran these tests with numactl, after aligning the NUMA nodes on my system.

Tiwari-Avanish avatar May 30 '25 22:05 Tiwari-Avanish

Hi @spalicki, I have updated the perf report you asked for, now with NUMA node alignment. Could you please take a look at the comment and review it? If anything is missing on my side, please let me know. Thanks for guiding me with this PR.

Tiwari-Avanish avatar Jun 03 '25 19:06 Tiwari-Avanish

Hi @spalicki,

I have made the changes you suggested and rebased onto the main branch. I have also fixed the linter error; please take a look.

Thanks @spalicki for reviewing the PR.

Tiwari-Avanish avatar Jun 04 '25 04:06 Tiwari-Avanish

Hi @spalicki, could you please take a look at this PR? I have made the changes you suggested and rebased. If anything else is required, please let me know.

Tiwari-Avanish avatar Jun 04 '25 22:06 Tiwari-Avanish

Hi @spalicki

Whenever you get time, please take a look at the PR. Thanks for your guidance.

Tiwari-Avanish avatar Jun 05 '25 17:06 Tiwari-Avanish

Hi @spalicki

Thanks for approving these changes.

The PR needs at least two approving reviewers before it can be merged into the main branch. Could you please suggest or add an additional reviewer to review and approve the changes?

Tiwari-Avanish avatar Jun 06 '25 06:06 Tiwari-Avanish

Hi @spalicki

Could you please add or suggest someone to review this PR? I will add them to this PR.

Tiwari-Avanish avatar Jun 11 '25 03:06 Tiwari-Avanish

Thanks @dzarukin for reviewing this. I have made the changes you asked for; please review them. If any further changes are required, please let me know.

Tiwari-Avanish avatar Jun 12 '25 18:06 Tiwari-Avanish

Hi @dzarukin

I have made the changes you asked for. Could you please look at the changes I made based on your previous review? Whenever you are free, please take a look and let me know if I need to change anything.

Tiwari-Avanish avatar Jun 17 '25 15:06 Tiwari-Avanish

Thank you for waiting, and sorry it took so long to approve.

dzarukin avatar Jun 18 '25 01:06 dzarukin

Thanks @dzarukin @spalicki for reviewing and approving this PR.

Could you please help me with merging these changes?

Tiwari-Avanish avatar Jun 18 '25 07:06 Tiwari-Avanish

This PR was reverted as it breaks compatibility with PowerPC systems without MMA support. Changes are preserved in #3974.

vpirogov avatar Sep 19 '25 22:09 vpirogov

@spalicki @vpirogov
I will fix the issue and restore this change: https://github.com/uxlfoundation/oneDNN/pull/3974.

Tiwari-Avanish avatar Sep 21 '25 20:09 Tiwari-Avanish