Apple AMX GEMM optimization
Progress:
- [x] packB 32 & 32x8 microkernel & test pass
- [ ] 32x32 / 32x64 / 64x64 microkernel && packing and unpacking
Unfortunately, my internship work has kept me too busy to finish this optimization: so far only the 32x8 microkernel with packB 32 is implemented. The performance gain could be much higher once the larger microkernels and their packing/unpacking are implemented.
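For readers unfamiliar with the terms, here is a minimal scalar sketch of what "packB 32" plus a 32x8 microkernel could look like. It is not the code in this PR and uses no AMX instructions. The names `pack_b_32`, `microkernel_8x32_ref` and `gemm_packed` are hypothetical, and the sketch assumes row-major float matrices, a tile of 8 rows of A by 32 columns of B, and M being a multiple of 8.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t NR = 32; // columns of B per packed panel (assumed meaning of "packB 32")
constexpr std::size_t MR = 8;  // rows of A per microkernel call (assumed tile orientation)

// Hypothetical helper: pack row-major B (K x N) into panels of 32 columns.
// Inside a panel the 32 values for each k are contiguous; columns past N are zero-filled.
std::vector<float> pack_b_32(const float* B, std::size_t K, std::size_t N)
{
    const std::size_t panels = (N + NR - 1) / NR;
    std::vector<float> packed(panels * K * NR, 0.0f);
    for (std::size_t p = 0; p < panels; ++p)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < NR; ++j)
            {
                const std::size_t col = p * NR + j;
                if (col < N)
                    packed[(p * K + k) * NR + j] = B[k * N + col];
            }
    return packed;
}

// Scalar reference microkernel: C[8 x 32] += A[8 x K] * panel[K x 32].
// A real AMX kernel would replace the inner loops with outer-product instructions.
void microkernel_8x32_ref(const float* A, std::size_t lda, const float* panel,
                          std::size_t K, float* C, std::size_t ldc)
{
    float acc[MR][NR] = {};
    for (std::size_t k = 0; k < K; ++k)
        for (std::size_t i = 0; i < MR; ++i)
        {
            const float a = A[i * lda + k];
            for (std::size_t j = 0; j < NR; ++j)
                acc[i][j] += a * panel[k * NR + j];
        }
    for (std::size_t i = 0; i < MR; ++i)
        for (std::size_t j = 0; j < NR; ++j)
            C[i * ldc + j] += acc[i][j];
}

// Driver: C = A (M x K) * B (K x N). Requires M % 8 == 0 in this sketch.
std::vector<float> gemm_packed(const std::vector<float>& A, const std::vector<float>& B,
                               std::size_t M, std::size_t K, std::size_t N)
{
    const std::vector<float> packed = pack_b_32(B.data(), K, N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += MR)
        for (std::size_t j0 = 0; j0 < N; j0 += NR)
        {
            float tile[MR * NR] = {}; // scratch tile so edge panels never write past N
            microkernel_8x32_ref(&A[i0 * K], K, &packed[(j0 / NR) * K * NR], K, tile, NR);
            for (std::size_t i = 0; i < MR; ++i)
                for (std::size_t j = 0; j < NR && j0 + j < N; ++j)
                    C[(i0 + i) * N + j0 + j] = tile[i * NR + j];
        }
    return C;
}
```

Packing pays off because the microkernel then reads B with unit stride for every k, which is the access pattern a wide outer-product unit needs. The larger 32x32, 32x64 and 64x64 tiles listed above would reuse each loaded panel across more rows of A.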
Benchmarking
Model: test_gemm.param.zip. Lines added to benchncnn.cpp:
benchmark("test_gemm1024", ncnn::Mat(1024, 1024, 3), opt);
benchmark("test_gemm2048", ncnn::Mat(2048, 2048, 3), opt);
benchmark("test_gemm4096", ncnn::Mat(4096, 4096, 3), opt);
benchmark("test_gemm8192", ncnn::Mat(8192, 8192, 3), opt);
32 layers of [dim, dim] @ [dim, dim] GEMMs on Apple M4, first with the AMX build (../build), then with the build without AMX (../build-noamx):
# molly @ mollydeMac-mini in ~/ncnn/benchmark on git:apple-amx-remastered x [10:40:06]
$ ../build/benchmark/benchncnn.app/Contents/MacOS/benchncnn 100 4 2 -1 1
loop_count = 100
num_threads = 4
powersave = 2
gpu_device = -1
cooling_down = 1
test_gemm1024 min = 2.92 max = 5.42 avg = 3.04
test_gemm2048 min = 10.93 max = 20.71 avg = 11.49
test_gemm4096 min = 42.90 max = 73.63 avg = 44.38
test_gemm8192 min = 172.38 max = 1099.07 avg = 211.27
# molly @ mollydeMac-mini in ~/ncnn/benchmark on git:apple-amx-remastered x [10:53:14]
$ ../build-noamx/benchmark/benchncnn.app/Contents/MacOS/benchncnn 100 4 2 -1 1
loop_count = 100
num_threads = 4
powersave = 2
gpu_device = -1
cooling_down = 1
test_gemm1024 min = 5.35 max = 7.99 avg = 6.43
test_gemm2048 min = 21.85 max = 29.20 avg = 24.55
test_gemm4096 min = 89.58 max = 97.65 avg = 93.08
test_gemm8192 min = 511.64 max = 2068.80 avg = 1229.32
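To make the comparison easier to read, the two logs above can be reduced to speedup ratios. The snippet below is a small sketch that hard-codes the `min` and `avg` values copied from those logs; the struct and function names are hypothetical and not part of benchncnn.

```cpp
#include <array>
#include <cstdio>

// Timings copied from the two benchncnn runs above (AMX build vs. non-AMX build).
struct GemmTiming
{
    const char* name;
    double amx_min, amx_avg;     // ../build
    double noamx_min, noamx_avg; // ../build-noamx
};

constexpr std::array<GemmTiming, 4> kTimings = {{
    {"test_gemm1024", 2.92, 3.04, 5.35, 6.43},
    {"test_gemm2048", 10.93, 11.49, 21.85, 24.55},
    {"test_gemm4096", 42.90, 44.38, 89.58, 93.08},
    {"test_gemm8192", 172.38, 211.27, 511.64, 1229.32},
}};

// Speedup = non-AMX time / AMX time (higher is better).
constexpr double speedup_avg(const GemmTiming& t) { return t.noamx_avg / t.amx_avg; }
constexpr double speedup_min(const GemmTiming& t) { return t.noamx_min / t.amx_min; }

// Print one line per benchmark with both ratios.
void print_speedups()
{
    for (const GemmTiming& t : kTimings)
        std::printf("%s: %.2fx (avg), %.2fx (min)\n", t.name, speedup_avg(t), speedup_min(t));
}
```

By average time this works out to roughly 2.1x for the 1024 to 4096 cases and about 5.8x for 8192. The non-AMX 8192 run is noisy (avg 1229.32 against min 511.64), so the min-based ratio of about 3.0x is probably the more conservative figure there.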
Testing
# molly @ mollydeMac-mini in ~/ncnn/build on git:apple-amx-remastered x [10:03:21]
$ ctest --output-on-failure -j10
Test project /Users/molly/ncnn/build
Start 17: test_binaryop_3
Start 123: test_slice
Start 24: test_convolution
Start 26: test_convolution_2
Start 25: test_convolution_1
Start 125: test_softmax
Start 71: test_gemm_3
Start 38: test_crop_1
Start 43: test_deconvolution
Start 16: test_binaryop_2
1/135 Test #43: test_deconvolution ............... Passed 0.67 sec
Start 69: test_gemm
2/135 Test #71: test_gemm_3 ...................... Passed 1.22 sec
Start 53: test_deformableconv2d_2
3/135 Test #38: test_crop_1 ...................... Passed 1.25 sec
Start 52: test_deformableconv2d_1
4/135 Test #125: test_softmax ..................... Passed 1.41 sec
Start 31: test_convolutiondepthwise
5/135 Test #16: test_binaryop_2 .................. Passed 1.48 sec
Start 54: test_deformableconv2d_3
6/135 Test #26: test_convolution_2 ............... Passed 1.64 sec
Start 15: test_binaryop_1
7/135 Test #69: test_gemm ........................ Passed 1.01 sec
Start 72: test_gemm_4
8/135 Test #24: test_convolution ................. Passed 1.82 sec
Start 14: test_binaryop
9/135 Test #123: test_slice ....................... Passed 1.98 sec
Start 51: test_deformableconv2d
10/135 Test #25: test_convolution_1 ............... Passed 2.11 sec
Start 42: test_cumulativesum
11/135 Test #52: test_deformableconv2d_1 .......... Passed 0.95 sec
Start 37: test_crop
12/135 Test #17: test_binaryop_3 .................. Passed 2.24 sec
Start 30: test_convolution3d
13/135 Test #54: test_deformableconv2d_3 .......... Passed 0.85 sec
Start 70: test_gemm_1
14/135 Test #53: test_deformableconv2d_2 .......... Passed 1.13 sec
Start 47: test_deconvolutiondepthwise_1
15/135 Test #72: test_gemm_4 ...................... Passed 0.82 sec
Start 27: test_convolution_3
16/135 Test #31: test_convolutiondepthwise ........ Passed 1.10 sec
Start 46: test_deconvolutiondepthwise
17/135 Test #15: test_binaryop_1 .................. Passed 1.00 sec
Start 100: test_pooling3d
18/135 Test #14: test_binaryop .................... Passed 0.93 sec
Start 45: test_deconvolution3d
19/135 Test #51: test_deformableconv2d ............ Passed 0.79 sec
Start 49: test_deconvolutiondepthwise3d
20/135 Test #42: test_cumulativesum ............... Passed 0.74 sec
Start 36: test_copyto_1
21/135 Test #47: test_deconvolutiondepthwise_1 .... Passed 0.58 sec
Start 135: test_yolov3detectionoutput
22/135 Test #37: test_crop ........................ Passed 0.84 sec
Start 35: test_copyto
23/135 Test #70: test_gemm_1 ...................... Passed 0.77 sec
Start 89: test_multiheadattention
24/135 Test #30: test_convolution3d ............... Passed 0.89 sec
Start 124: test_slice_oom
25/135 Test #46: test_deconvolutiondepthwise ...... Passed 0.67 sec
Start 95: test_padding
26/135 Test #100: test_pooling3d ................... Passed 0.59 sec
Start 29: test_convolution1d
27/135 Test #45: test_deconvolution3d ............. Passed 0.58 sec
Start 112: test_reshape_1
28/135 Test #49: test_deconvolutiondepthwise3d .... Passed 0.57 sec
Start 90: test_multiheadattention_1
29/135 Test #135: test_yolov3detectionoutput ....... Passed 0.49 sec
Start 75: test_gru
30/135 Test #27: test_convolution_3 ............... Passed 0.94 sec
Start 73: test_gridsample
31/135 Test #36: test_copyto_1 .................... Passed 0.60 sec
Start 108: test_reorg
32/135 Test #89: test_multiheadattention .......... Passed 0.44 sec
Start 44: test_deconvolution1d
33/135 Test #124: test_slice_oom ................... Passed 0.48 sec
Start 80: test_interp
34/135 Test #95: test_padding ..................... Passed 0.51 sec
Start 98: test_pooling
35/135 Test #112: test_reshape_1 ................... Passed 0.39 sec
Start 96: test_permute
36/135 Test #108: test_reorg ....................... Passed 0.32 sec
Start 114: test_rmsnorm
37/135 Test #75: test_gru ......................... Passed 0.38 sec
Start 22: test_concat
38/135 Test #35: test_copyto ...................... Passed 0.76 sec
Start 74: test_groupnorm
39/135 Test #90: test_multiheadattention_1 ........ Passed 0.49 sec
Start 102: test_prelu
40/135 Test #29: test_convolution1d ............... Passed 0.61 sec
Start 109: test_requantize
41/135 Test #73: test_gridsample .................. Passed 0.46 sec
Start 126: test_softmax_oom
42/135 Test #44: test_deconvolution1d ............. Passed 0.43 sec
Start 132: test_tile
43/135 Test #96: test_permute ..................... Passed 0.33 sec
Start 85: test_lstm
44/135 Test #98: test_pooling ..................... Passed 0.40 sec
Start 48: test_deconvolutiondepthwise1d
45/135 Test #114: test_rmsnorm ..................... Passed 0.32 sec
Start 92: test_noop
46/135 Test #74: test_groupnorm ................... Passed 0.43 sec
Start 127: test_softplus
47/135 Test #102: test_prelu ....................... Passed 0.42 sec
Start 106: test_reduction
48/135 Test #22: test_concat ...................... Passed 0.50 sec
Start 59: test_einsum
49/135 Test #109: test_requantize .................. Passed 0.46 sec
Start 91: test_multiheadattention_oom
50/135 Test #126: test_softmax_oom ................. Passed 0.41 sec
Start 86: test_matmul
51/135 Test #132: test_tile ........................ Passed 0.38 sec
Start 55: test_deformableconv2d_4
52/135 Test #48: test_deconvolutiondepthwise1d .... Passed 0.42 sec
Start 118: test_scale
53/135 Test #92: test_noop ........................ Passed 0.42 sec
Start 99: test_pooling1d
54/135 Test #85: test_lstm ........................ Passed 0.46 sec
Start 60: test_eltwise
55/135 Test #127: test_softplus .................... Passed 0.34 sec
Start 111: test_reshape
56/135 Test #106: test_reduction ................... Passed 0.40 sec
Start 50: test_deepcopy
57/135 Test #91: test_multiheadattention_oom ...... Passed 0.49 sec
Start 67: test_gelu
58/135 Test #80: test_interp ...................... Passed 1.19 sec
Start 94: test_packing
59/135 Test #86: test_matmul ...................... Passed 0.48 sec
Start 128: test_spectrogram
60/135 Test #59: test_einsum ...................... Passed 0.53 sec
Start 107: test_relu
61/135 Test #55: test_deformableconv2d_4 .......... Passed 0.50 sec
Start 119: test_selu
62/135 Test #118: test_scale ....................... Passed 0.35 sec
Start 84: test_lrn
63/135 Test #50: test_deepcopy .................... Passed 0.36 sec
Start 64: test_expanddims
64/135 Test #99: test_pooling1d ................... Passed 0.50 sec
Start 66: test_fold
65/135 Test #60: test_eltwise ..................... Passed 0.51 sec
Start 61: test_elu
66/135 Test #111: test_reshape ..................... Passed 0.51 sec
Start 88: test_mish
67/135 Test #67: test_gelu ........................ Passed 0.31 sec
Start 115: test_rnn
68/135 Test #94: test_packing ..................... Passed 0.31 sec
Start 97: test_pixelshuffle
69/135 Test #119: test_selu ........................ Passed 0.44 sec
Start 58: test_dropout
70/135 Test #128: test_spectrogram ................. Passed 0.47 sec
71/135 Test #84: test_lrn ......................... Passed 0.44 sec
Start 56: test_dequantize
Start 78: test_innerproduct
72/135 Test #107: test_relu ........................ Passed 0.47 sec
Start 65: test_flatten
73/135 Test #64: test_expanddims .................. Passed 0.34 sec
Start 87: test_memorydata
74/135 Test #66: test_fold ........................ Passed 0.34 sec
Start 120: test_shrink
75/135 Test #61: test_elu ......................... Passed 0.51 sec
Start 116: test_roipooling
76/135 Test #97: test_pixelshuffle ................ Passed 0.43 sec
Start 129: test_squeeze
77/135 Test #88: test_mish ........................ Passed 0.46 sec
Start 57: test_diag
78/135 Test #115: test_rnn ......................... Passed 0.48 sec
Start 101: test_power
79/135 Test #58: test_dropout ..................... Passed 0.30 sec
Start 131: test_tanh
80/135 Test #78: test_innerproduct ................ Passed 0.31 sec
Start 93: test_normalize
81/135 Test #87: test_memorydata .................. Passed 0.47 sec
82/135 Test #120: test_shrink ...................... Passed 0.46 sec
Start 113: test_reshape_oom
Start 103: test_priorbox
83/135 Test #65: test_flatten ..................... Passed 0.53 sec
Start 117: test_roialign
84/135 Test #56: test_dequantize .................. Passed 0.54 sec
Start 110: test_requantize_oom
85/135 Test #129: test_squeeze ..................... Passed 0.34 sec
Start 104: test_quantize
86/135 Test #116: test_roipooling .................. Passed 0.35 sec
Start 76: test_hardsigmoid
87/135 Test #57: test_diag ........................ Passed 0.51 sec
Start 82: test_inversespectrogram
88/135 Test #131: test_tanh ........................ Passed 0.46 sec
Start 83: test_layernorm
89/135 Test #101: test_power ....................... Passed 0.48 sec
Start 105: test_quantize_oom
90/135 Test #117: test_roialign .................... Passed 0.32 sec
Start 133: test_unaryop
91/135 Test #113: test_reshape_oom ................. Passed 0.32 sec
Start 79: test_instancenorm
92/135 Test #110: test_requantize_oom .............. Passed 0.51 sec
Start 40: test_crop_3
93/135 Test #104: test_quantize .................... Passed 0.46 sec
Start 122: test_sigmoid
94/135 Test #103: test_priorbox .................... Passed 0.53 sec
Start 130: test_swish
95/135 Test #76: test_hardsigmoid ................. Passed 0.49 sec
Start 77: test_hardswish
96/135 Test #82: test_inversespectrogram .......... Passed 0.36 sec
Start 63: test_erf
97/135 Test #93: test_normalize ................... Passed 0.85 sec
Start 68: test_glu
98/135 Test #79: test_instancenorm ................ Passed 0.44 sec
Start 32: test_convolutiondepthwise_1
99/135 Test #105: test_quantize_oom ................ Passed 0.52 sec
Start 81: test_interp_1
100/135 Test #83: test_layernorm ................... Passed 0.53 sec
Start 121: test_shufflechannel
101/135 Test #133: test_unaryop ..................... Passed 0.50 sec
Start 62: test_embed
102/135 Test #122: test_sigmoid ..................... Passed 0.32 sec
Start 134: test_unfold
103/135 Test #40: test_crop_3 ...................... Passed 0.37 sec
Start 41: test_crop_oom
104/135 Test #63: test_erf ......................... Passed 0.45 sec
Start 39: test_crop_2
105/135 Test #77: test_hardswish ................... Passed 0.48 sec
Start 6: test_squeezenet
106/135 Test #130: test_swish ....................... Passed 0.51 sec
Start 34: test_convolutiondepthwise3d
107/135 Test #68: test_glu ......................... Passed 0.41 sec
Start 33: test_convolutiondepthwise1d
108/135 Test #32: test_convolutiondepthwise_1 ...... Passed 0.46 sec
Start 5: test_mat_pixel
109/135 Test #81: test_interp_1 .................... Passed 0.49 sec
Start 4: test_mat_pixel_resize
110/135 Test #121: test_shufflechannel .............. Passed 0.52 sec
Start 28: test_convolution_oom
111/135 Test #134: test_unfold ...................... Passed 0.44 sec
Start 20: test_celu
112/135 Test #62: test_embed ....................... Passed 0.48 sec
Start 13: test_bias
113/135 Test #41: test_crop_oom .................... Passed 0.40 sec
Start 9: test_expression
114/135 Test #39: test_crop_2 ...................... Passed 0.32 sec
Start 21: test_clip
115/135 Test #4: test_mat_pixel_resize ............ Passed 0.32 sec
Start 2: test_mat_pixel_drawing
116/135 Test #5: test_mat_pixel ................... Passed 0.35 sec
Start 19: test_cast
117/135 Test #28: test_convolution_oom ............. Passed 0.31 sec
Start 18: test_bnll
118/135 Test #33: test_convolutiondepthwise1d ...... Passed 0.57 sec
Start 12: test_batchnorm
119/135 Test #34: test_convolutiondepthwise3d ...... Passed 0.63 sec
Start 3: test_mat_pixel_rotate
120/135 Test #6: test_squeezenet .................. Passed 0.67 sec
Start 7: test_c_api
121/135 Test #9: test_expression .................. Passed 0.52 sec
Start 10: test_paramdict
122/135 Test #20: test_celu ........................ Passed 0.53 sec
Start 11: test_absval
123/135 Test #13: test_bias ........................ Passed 0.52 sec
Start 1: test_mat_pixel_affine
124/135 Test #21: test_clip ........................ Passed 0.45 sec
Start 23: test_concat_oom
125/135 Test #2: test_mat_pixel_drawing ........... Passed 0.32 sec
Start 8: test_cpu
126/135 Test #19: test_cast ........................ Passed 0.53 sec
127/135 Test #3: test_mat_pixel_rotate ............ Passed 0.42 sec
128/135 Test #18: test_bnll ........................ Passed 0.51 sec
129/135 Test #7: test_c_api ....................... Passed 0.39 sec
130/135 Test #12: test_batchnorm ................... Passed 0.49 sec
131/135 Test #10: test_paramdict ................... Passed 0.32 sec
132/135 Test #8: test_cpu ......................... Passed 0.46 sec
133/135 Test #11: test_absval ...................... Passed 0.54 sec
134/135 Test #23: test_concat_oom .................. Passed 0.54 sec
135/135 Test #1: test_mat_pixel_affine ............ Passed 0.55 sec
100% tests passed, 0 tests failed out of 135
Total Test time (real) = 8.19 sec
The binary size change of libncnn.so (bytes)
| architecture | base size | pr size | difference |
|---|---|---|---|
| x86_64 | 15124728 | 15124784 | +56 :warning: |
| armhf | 6155744 | 6155824 | +80 :warning: |
| aarch64 | 9453192 | 9452928 | -264 :kissing_heart: |
Please enable github action in YOUR FORKED REPO to make code-format workflow work
Codecov Report
:x: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 95.59%. Comparing base (a514cf5) to head (da499ad).
:warning: Report is 9 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/cpu.cpp | 0.00% | 3 Missing :warning: |
Additional details and impacted files
Coverage diff:
| | master | #6293 | +/- |
|---|---|---|---|
| Coverage | 95.89% | 95.59% | -0.30% |
| Files | 837 | 837 | |
| Lines | 264994 | 264997 | +3 |
| Hits | 254105 | 253327 | -778 |
| Misses | 10889 | 11670 | +781 |