
arm: optimize decoder on Arm SVE2 platform

Open cyb70289 opened this issue 1 year ago • 9 comments

This patch improves sonic JSON decoder performance on Arm SVE2 CPUs. It uses the SVE2 MATCH instruction (the svmatch intrinsic) to locate multiple tokens in a vector efficiently.
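To illustrate the idea, here is a minimal sketch of a multi-token search, not the code from this patch. The function name `find_first_token` and its signature are hypothetical. The SVE2 path uses the ACLE intrinsics `svld1rq_u8`, `svmatch_u8`, `svbrkb_b_z` and `svcntp_b8`; when the compiler does not target SVE2, a scalar loop with the same result is used instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_FEATURE_SVE2)
#include <arm_sve.h>
#endif

// Hypothetical helper: returns the index of the first byte in data[0, len)
// that equals any of the `ntok` bytes in `tokens` (1 <= ntok <= 16),
// or `len` if no such byte exists.
inline std::size_t find_first_token(const std::uint8_t* data, std::size_t len,
                                    const std::uint8_t* tokens,
                                    std::size_t ntok) {
  assert(ntok >= 1 && ntok <= 16);
#if defined(__ARM_FEATURE_SVE2)
  // Fill a 16-byte block by repeating the tokens, so that padding never
  // matches a byte the caller did not ask for.
  std::uint8_t block[16];
  for (std::size_t i = 0; i < 16; ++i) block[i] = tokens[i % ntok];
  // Replicate the 128-bit block into every 128-bit segment of the vector;
  // MATCH compares each input byte against the needles of its own segment.
  const svuint8_t needles = svld1rq_u8(svptrue_b8(), block);
  for (std::size_t i = 0; i < len; i += svcntb()) {
    // Predicate that masks off lanes past the end of the input.
    const svbool_t pg = svwhilelt_b8_u64(static_cast<std::uint64_t>(i),
                                         static_cast<std::uint64_t>(len));
    const svuint8_t chunk = svld1_u8(pg, data + i);
    const svbool_t hit = svmatch_u8(pg, chunk, needles);
    if (svptest_any(pg, hit)) {
      // Count the active lanes that precede the first hit.
      return i + svcntp_b8(pg, svbrkb_b_z(pg, hit));
    }
  }
  return len;
#else
  // Scalar fallback with the same semantics.
  for (std::size_t i = 0; i < len; ++i) {
    for (std::size_t t = 0; t < ntok; ++t) {
      if (data[i] == tokens[t]) return i;
    }
  }
  return len;
#endif
}
```

A single MATCH replaces one compare per token plus the ORs that combine them, which is where the gain over a NEON-style search would come from.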

Enable this feature with the CMake option "-DENABLE_SVE2_128=ON". Note that the resulting binary runs only on hardware that supports SVE2 with a 128-bit vector length, such as Neoverse-N2. On any other hardware, the behaviour is undefined.
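Since running on unsupported hardware is undefined, an application could check the CPU before calling into the decoder. The following is a sketch only and assumes Linux on AArch64; the function names are hypothetical and nothing like this is part of the patch. It reads the SVE2 capability bit from `getauxval(AT_HWCAP2)` and the current vector length from `prctl(PR_SVE_GET_VL)`. On other platforms it conservatively returns false.

```cpp
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <sys/prctl.h>
#endif

// Bit 1 of AT_HWCAP2 is HWCAP2_SVE2 in the Linux arm64 <asm/hwcap.h>.
constexpr unsigned long kHwcap2Sve2 = 1UL << 1;

// Pure helper: true only for SVE2 with a 16-byte (128-bit) vector length.
constexpr bool is_sve2_128(unsigned long hwcap2, int vl_bytes) {
  return (hwcap2 & kHwcap2Sve2) != 0 && vl_bytes == 16;
}

// Hypothetical runtime check for the requirement described above.
inline bool cpu_supports_sve2_128() {
#if defined(__linux__) && defined(__aarch64__) && defined(PR_SVE_GET_VL)
  const unsigned long hwcap2 = getauxval(AT_HWCAP2);
  const int vl = prctl(PR_SVE_GET_VL);  // fails (negative) without SVE
  if (vl < 0) return false;
  // The low bits hold the vector length in bytes; upper bits are flags.
  return is_sve2_128(hwcap2, vl & PR_SVE_VL_LEN_MASK);
#else
  return false;  // not Linux/AArch64: assume unsupported
#endif
}
```

The check itself should live in a translation unit that is not compiled with SVE2 flags; otherwise the compiler may emit SVE2 instructions before the check runs.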

As the table below shows, the sonic decoder benchmarks show a clear performance uplift when tested on a Bluewhale server. No side effects were observed in the other benchmarks.

Benchmark                        Original   SVE2      Improvement
gsoc-2018/Decode_SonicDyn        2.38736    2.76677   15.89%
citm_catalog/Decode_SonicDyn     1.41729    1.76191   24.32%
otfcc/Decode_SonicDyn            399.916    413.417    3.38%
fgo/Decode_SonicDyn              691.597    716.301    3.57%
twitter/Decode_SonicDyn          1.33604    1.58737   18.81%
twitterescaped/Decode_SonicDyn   1.24759    1.30216    4.37%
github_events/Decode_SonicDyn    1.38961    1.65635   19.20%
canada/Decode_SonicDyn           526.145    524.517   -0.31%
poet/Decode_SonicDyn             2.06297    2.40383   16.52%
lottie/Decode_SonicDyn           419.902    438.824    4.51%
book/Decode_SonicDyn             456.615    487.196    6.70%

cyb70289 avatar Aug 05 '24 05:08 cyb70289

@cyb70289 What's the unit of your benchmark results? HIB or LIB?

xiegx94 avatar Aug 05 '24 07:08 xiegx94

> @cyb70289 What's the unit of your benchmark results? HIB or LIB?

Gi/s and Mi/s (bytes per second), so higher is better.

For example:

$ build/benchmark/bench --benchmark_filter=Decode_Sonic
gsoc-2018/Decode_SonicDyn         1299148 ns      1299146 ns          537 bytes_per_second=2.38563Gi/s testdata/gsoc-2018.json
citm_catalog/Decode_SonicDyn      1136378 ns      1136290 ns          617 bytes_per_second=1.41565Gi/s testdata/citm_catalog.json
otfcc/Decode_SonicDyn           158508828 ns    158472460 ns            4 bytes_per_second=399.646Mi/s testdata/otfcc.json
fgo/Decode_SonicDyn              67084470 ns     67084360 ns            9 bytes_per_second=692.246Mi/s testdata/fgo.json
......

cyb70289 avatar Aug 05 '24 08:08 cyb70289

See #56: support SVE as a different arch.

xiegx94 avatar Aug 05 '24 08:08 xiegx94

Thanks, I will try to refactor following that PR. Instead of adding a complete SVE implementation, I'm thinking about "inheriting" from NEON and overriding only the code that can benefit from SVE. It looks to me like much of the code will be the same for NEON and SVE.
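
One way such an "inherit and override" layout could look is sketched below. This is an illustration under assumptions, not the structure of the actual patch: the namespaces `arm_common`, `neon` and `sve2_128` and both function names are hypothetical, and the SVE2 body is a portable stand-in so the example compiles anywhere.

```cpp
#include <cstddef>
#include <string_view>

// Code shared by every Arm backend.
namespace arm_common {
inline std::size_t skip_space(std::string_view s, std::size_t pos) {
  while (pos < s.size() && (s[pos] == ' ' || s[pos] == '\t' ||
                            s[pos] == '\n' || s[pos] == '\r')) {
    ++pos;
  }
  return pos;
}
inline std::size_t find_quote(std::string_view s, std::size_t pos) {
  while (pos < s.size() && s[pos] != '"') ++pos;
  return pos;  // s.size() when no quote is found
}
}  // namespace arm_common

// NEON backend: reuses everything from the common code.
namespace neon {
using arm_common::find_quote;
using arm_common::skip_space;
}  // namespace neon

// SVE2 backend: reuses skip_space and overrides only find_quote.
namespace sve2_128 {
using arm_common::skip_space;
inline std::size_t find_quote(std::string_view s, std::size_t pos) {
  // Stand-in body; a real backend would use SVE2 intrinsics here.
  const std::size_t hit = s.find('"', pos);
  return hit == std::string_view::npos ? s.size() : hit;
}
}  // namespace sve2_128
```

Callers pick a backend by namespace, for example `sve2_128::find_quote(json, 0)`, so only the overridden functions need an SVE-specific implementation.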

cyb70289 avatar Aug 05 '24 09:08 cyb70289

@xiegx94, the sve2-128 implementation has been added. The common Arm code has been moved to common/arm_common/. I checked the sonic decoder benchmarks and found no performance regression.

cyb70289 avatar Aug 06 '24 07:08 cyb70289

Is there a convenient way to run the clang-format job locally?

cyb70289 avatar Aug 06 '24 07:08 cyb70289

> Is there a convenient way to run the clang-format job locally?

Could you install clang on your machine? If you have clang-format, run git clang-format.

xiegx94 avatar Aug 06 '24 08:08 xiegx94

> > Is there a convenient way to run the clang-format job locally?
>
> Could you install clang on your machine? If you have clang-format, run git clang-format.

Thanks, the formatting should be fixed now.

cyb70289 avatar Aug 06 '24 08:08 cyb70289

"Test coverage" runs successfully on my local x86 server, so I'm not sure why the CI job fails. It looks like the job is only for x86?

cyb70289 avatar Aug 06 '24 10:08 cyb70289

@cyb70289 please update cmake/set_arch_flags.cmake.

xiegx94 avatar Aug 29 '24 08:08 xiegx94

#93 FYI @cyb70289

xiegx94 avatar Aug 29 '24 08:08 xiegx94

> @cyb70289 please update cmake/set_arch_flags.cmake.

@xiegx94 Updated.

cyb70289 avatar Aug 29 '24 09:08 cyb70289