SIMD swap
Used SIMD to improve buffer swap performance, including the APIs below:
- `buffer.swap16()`
- `buffer.swap32()`
- `buffer.swap64()`
According to the Node.js buffer API design, only buffers larger than 128 bytes go into the C++-level implementation, so this optimization applies only to that case. The optimization takes effect on x86 platforms that support the AVX512VBMI instruction set, e.g. Cannon Lake and Ice Lake, including Ice Lake-based AWS EC2 instances.
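For reference, this is the user-facing behavior the optimization must preserve (a quick sketch of the API semantics; the 128-byte threshold itself is internal and not observable here):

```javascript
// buffer.swapN() reverses byte order in place within each (N/8)-byte unit.
const buf = Buffer.from([0x01, 0x02, 0x03, 0x04]);

buf.swap16(); // reverse each 2-byte unit
console.log(buf); // <Buffer 02 01 04 03>

buf.swap16(); // restore the original order
buf.swap32(); // reverse the whole 4-byte unit
console.log(buf); // <Buffer 04 03 02 01>
```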
The code is placed in the deps/ folder as it is platform dependent.
To achieve the best performance, it adopts the AVX512 permute instruction and uses inlining instead of a function call, because the SIMD logic is very short.
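Conceptually, the permute applies one fixed byte-index shuffle per element width. A scalar JavaScript model of that mapping is sketched below (`swapBytes` is a hypothetical name for illustration, not part of the patch; the SIMD version does the same reindexing for 64 bytes in a single instruction):

```javascript
// Scalar model of the shuffle the SIMD permute performs: output byte i takes
// input byte (i - i % w) + (w - 1 - i % w), where w is the element width in
// bytes (2 for swap16, 4 for swap32, 8 for swap64).
function swapBytes(buf, w) {
  const out = Buffer.alloc(buf.length);
  for (let i = 0; i < buf.length; i++) {
    out[i] = buf[i - (i % w) + (w - 1 - (i % w))];
  }
  return out;
}

const sample = Buffer.from([1, 2, 3, 4, 5, 6, 7, 8]);
console.log(swapBytes(sample, 4)); // <Buffer 04 03 02 01 08 07 06 05>
```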
The Node.js buffer swap benchmark shows big improvements after applying this optimization, varying from ~30% to ~600%.
node-benchmark-compare

```
confidence improvement accuracy (*) (**) (***)
buffers/buffer-swap.js n=1000000 len=1024 method='swap16' aligned='false' *** 70.56 % ±3.32% ±4.56% ±6.22%
buffers/buffer-swap.js n=1000000 len=1024 method='swap16' aligned='true' *** 60.85 % ±3.78% ±5.19% ±7.09%
buffers/buffer-swap.js n=1000000 len=1024 method='swap32' aligned='false' *** 254.48 % ±3.07% ±4.40% ±6.44%
buffers/buffer-swap.js n=1000000 len=1024 method='swap32' aligned='true' *** 254.36 % ±2.71% ±3.88% ±5.68%
buffers/buffer-swap.js n=1000000 len=1024 method='swap64' aligned='false' *** 132.66 % ±3.45% ±4.95% ±7.27%
buffers/buffer-swap.js n=1000000 len=1024 method='swap64' aligned='true' *** 135.83 % ±4.25% ±6.09% ±8.91%
buffers/buffer-swap.js n=1000000 len=2056 method='swap16' aligned='false' *** 93.81 % ±2.70% ±3.71% ±5.06%
buffers/buffer-swap.js n=1000000 len=2056 method='swap16' aligned='true' *** 85.19 % ±3.74% ±5.26% ±7.45%
buffers/buffer-swap.js n=1000000 len=2056 method='swap32' aligned='false' *** 346.51 % ±4.85% ±6.90% ±9.98%
buffers/buffer-swap.js n=1000000 len=2056 method='swap32' aligned='true' *** 334.75 % ±7.03% ±10.10% ±14.84%
buffers/buffer-swap.js n=1000000 len=2056 method='swap64' aligned='false' *** 200.37 % ±3.26% ±4.68% ±6.88%
buffers/buffer-swap.js n=1000000 len=2056 method='swap64' aligned='true' *** 202.17 % ±4.13% ±5.94% ±8.73%
buffers/buffer-swap.js n=1000000 len=256 method='swap16' aligned='false' *** 32.23 % ±6.39% ±9.04% ±12.94%
buffers/buffer-swap.js n=1000000 len=256 method='swap16' aligned='true' *** 27.76 % ±3.85% ±5.31% ±7.29%
buffers/buffer-swap.js n=1000000 len=256 method='swap32' aligned='false' *** 82.32 % ±4.38% ±6.08% ±8.44%
buffers/buffer-swap.js n=1000000 len=256 method='swap32' aligned='true' *** 82.85 % ±12.90% ±18.01% ±25.27%
buffers/buffer-swap.js n=1000000 len=256 method='swap64' aligned='false' *** 42.91 % ±2.00% ±2.74% ±3.75%
buffers/buffer-swap.js n=1000000 len=256 method='swap64' aligned='true' *** 44.30 % ±2.94% ±4.03% ±5.50%
buffers/buffer-swap.js n=1000000 len=64 method='swap16' aligned='false' -0.01 % ±0.35% ±0.48% ±0.66%
buffers/buffer-swap.js n=1000000 len=64 method='swap16' aligned='true' -0.05 % ±0.19% ±0.26% ±0.36%
buffers/buffer-swap.js n=1000000 len=64 method='swap32' aligned='false' 0.38 % ±1.76% ±2.42% ±3.30%
buffers/buffer-swap.js n=1000000 len=64 method='swap32' aligned='true' 0.72 % ±2.06% ±2.82% ±3.85%
buffers/buffer-swap.js n=1000000 len=64 method='swap64' aligned='false' 0.44 % ±1.00% ±1.37% ±1.86%
buffers/buffer-swap.js n=1000000 len=64 method='swap64' aligned='true' -0.36 % ±0.48% ±0.66% ±0.90%
buffers/buffer-swap.js n=1000000 len=768 method='swap16' aligned='false' *** 62.04 % ±5.14% ±7.21% ±10.20%
buffers/buffer-swap.js n=1000000 len=768 method='swap16' aligned='true' *** 50.11 % ±4.66% ±6.43% ±8.86%
buffers/buffer-swap.js n=1000000 len=768 method='swap32' aligned='false' *** 185.22 % ±5.40% ±7.75% ±11.38%
buffers/buffer-swap.js n=1000000 len=768 method='swap32' aligned='true' *** 186.12 % ±2.80% ±3.94% ±5.59%
buffers/buffer-swap.js n=1000000 len=768 method='swap64' aligned='false' *** 107.24 % ±2.48% ±3.56% ±5.24%
buffers/buffer-swap.js n=1000000 len=768 method='swap64' aligned='true' *** 109.04 % ±5.19% ±7.45% ±10.96%
buffers/buffer-swap.js n=1000000 len=8192 method='swap16' aligned='false' *** 104.20 % ±2.08% ±2.85% ±3.89%
buffers/buffer-swap.js n=1000000 len=8192 method='swap16' aligned='true' *** 176.63 % ±58.59% ±84.16% ±123.81%
buffers/buffer-swap.js n=1000000 len=8192 method='swap32' aligned='false' *** 450.33 % ±4.58% ±6.57% ±9.65%
buffers/buffer-swap.js n=1000000 len=8192 method='swap32' aligned='true' *** 618.62 % ±161.33% ±231.77% ±340.97%
buffers/buffer-swap.js n=1000000 len=8192 method='swap64' aligned='false' *** 264.54 % ±3.58% ±5.15% ±7.58%
buffers/buffer-swap.js n=1000000 len=8192 method='swap64' aligned='true' *** 384.05 % ±110.97% ±159.42% ±234.52%
```
Review requested:
- [ ] @nodejs/gyp
I'm -1 on this for two main reasons:
- It's limited to a single, new SIMD instruction that isn't widespread yet
- I'm not convinced that swapping bytes is a common enough operation to warrant the level of optimization being made here
Additional notes:
- It's missing a license and the changes in tools/ for including said license if this is a true 3rd party dependency
- Using appropriate pragmas instead of compiler flags is the way to go (as was done with deps/zlib)
The code is not from a 3rd party like base64 or zlib, so there is no 3rd-party license. This patch optimizes util.h itself to make byte swap faster. Byte swap is useful in endian conversion and networking scenarios.
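As a concrete example of the endian-conversion use case (the payload value here is made up for illustration):

```javascript
// A big-endian wire payload holding the 32-bit value 258 (0x00000102).
const wire = Buffer.from([0x00, 0x00, 0x01, 0x02]);
console.log(wire.readUInt32BE(0)); // 258

// One in-place swap32() converts the whole payload to little-endian host
// order, after which plain little-endian reads yield the same values.
wire.swap32();
console.log(wire.readUInt32LE(0)); // 258
```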
For your first comment, I'm not sure whether the Node.js project has criteria for platform-dependent code. Actually, AVX512VBMI is an "old" instruction set introduced four years ago, and it is widespread nowadays, e.g. in AWS cloud instances. On the other hand, this is a performance optimization that would not degrade performance on platforms that do not support AVX512VBMI. If you'd advise supporting more SIMD instruction sets such as AVX512F or AVX2, I think it's OK to add them later.
> Byte swap is useful in endian conversion
Right, but I don't see that as a common operation that would be a bottleneck in node applications.
Are there criteria for what a performance PR should be? In general, I'm looking for parallel-programming opportunities in every aspect to make Node.js run at its best performance.
Here are a few general suggestions:
- Look for any performance-related TODOs throughout the node codebase
- Consider submitting performance improvements for node's existing dependencies -- not only will that potentially benefit node, but perhaps other users as well
- Research common/popular use cases for node, break down the potential bottlenecks (e.g. network I/O, thread pool consumption, CPU usage, JS<->C++ boundary crossing, etc.) that can arise from those use cases, and see if any of them can be improved by changing the relevant parts in node
Offhand, if I had to pick a specific feature that would be worth optimizing, it would be accelerated base64 encoding with the base64url alphabet, or base64 decoding. Arguably these two would need to be (perhaps at least partially) supported upstream by the recently added base64 library, so if you want to tackle either of them you may want to start there.
I already submitted a PR to improve base64 encoding in the new 3rd-party base64-simd project, although base64 decoding has no feasible path due to the special ignored characters allowed by Node.js base64.
I'm not sure why base64 meets the criteria. It is also a newly added dependency, and base64's usage scope is similar to byte swap's. Are there complaints that base64 has performance issues in real-world workloads?
> Research common/popular use cases for node, break down the potential bottlenecks [..] and see if any of them can be improved by changing the relevant parts in node
I want to second this suggestion. The way I approach it is by profiling things like Express apps, create-react-app, the TypeScript compiler, etc. and follow where the numbers lead me.
I have similar concerns. Adding new things to `deps` is a tough decision, and I don't expect `swapX()` to be used much, especially since these methods do not exist on `Uint8Array`.
Based on your comments, if I would like to contribute this optimization, should I add the simd_swap logic into util-inl.h directly instead of adding a new deps folder? Then there would be no concern about affecting the deps folder, and simd_swap is just a pre-processing step for buffers to be swapped.
> Research common/popular use cases for node, break down the potential bottlenecks [..] and see if any of them can be improved by changing the relevant parts in node
I want to second this suggestion. The way I approach it is by profiling things like Express apps, create-react-app, the TypeScript compiler, etc. and follow where the numbers lead me.
I agree that using real workloads is also a good way to improve Node.js performance. I'm not against that, and I'm working on real workloads like Ghost and microservices too. The guidance I follow in my performance work is that every micro-benchmark is equally important as long as it exists in the Node.js benchmark suite. I believe the purpose of the benchmarks maintained in the Node.js repo is to evaluate the performance of each module at a low level, instead of using big real-world workloads directly. Node.js is a programming platform, and users can invoke every API, both now and in future development. We cannot predict that an API like buffer.swap64() will be used less than other APIs without limiting users.
Don't attach too much weight to the benchmarks in the repo. Their primary function is to catch performance regressions. That they can be used to show performance improvements is just a nice secondary function.
After taking all the considerations from the review comments, I changed this patch as follows:
- The patch is too big in that it adds a deps folder -> I moved all the code changes into the util-inl.h file and did not touch any other folders or files.
- The patch only adopted an AVX512 SIMD instruction, which is not widespread -> I enabled an SSSE3 version and an AVX512 version at the same time, so all popular platforms can benefit from this patch; most known platforms support SSSE3.
- Do not modify the gyp file -> I removed all code changes in the gyp file and used attributes in the source file, as deps/zlib does.
For all platforms, the maximum performance gain is >5x.
For the platforms supporting AVX512, the patch achieved the best performance gains; the comparison charts are below:

For the platforms supporting only SSE, it also achieved good performance; charts below:
