Enable Arm SVE2 for 128-bit vector targets
TL;DR
This commit enables Halide to compile a pipeline into LLVM IR using the Scalable Vector Extension version 2 (SVE2) of the Armv9-A architecture instead of Neon. LLVM 14+ is required, and the only supported vector length is 128 bits.
For Halide Users
What is this for
In a nutshell, SVE2 is a new SIMD instruction set for Arm CPUs and a superset of SVE and Neon (more details in the above link). Depending on the characteristics of the pipeline you compile with Halide, you may be able to leverage it to boost performance.
Performance uplift might be possible if the pipeline has:
- Gather load
- Scatter store
- Predicated load/store
- Dot products of 16-bit integers, in the same patterns where Neon 8-bit dot products work well
- 64-bit or float16 operations, which are not well supported in Neon
For example, improvements were observed with apps/local_laplacian, apps/bilateral_grid, apps/camera_pipe and apps/nl_means in the Halide repository.
On the other hand, it could result in worse performance than Neon if:
- Widening or narrowing operations are dominant
- The pipeline is vectorized with a factor that is not a multiple of the natural vector size (e.g. 3, 5, 6, 7, 9, etc.)
Usage
To enable this feature, just add the sve2 (Target::SVE2) feature and vector_bits_128 to the Halide Target (e.g. arm-64-android-sve2-vector_bits_128). In terms of Halide scheduling, no API is updated by this PR, so schedule in the same way as for Neon. Other notes on usage are as follows.
- Only 64-bit OSes on the Arm architecture with SVE2 capability are supported.
- The supported vector length of the device is only 128 bits at the moment, which means the `vscale` value of scalable vector types in LLVM is assumed to be 1 at compilation time. A runtime error is generated if the program is executed on a device whose vector length is other than 128 bits.
- LLVM version 14 or later is required for compilation. As of issuing of this PR, SHA1 `43f8a6b74931` is used for verification.
- The `SVE` feature (i.e. without the suffix `2`) is not supported.
- The `NoNeon` feature disables `SVE2` as well.
- The `ARMDotProd` and `ARMFp16` features are enabled implicitly by the `SVE2` feature.
For Halide Maintainers
Some of the key points of this commit are captured below.
Vector Length Agnostic concept of SVE
This PR works as the initial step to enable SVE2, and SME in the future. The target vector length is assumed to be 128 bits at compilation time, aiming to give us a performance uplift on the latest smartphone SoCs with SVE2 capability. The reason for this approach is as follows. SVE is designed as an embodiment of the Vector Length Agnostic (VLA) concept, where the exact vector length (VL) is unknown at compilation time and obtained at run time. In the LLVM IR context, this is called a "scalable" vector. However, the Halide compiler assumes VL is a compile-time fixed value, and that assumption exists in many places in Halide's large software stack. I think it would be a non-trivial technical challenge to incorporate the VLA concept into Halide. On the other hand, from a user's perspective, when scheduling to explore better performance or memory bandwidth, we usually have a specific target processor in mind (i.e. we know the exact VL). Therefore, even with a fixed vector length assumed at compilation time, I would argue that enabling SVE2 features in the Halide backend provides substantial value while keeping the complexity and effort small.
Shuffle Vector
By design, LLVM's shufflevector doesn't accept scalable vectors except with a zero mask. Instead, the llvm.experimental.vector.xxx intrinsics support scalable vectors. However, as of LLVM 14, there are a few non-trivial issues.
- The supported operation patterns are limited (e.g. no intrinsic for interleaving).
- The AArch64 backend doesn't seem mature enough to compile arbitrary cases (e.g. LLVM errors often occur when dealing with vector lane counts that are not powers of two).
Therefore, many tricky workarounds are implemented to process scalable vectors and to avoid LLVM errors; some of them use Arm SVE2 intrinsics.
Unsupported peep-hole patterns in SVE2
Some Arm intrinsics have the same name in Neon and SVE2 but different behavior. The notable ones are widening, narrowing and pairwise operations, which are performed on an even (bottom) and odd (top) lane basis in SVE, but on low and high lanes in Neon. Therefore, peep-hole code-gen of those patterns into SVE2 intrinsics is not enabled for now, because additional interleaving/deinterleaving would be required to restore the element order in a vector.
Workaround for LLVM issues
As of LLVM 14.0.3, LLVM errors often occur with vanilla code-gen for scalable vector types with "unnatural" lane counts. This commit has many workarounds to avoid that by performing code-gen on a natural-lane basis: total_lanes is divided into slices, code-gen is performed for each slice, and the results are concatenated back into total_lanes. The list of LLVM issues is captured in the appendix.
Refactoring of unit tests for Arm SIMD
- `simd_op_check_arm.cpp` is created to merge the Neon test cases in `simd_op_check.cpp` and `float16_t_neon_op_check.cpp`.
- Improved the verification of instructions so that operands are checked against data bit width and number of lanes.
- Added SVE2 test cases.
CMake Tests on emulator
To run test executables on an emulator, the TARGET_EMULATOR CMake variable is added, whose value is passed as part of the argument to the add_test() CMake function. For example, the value is the path to a wrapper script like:
```bash
#!/bin/bash
armie -msve-vector-bits=128 -i libinscount_emulated.so -- "$@"
```
Future work
The following is the list of remaining items and next steps going forward.
- Minor improvements, which may be incorporated after looking into more concrete pipelines in detail
- Remove workarounds for LLVM issues once they are fixed
- Support `vscale` values other than 1 as a target feature (e.g. vector bits of 256, 512, etc.)
- Support more extensions such as SVE (not SVE2), MMLA, and future extensions
Nothing above is committed to be delivered. I'd be happy to hear what others think or want.
Appendix 1) Test results
Setup
| Item | Value |
|---|---|
| Host machine | arm-64-linux, Ubuntu 20.04 on AWS Graviton2 |
| Emulator for SVE | Arm Instruction Emulator |
| LLVM | SHA1 `43f8a6b74931`, Release build |
| Halide build configuration | CMAKE_BUILD_TYPE:Debug, Halide_TARGET:arm-64-linux-sve2 |
| Environment variables in test | HL_NUM_THREADS=1, due to a limitation of the emulator; otherwise .parallel() raises SIGSEGV |
| Command to run tests | ctest -V -C Debug -j14 |
Result
Target with SVE2
97% tests passed, 16 tests failed out of 584
In summary, most of the failures are due to limitations of the emulator environment. From a practical usage perspective, what might affect the end-user experience the most is the issue found in correctness_rfactor.
Detail of failed cases:
| Test Item | Cause |
|---|---|
| correctness_async | Emulator limitation when scheduled with .async() |
| correctness_async_copy_chain | Emulator limitation when scheduled with .async() |
| correctness_atomics | Emulator limitation when scheduled with .async() |
| correctness_parallel_fork | Emulator limitation when scheduled with .async() |
| correctness_rfactor | LLVM Error in compilation; the corresponding issue is #55405. The issue happens only in tuple_specialize_rdom_predicate_rfactor_test() |
| correctness_interleave | LLVM crashes with a weird error after a very long compilation time and huge memory usage. The issue happens only with the last test case of weird size 27, where values with unrealistic lane counts such as <vscale x 729 x i16> are emitted |
| performance_fan_in | Measurement on emulator doesn't make sense |
| performance_fast_inverse | Measurement on emulator doesn't make sense |
| performance_inner_loop_parallel | Measurement on emulator doesn't make sense |
| performance_memcpy | Emulator crashes |
| performance_memory_profiler | Emulator crashes |
| performance_parallel_performance | Measurement on emulator doesn't make sense |
| performance_vectorize | Measurement on emulator doesn't make sense |
| generator_aot_async_parallel | Emulator limitation when scheduled with .async() |
| generator_aot_memory_profiler_mandelbrot | Emulator crashes |
| generator_aot_variable_num_threads | Emulator limitation when scheduled with .parallel() |
Target without SVE2
The target is set as Halide_TARGET: arm-64-linux-arm_dot_prod-arm_fp16. Test execution is performed without the emulator.
99% tests passed, 1 tests failed out of 584
...
The following tests FAILED:
497 - performance_fast_inverse (Failed)
In my setup, the result is the same regardless of this PR.
Appendix 2) LLVM Issues
| Title | Link |
|---|---|
| [AArch64][SVE] Invalid size request on a scalable vector | #54424 |
| [AArch64][SVE] Unable to widen vector store | #54423 |
| [AArch64][SVE] Don't know how to widen the operands for INSERT_SUBVECTOR | #54982 |
| [AArch64][SVE] Error in insert_v4f32_v2f32 | #55037 |
| [AArch64][SVE] Error in bitcast on scalable vector | #55114 |
| [AArch64][SVE] Error in llvm.experimental.stepvector: Do not know how to widen the result of this operator | #55165 |
| [AArch64][SVE] Error in vector_reverse: Do not know how to widen the result of this operator | #55166 |
| [AArch64][SVE] Assertion fails in Casting | #55348 |
| [AArch64][SVE] Assertion fails in inserting sub vector of i1 | #55405 |
| [SVE] Assertion fails in PtrToInt with scalable vector | #55410 |
| [AArch64][SVE] No way to convert scalable vector into fixed sized vector | #55412 |
I'll start looking at this shortly.
Have you looked at the https://github.com/halide/Halide/tree/fixed_length_vectors branch? I've mostly been looking at RISC V recently, but the branch does support SVE2 with longer than 128-bit vectors. I'll need to look this over, but hopefully there isn't much overlap and the mechanism I am using to set vscale to values other than 1 (vector_bits_* target flag) can just be hooked up in this PR to get support for longer vectors.
Does this PR allow generating code for an asserted fixed hardware vscale?
Have you looked at the https://github.com/halide/Halide/tree/fixed_length_vectors branch?
Thank you @zvookin. No for the commits after the branch name was changed long ago, and yes for those before that. I'll have a look, but I guess there will probably be much intersection when we try to support vscale > 1 in the next step.
Does this PR allow generating code for an asserted fixed hardware vscale?
This PR supports only vscale=1. Runtime assertion fails if the program is executed on vscale > 1 hardware.
Yes, that is why I am asking. My PR supports generating code at a specific asserted vscale, which is what one wants for best optimized code targeted at known hardware. The question is whether this is just an assert to worry about or if there are other issues targeting vscale other than 1 but still fixed.
I assume this handles SVE2 implementations with larger vector widths than 128 via predication, which my PR currently does not as it asserts both min and max vscale. Not asserting the max would be pretty easy, but the intended use case is targeting exact sizes.
When you say increasing vscale greater than 1, is your idea that this will need to support arbitrary vscale to do that? Our impression was this is a great deal of work or a significant performance hit to do in general, but would be interested in your thoughts.
@zvookin This PR works as the initial step, where currently the target vector length is assumed to be 128 bits at compilation time. Targeting a VL other than 128 bits will break somewhere in complex vector operations such as shuffle. Actually, it is not tested on targets other than 128-bit VL. The runtime assertion is there to guard the user from such troubles.
I think it would be a non-trivial technical challenge to incorporate the complete VLA (Vector Length Agnostic) concept into Halide. So the viable approach would be to assume vscale is a compile-time fixed value, even for vscale other than 1.
High level, I want to have a vscale PR separated out from either SVE or RISC-V vector support. Then we can have separate PRs for each side. Probably the easiest way to do that is for me to put up a PR combining what's in fixed_length_vectors and what is here. I don't think the support here is quite right as it seems to assume vscale of 1 is 128-bits and does not appear to assert the vscale range on functions. But perhaps I am wrong.
On RISC V, I deal with shuffles by coercing to fixed vector, but this may require setting the fixed vector size options which are global and architecture specific. I'm pretty sure that method worked on SVE2 at one time, but I'd have only verified that it compiled to something, not run it under a simulator.
I agree with introducing vector bits in Target as the initial step. I thought we would start by enabling full functionality of 128 bits for commercialized SoCs as the first step, and add other vscale values as a next step over time. But now that you are working on 256 bits or more, I realize the 128-bit assumption might potentially cause a conflict somewhere in commonly used code. One of the mitigations I came up with is to modify the diff of CodeGen_LLVM so that, in cases where the 128-bit assumption could cause a problem, we do one of the following:
- Move it to CodeGen_ARM
- Guard with target.vector_bits condition
- Make it work for vector lengths other than 128 bits
Does it make sense?
I previously explored some different approaches. One was to use the aarch64-sve-vector-bits-max/min options, which turned out to be aimed at 256-bit and 512-bit only, and don't work for 128 bits. The other was using fixed-sized vectors and converting into/from scalable vectors, which turned out not to be at a level of maturity we can rely on, as I frequently faced "Extracting a fixed-length vector from an illegal scalable vector is not yet supported" (#55412).
@zvookin Maybe I should rebase this PR on #6786 . Other than that, do you have any updates about your views to this change? It would be appreciated if you give me your thoughts on the question in my previous comment.
See https://github.com/halide/Halide/pull/6802 . It may be useful to put the concat_vectors and reverse_vectors support in this PR, though I believe what those routines do works here by converting for fixed vectors and back for shuffle_vectors. Whether that generates good code is an open question.
I did not put the code that sets the backend specific fixed length vector flags in that PR as I hope it will not be necessary. The PR does assert the vscale range on functions.
Let me know if there are any issues or if this doesn't look workable. Idea is both to factor the pull requests into more reviewable chunks and to make sure the SVE and RISC V Vector support line up together.
(Just catching up on existing PRs...) where does this stand? IIRC there were pieces that we wanted to break out and land separately.
My idea for the next step is to rebase this PR on top of other PRs for vscale vectors (#6786 and #6802) and modify the code to align with them (i.e. CodeGen_LLVM/Internal). Sorry for not updating, as I had been away from the work and also was waiting for #6802.
No worries, I also have been away :-)
This PR is ready to review. Commits have been rebased onto the main branch as of 5th July. The LLVM used for this work is updated to SHA1:43f8a6b74931.
I've reorganized the commits so that each commit implements a single topic, so reviewers don't need to look across commits.
@zvookin, there are some updates on the topics we have discussed, which are captured below, and I'm happy to incorporate your feedback.
Supported Vector length
This PR supports SVE2 with only a 128-bit vector length (vscale=1). Other cases will be added over time. That said, the changes in CodeGen_LLVM.cpp/h should also work with vscale values other than 1, aiming not to break the ongoing work for 256 bits etc.
Shuffle vectors
As mentioned in this commit, I didn't take the approach of performing shuffle operations on fixed-sized vectors by adding conversions between scalable and fixed-sized vectors. The reason is that the conversion results in an LLVM Error in most cases except for natural-size vectors. Even if that error is fixed at some point, the conversion would only be possible via load/store to memory, which would presumably perform poorly.
Helper APIs for scalable vector code-gen
As mentioned in this commit, the APIs from #6802 are modified so that effective_vscale is taken into account implicitly in the APIs of CodeGen_LLVM. More specifically, the APIs of CodeGen_Internal require an explicit effective_vscale argument, while the APIs of CodeGen_LLVM don't have that argument and use the one cached in a member variable, aiming to keep caller code simple.
https://github.com/llvm/llvm-project/issues/56431
Monday Morning Review Ping -- where does this PR stand?
@steven-johnson I'm waiting for review feedback and approval, if I understand the situation right. If there is anything on my side that could accelerate the review, I would appreciate hearing it.
Recent changes in main branch have been incorporated. Test results were the same as before.
I missed the update for a while, but now I'm more than happy to see this landed finally! Many thanks for all the efforts to make this happen🎉