Enable Arm SVE2 for 128-bit vector targets
TL;DR
This commit enables Halide to compile a pipeline into LLVM IR using the Scalable Vector Extension version 2 (SVE2) of the Armv9-A architecture instead of Neon. LLVM 14+ is required, and the only supported vector length is 128 bits.
For Halide Users
What is this for
In a nutshell, SVE2 is a new SIMD instruction set for Arm CPUs and a superset of SVE and Neon (more details in the above link). Depending on the characteristics of the pipeline you compile with Halide, you may be able to leverage it to boost performance.
Performance uplift might be possible if the pipeline has:
- Gather load
- Scatter store
- Predicated load/store
- Dot products of 16-bit integers, in the same patterns where Neon 8-bit dot products work well
- 64-bit or float16 operations, which are not well supported in Neon
For example, improvements were observed with apps/local_laplacian, apps/bilateral_grid, apps/camera_pipe and apps/nl_means in the Halide repository.
On the other hand, it could result in worse performance than Neon if:
- Widening or narrowing operations are dominant
- The pipeline is vectorized with a factor that is not a multiple of the natural vector size (e.g. 3, 5, 6, 7, 9, etc.)
Usage
To enable this feature, just add the sve2 (Target::SVE2) feature and vector_bits_128 to the Halide Target (e.g. arm-64-android-sve2-vector_bits_128). In terms of Halide scheduling, no API is updated by this PR, so schedule in the same way as for Neon. Other notes on usage are as follows.
- Only 64-bit OSes on the Arm architecture with SVE2 capability are supported.
- The supported vector length of the device is only 128 bits at the moment, which means the `vscale` value of scalable vector types in LLVM is assumed to be 1 at compilation time. A runtime error is generated if the program is executed on a device whose vector length is other than 128 bits.
- LLVM version 14 or later is required for compilation. As of issuing of this PR, SHA1 `43f8a6b74931` is used for verification.
- The `SVE` feature (i.e. without the suffix `2`) is not supported.
- The `NoNeon` feature disables `SVE2` as well.
- The `ARMDotProd` and `ARMFp16` features are enabled implicitly by the `SVE2` feature.
For Halide Maintainers
Some of the key points of this commit are captured below.
Vector Length Agnostic concept of SVE
This PR works as the initial step to enable SVE2, and SME in the future. The target vector length is assumed to be 128 bits at compilation time, aiming to give us a performance uplift on the latest smartphone SoCs with SVE2 capability. The reason for this approach is as follows. SVE is designed as an embodiment of the Vector Length Agnostic (VLA) concept, where the exact vector length (VL) is unknown at compilation time and obtained at run time. In the LLVM IR context, this is called a "scalable" vector. However, the Halide compiler assumes VL is a compile-time fixed value, and that assumption exists in many places in Halide's large software stack. I think it would be a non-trivial technical challenge to incorporate the VLA concept into Halide. On the other hand, from a user's perspective, when scheduling to explore better performance or memory bandwidth, we usually have a specific target processor in mind (i.e. we know the exact VL). Therefore, even with a fixed vector length assumed at compilation time, I would argue that enabling SVE2 features in the Halide backend provides substantial value while keeping the complexity and effort small.
Shuffle Vector
By design, LLVM's shufflevector doesn't accept scalable vectors except with a zero mask. Instead, the llvm.experimental.vector.xxx intrinsics support scalable vectors. However, as of LLVM 14, there are a few non-trivial issues.
- The supported operation patterns are limited (e.g. no intrinsic for interleaving).
- The AArch64 backend doesn't seem mature enough to compile arbitrary cases (e.g. LLVM errors often occur when dealing with vector lane counts that are not powers of two).
Therefore, many tricky workarounds are implemented to process scalable vectors and to avoid LLVM errors; some of them use Arm SVE2 intrinsics.
Unsupported peep-hole patterns in SVE2
Some Arm intrinsics have the same name in Neon and SVE2 but different behavior. The notable ones are widening, narrowing and pairwise operations, which are performed on an even (bottom) and odd (top) lane basis in SVE, but on low and high lanes in Neon. Therefore, peep-hole code-gen of those patterns into SVE2 intrinsics is not enabled for now, because additional interleaving/deinterleaving would be required to restore the element order in a vector.
Workaround for LLVM issues
As of LLVM 14.0.3, LLVM errors often occur with vanilla code-gen for scalable vector types with "unnatural" lane counts. This commit has many workarounds to avoid that by performing code-gen on a natural-lane basis: total_lanes is divided into slices, code-gen is performed for each slice, and the results are concatenated back into total_lanes. The list of LLVM issues is captured in the appendix.
Refactoring of unit tests for Arm SIMD
- `simd_op_check_arm.cpp` is created to merge the Neon test cases in `simd_op_check.cpp` and `float16_t_neon_op_check.cpp`.
- Improved the verification of instructions so that operands are checked against data bit width and number of lanes.
- Added SVE2 test cases.
CMake Tests on emulator
To run test executables on an emulator, the TARGET_EMULATOR CMake variable is added, whose value is passed as part of the argument to the add_test() CMake function. For example, the value is the path to a wrapper script like:
```bash
#!/bin/bash
armie -msve-vector-bits=128 -i libinscount_emulated.so -- "$@"
```
Future work
The following is the list of remaining items and next steps going forward.
- Minor improvements, which may be incorporated after looking into more concrete pipelines in detail
- Remove workarounds for LLVM issues once they are fixed
- Support `vscale` values other than 1 as a target feature (e.g. vector bits of 256, 512, etc.)
- Support more extensions such as SVE (not SVE2), MMLA, and future extensions
Nothing above is committed to be delivered. I'd be happy to hear what others think or want.
Appendix 1) Test results
Setup
| Item | Value |
|---|---|
| Host machine | arm-64-linux, Ubuntu 20.04 on AWS Graviton2 |
| Emulator for SVE | Arm Instruction Emulator |
| LLVM | SHA1 `43f8a6b74931`, Release build |
| Halide build configuration | CMAKE_BUILD_TYPE:Debug, Halide_TARGET:arm-64-linux-sve2 |
| Environment variables in test | HL_NUM_THREADS=1, due to a limitation of the emulator; otherwise .parallel() raises SIGSEGV |
| Command to run tests | ctest -V -C Debug -j14 |
Result
Target with SVE2
97% tests passed, 16 tests failed out of 584
In summary, most of the failures are due to limitations of the emulator environment. From a practical usage perspective, what might affect the end-user experience the most is the issue found in correctness_rfactor.
Detail of failed cases:
| Test Item | Cause |
|---|---|
| correctness_async | Emulator limitation when scheduled with .async() |
| correctness_async_copy_chain | Emulator limitation when scheduled with .async() |
| correctness_atomics | Emulator limitation when scheduled with .async() |
| correctness_parallel_fork | Emulator limitation when scheduled with .async() |
| correctness_rfactor | LLVM Error in compilation; the corresponding issue is #55405. The issue happens only in tuple_specialize_rdom_predicate_rfactor_test() |
| correctness_interleave | LLVM crashes with a weird error after a very long compilation time and huge memory usage. The issue happens only with the last test case of weird size 27, where values with unrealistic lane counts such as <vscale x 729 x i16> are emitted |
| performance_fan_in | Measurement on emulator doesn't make sense |
| performance_fast_inverse | Measurement on emulator doesn't make sense |
| performance_inner_loop_parallel | Measurement on emulator doesn't make sense |
| performance_memcpy | Emulator crashes |
| performance_memory_profiler | Emulator crashes |
| performance_parallel_performance | Measurement on emulator doesn't make sense |
| performance_vectorize | Measurement on emulator doesn't make sense |
| generator_aot_async_parallel | Emulator limitation when scheduled with .async() |
| generator_aot_memory_profiler_mandelbrot | Emulator crashes |
| generator_aot_variable_num_threads | Emulator limitation when scheduled with .parallel() |
Target without SVE2
The target is set as Halide_TARGET: arm-64-linux-arm_dot_prod-arm_fp16. Test execution is performed without the emulator.
99% tests passed, 1 tests failed out of 584
...
The following tests FAILED:
497 - performance_fast_inverse (Failed)
In my setup, the result is the same regardless of this PR.
Appendix 2) LLVM Issues
| Title | Link |
|---|---|
| [AArch64][SVE] Invalid size request on a scalable vector | #54424 |
| [AArch64][SVE] Unable to widen vector store | #54423 |
| [AArch64][SVE] Don't know how to widen the operands for INSERT_SUBVECTOR | #54982 |
| [AArch64][SVE] Error in insert_v4f32_v2f32 | #55037 |
| [AArch64][SVE] Error in bitcast on scalable vector | #55114 |
| [AArch64][SVE] Error in llvm.experimental.stepvector: Do not know how to widen the result of this operator | #55165 |
| [AArch64][SVE] Error in vector_reverse: Do not know how to widen the result of this operator | #55166 |
| [AArch64][SVE] Assertion fails in Casting | #55348 |
| [AArch64][SVE] Assertion fails in inserting sub vector of i1 | #55405 |
| [SVE] Assertion fails in PtrToInt with scalable vector | #55410 |
| [AArch64][SVE] No way to convert scalable vector into fixed sized vector | #55412 |
I'll start looking at this shortly.
Have you looked at the https://github.com/halide/Halide/tree/fixed_length_vectors branch? I've mostly been looking at RISC V recently, but the branch does support SVE2 with longer than 128-bit vectors. I'll need to look this over, but hopefully there isn't much overlap and the mechanism I am using to set vscale to values other than 1 (vector_bits_* target flag) can just be hooked up in this PR to get support for longer vectors.
Does this PR allow generating code for an asserted fixed hardware vscale?
Have you looked at the https://github.com/halide/Halide/tree/fixed_length_vectors branch?
Thank you @zvookin. No for the commits after the branch name was changed long ago, and yes for those before that. I'll have a look, but I guess there will probably be much intersection when we try to support vscale > 1 in the next step.
Does this PR allow generating code for an asserted fixed hardware vscale?
This PR supports only vscale=1. Runtime assertion fails if the program is executed on vscale > 1 hardware.
Yes, that is why I am asking. My PR supports generating code at a specific asserted vscale, which is what one wants for best optimized code targeted at known hardware. The question is whether this is just an assert to worry about or if there are other issues targeting vscale other than 1 but still fixed.
I assume this handles SVE2 implementations with larger vector widths than 128 via predication, which my PR currently does not as it asserts both min and max vscale. Not asserting the max would be pretty easy, but the intended use case is targeting exact sizes.
When you say increasing vscale greater than 1, is your idea that this will need to support arbitrary vscale to do that? Our impression was this is a great deal of work or a significant performance hit to do in general, but would be interested in your thoughts.
@zvookin This PR works as the initial step, where currently the target vector length is assumed to be 128 bits at compilation time. Targeting a VL other than 128 bits will break somewhere in complex vector operations such as shuffle. Actually, it is not tested on targets other than 128-bit VL. The runtime assertion is there to guard the user from such troubles.
I think it would be a non-trivial technical challenge to incorporate the complete VLA (Vector Length Agnostic) concept into Halide. So the viable approach would be to assume vscale is a compile-time fixed value, even for vscale other than 1.
High level, I want to have a vscale PR separated out from either SVE or RISC-V vector support. Then we can have separate PRs for each side. Probably the easiest way to do that is for me to put up a PR combining what's in fixed_length_vectors and what is here. I don't think the support here is quite right as it seems to assume vscale of 1 is 128-bits and does not appear to assert the vscale range on functions. But perhaps I am wrong.
On RISC V, I deal with shuffles by coercing to fixed vector, but this may require setting the fixed vector size options which are global and architecture specific. I'm pretty sure that method worked on SVE2 at one time, but I'd have only verified that it compiled to something, not run it under a simulator.
I agree with introducing vector bits in Target as the initial step. I thought we would start by enabling full functionality of 128 bits for commercialized SoCs as the first step, and add other vscale values as a next step over time. But now that you are working on 256 bits or more, I realize the 128-bit assumption might potentially cause a conflict somewhere in commonly used code. One of the mitigations I came up with is to modify the diff of CodeGen_LLVM so that, in cases where the 128-bit assumption could cause a problem, we do one of the following:
- Move it to CodeGen_ARM
- Guard with target.vector_bits condition
- Make it work for vector lengths other than 128 bits
Does it make sense?
I previously explored some different approaches. One was to use the aarch64-sve-vector-bits-max/min options, which turned out to be aimed at 256-bit and 512-bit only, and don't work for 128 bits. The other was using fixed-sized vectors and converting into/from scalable vectors, which turned out not to be at a level of maturity we can rely on, as I frequently faced "Extracting a fixed-length vector from an illegal scalable vector is not yet supported" (#55412).
@zvookin Maybe I should rebase this PR on #6786 . Other than that, do you have any updates about your views to this change? It would be appreciated if you give me your thoughts on the question in my previous comment.
See https://github.com/halide/Halide/pull/6802 . It may be useful to put the concat_vectors and reverse_vectors support in this PR, though I believe what those routines do works here by converting for fixed vectors and back for shuffle_vectors. Whether that generates good code is an open question.
I did not put the code that sets the backend specific fixed length vector flags in that PR as I hope it will not be necessary. The PR does assert the vscale range on functions.
Let me know if there are any issues or if this doesn't look workable. Idea is both to factor the pull requests into more reviewable chunks and to make sure the SVE and RISC V Vector support line up together.
(Just catching up on existing PRs...) where does this stand? IIRC there were pieces that we wanted to break out and land separately.
My idea for the next step is to rebase this PR on top of other PRs for vscale vectors (#6786 and #6802) and modify the code to align with them (i.e. CodeGen_LLVM/Internal). Sorry for not updating, as I had been away from the work and also was waiting for #6802.
No worries, I also have been away :-)
This PR is ready to review. Commits have been rebased onto the main branch as of 5th July. The LLVM used for this work is updated to SHA1:43f8a6b74931.
I've reorganized the commits so that each commit implements a single topic, so reviewers don't need to look across commits.
@zvookin, there are some updates on the topics we have discussed, which are captured below, and I'm happy to incorporate your feedback.
Supported Vector length
This PR supports SVE2 with only a 128-bit vector length (vscale=1). Other cases will be added over time. That said, the changes in CodeGen_LLVM.cpp/h should also work with vscale values other than 1, aiming not to break the ongoing work for 256 bits etc.
Shuffle vectors
As mentioned in this commit, I didn't take the approach of performing shuffle operations on fixed-sized vectors by adding conversions between scalable and fixed-sized vectors. The reason is that the conversion results in an LLVM Error in most cases except for natural-size vectors. Even if that error is fixed at some point, the conversion would only be possible via load/store to memory, which would presumably perform poorly.
Helper APIs for scalable vector code-gen
As mentioned in this commit, the APIs from #6802 are modified so that effective_vscale is taken into account implicitly in the APIs of CodeGen_LLVM. More specifically, the APIs of CodeGen_Internal require an explicit effective_vscale argument, while the APIs of CodeGen_LLVM don't have that argument and use the one cached in a member variable, aiming to keep caller code simple.
https://github.com/llvm/llvm-project/issues/56431
Monday Morning Review Ping -- where does this PR stand?
@steven-johnson I'm waiting for review feedback and approval, if I understand the situation right. If there is anything on my side that could accelerate the review, I would appreciate hearing it.
Recent changes in main branch have been incorporated. Test results were the same as before.
I missed the update for a while, but now I'm more than happy to see this landed finally! Many thanks for all the efforts to make this happen🎉