OpenBLAS risc-v vector v1.0 support

The discussions in #4049 inspire me to creat an issue for further discussions. Differ from commercial ISAs, which have a clear development plan, the total amount of products supporting rvv may be large. Optimization for all individual products may lead to code bloat, and is contrary to the purpose of the vector isa, which is expected to be length-adaptive. Until now the intrinsic spec of rvv 1.0 is stable enough to develop codes, and the support of rvv 1.0 has been fully submitted to openblas, based on sifive x280, an in-order cpu with vlen=512. Would it be better to do more development, based on this x280 version? The final destination may be the compatibility in different vlen, instruction execution order, tail/mask policy. Of course the pursuing of compatibility may lead to suboptimum performance, a balance have to be considered. There are some cpu specified features in kernels of x280 and may lead to incorrect results in other cpus. List as following

Architecture specified cflags, such as -riscv-v-vector-bits-min=512 and -ffast-math.
Changing vl in a loop, leading to tail cleared without tail undisturbed setted. Such as vl = VSETVL(k); in symv_L_rvv.c, line 96.
Set vl by immediate value under the assumption of vlen=512. Such as size_t vl = 8; in gemm_tcopy_8_rvv.c, line 84.

In addition to above, the registers tiling in gemm of different vlen should be considered. Now we set GEMM_UNROLL_N_SHIFT 8, which may waste other vector registers. 12 or 14 may be better?

May 22 '23 08:05 sh-zheng

And I would like to make some PRs to solve the problem 2 firstly. There are three ways to relieve the implicit tail undisturbed assumptions in in-place accumulations.

Split the loop into the main-loop and the remainder, and make reductions twice.
Use explicit _tu suffix in the remainder accumulations.
Generate mask with all 0 in tail components, and protect the tail elements in remainder accumulations.

Maybe the explicit _tu suffix leads to the least code changes and is the best choice.

May 22 '23 15:05 sh-zheng

Not sure what happened to the lengthy comment I wrote yesterday ... anyway #3808 is related , and I'm not entirely sure the intrinsics API is "stable enough to develop codes" already. In general, it seems more important to me to have working, performant kernels for actual available hardware. A more generally usable version could then be derived from one such set, be it x280 or something else, at a later date (something like the ARMV8SVE target in aarch64). If you would like to start this now I'm fine with it of course, but there may be a few rounds of fixes necessary later if the APIs continue to change. (Also it probably makes little sense to try to optimize to a virtual target provided by qemu, unless there is actual hardware with that characteristics coming to market) We do have the RISCV64GENERIC target for otherwise unsupported hardware, and while this does not support RVV in itself, compiler optimization may include autovectorization at some point. In any case I do not think it would be a good idea to change the existing x280 kernels as such - especially if you plan to introduce code that is unnecessary on the x280. If anything, copy them to new names, and apply your changes in the copy.

May 23 '23 11:05 martin-frbg

I agree. Indeed we should keep the x280 kernels unchanged and apply changes in a new copy. Now the isa spec of rvv has been stable, but the intrinsic spec has not. So it may be better to wait a little longer until the intrinsic spec be stable. I will follow the progress of the spec.

May 23 '23 13:05 sh-zheng

@sh-zheng, our intent when creating this rvv.c versions using the C intrinsics was that it adhere to the risc-v vector 1.0 support and not be x280 specific. However, we did want to fully exploit the available RVV 1.0 features. We had the same quandary as you did when we initially reviewed the _vector.c implementations, and then chose to create a separate implementation. Perhaps in time this can all converge. We will continue to update as the C intrinsic spec is finalized with hopefully not too many more changes. I'm aware of one more change forthcoming to the segment load/store intrinsics.

Our intent was that the implementation should work for various VLENs however as you've pointed out in #3 (vl=8) that in some cases we've tried to maximize the performance given the x280's VLEN=512. So, some alternate implementations may be required there. GEMM is such an important performance metric and we tried to maximize, without reverting to assembler. While we developed in QEMU, we tested in hardware emulation to measure actual performance on real RTL and validate correctness, since we do not yet have silicon. Once we do have silicon we will do a full performance/test sweep again.

For the compiler flags noted in #1, those are used by the clang/llvm auto-vectorizer on the generic C code and should not have an impact on the C intrinsic code. More recently I believe those flags are being standardized through the use of 'zvl512b' in the -march string.

One point #2 'Changing vl in a loop, leading to tail cleared without tail undisturbed' I'm a little surprised that this would be x280 specific and not RVV standard. I'll go review. To me it would seem like a flaw if it is not prescribed in the standard?

BTW, thanks for your work on FFTW for risc-v. We've used your PR directly :).

May 29 '23 15:05 ken-unger

@ken-unger , thank you very much for explaining some details of the design. I think the design intention of vector-isa, is to operate the vectors in their "natural" length. So if we have to use strip mining to operate long vector, it should be order-independent. In fact the rvv spec couldn't even garantee that vl = VLMAX when AVL is between VLMAX and 2 * VLMAX, in 6.3 of spec. So the underlength vector may occur at the penultimate loop, not just at the last loop. In-placing operation of a vector register here may be overwrited without tail undisturbed setted.

May 30 '23 09:05 sh-zheng

Hello @sh-zheng, we are also interested in a generic RVV implementation. Tested the x280 target with VLEN=128, noticed that the level 3 tests are failing, which is expected based on the discussion on this thread. Also saw some failing utests for which I think I found a fix (#4125). If help is needed we'd be open to contribute on providing some generic level 3 implementations.

Jul 05 '23 11:07 OMaghiarIMG

I believe this is addressed now through the additional RISCV64_ZVL256B and RISCV64_ZVL128B targets that got merged from the risc-v branch ?

Feb 04 '24 09:02 martin-frbg

Yes, PR #4159 has updated the tail policy. Now almost all kernels have supported rvv, except for *trsm and c/zsymv kernels. Another issue is that, if we can merge ZVL256B and ZVL128B, since there is no essential gap between them?

Feb 06 '24 12:02 sh-zheng

I'm not familiar enough with RISC-V, aren't at least the current GEMM kernels for ZVL256B and ZVL128B different (beyond using different unrolling) ?

Feb 06 '24 13:02 martin-frbg

Yes they are different realizations now, at least using different unrolling. Even so maybe it's better if they are in a unified framework in future, regardless of vlen.

Feb 06 '24 13:02 sh-zheng