highway Updated F32 to BF16 demote ops and BF16FromF32

Updated BF16FromF32 implementation in hwy/base.h to round the F32 value to the nearest BF16 value (with ties to even) instead of simply truncating the F32 value to a BF16 to match the behavior of static_cast<hwy::bfloat16_t::Native>(f32) (on compilers supporting a static cast of a F32 to hwy::bfloat16_t::Native), AVX512BF16 vcvtneps2bf16, AArch64 BFCVT, and Power10 xvcvspbf16.

Reimplemented F32 to BF16 DemoteTo/ReorderDemote2To/OrderedDemote2To on all targets to round the F32 value to the nearest BF16 value instead of simply truncating a F32 value to a BF16 value.

Reimplemented F32 to BF16 DemoteTo/ReorderDemote2To/OrderedDemote2To on AVX3_ZEN4 and AVX3_SPR to use the _mm*_cvtneps_pbh and _mm*_cvtne2ps_pbh intrinsics if available.

Reimplemented F32 to F16 DemoteTo/ReorderDemote2To/OrderedDemote2To on PPC10 to use the __builtin_vsx_xvcvspbf16 intrinsic if available.

Reimplemented F32 to F16 DemoteTo/ReorderDemote2To/OrderedDemote2To on SVE to use the svcvt_bf16_f32_x intrinsic if the SVE_BF16 extension is available.

Also fixed compilation errors on SVE if the SVE BF16 extensions are enabled.

Also moved the F32 to BF16 DemoteTo/ReorderDemote2To/OrderedDemote2To implementation for targets other than SCALAR, EMU128, AVX3_ZEN4, AVX3_SPR, PPC10, RVV, and SVE to generic_ops-inl.h.

Also added checks for the AVX512BF16 instruction set extension in hwy/targets.cc to ensure that the AVX3_ZEN4 and AVX3_SPR targets are only enabled if the AVX512BF16 extension is available (which is supported on Zen 4 and Sapphire Rapids CPU's that have AVX512 support).

Also updated hwy/ops/set-macros_inl.h to include avx512bf16 in the target string on the AVX3_ZEN4 and AVX3_SPR targets for compilers that support the _mm*_cvtneps_pbh and _mm*_cvtne2ps_pbh intrinsics.

Jan 27 '24 23:01 johnplatts

I see that CI prevented the merge here: there's a clang internal compiler error for the Zen4 target. BF16 compiler support has really been half-baked :/

fatal error: error in backend: Cannot select: 0x370e3cab1380: v32bf16 = insert_subvector undef:v32bf16, 0x370e3cab1ee0, Constant:i64<0>
  0x370e3bbb7930: v32bf16 = undef
  0x370e3cab1ee0: v8bf16 = llvm.x86.vcvtneps2bf16256 TargetConstant:i64<14306>, 0x370e3cfcc9a0
    0x370e3cf644d0: i64 = TargetConstant<14306>
    0x370e3cfcc9a0: v8f32,ch = load<(load (s256) from constant-pool)> 0x370e39bd0b60, 0x370e3cab19a0, undef:i64
      0x370e3cab19a0: i64 = X86ISD::WrapperRIP TargetConstantPool:i64<<8 x float> <float -1.600000e+01, float -1.500000e+01, float -1.400000e+01, float -1.300000e+01, float -1.200000e+01, float -1.100000e+01, float -1.000000e+01, float -9.000000e+00>> 0
        0x370e36483620: i64 = TargetConstantPool<<8 x float> <float -1.600000e+01, float -1.500000e+01, float -1.400000e+01, float -1.300000e+01, float -1.200000e+01, float -1.100000e+01, float -1.000000e+01, float -9.000000e+00>> 0
      0x370e3627fee0: i64 = undef
  0x370e35534a80: i64 = Constant<0>
In function: _ZN3hwy11N_AVX3_ZEN423TestDup128VecFromValuesclINS_10bfloat16_tENS0_4SimdIS3_Lm32ELi0EEEEEvT_T0_

Any ideas for a potential workaround?

Feb 17 '24 11:02 jan-wassenberg

I see that CI prevented the merge here: there's a clang internal compiler error for the Zen4 target. BF16 compiler support has really been half-baked :/

fatal error: error in backend: Cannot select: 0x370e3cab1380: v32bf16 = insert_subvector undef:v32bf16, 0x370e3cab1ee0, Constant:i64<0>
  0x370e3bbb7930: v32bf16 = undef
  0x370e3cab1ee0: v8bf16 = llvm.x86.vcvtneps2bf16256 TargetConstant:i64<14306>, 0x370e3cfcc9a0
    0x370e3cf644d0: i64 = TargetConstant<14306>
    0x370e3cfcc9a0: v8f32,ch = load<(load (s256) from constant-pool)> 0x370e39bd0b60, 0x370e3cab19a0, undef:i64
      0x370e3cab19a0: i64 = X86ISD::WrapperRIP TargetConstantPool:i64<<8 x float> <float -1.600000e+01, float -1.500000e+01, float -1.400000e+01, float -1.300000e+01, float -1.200000e+01, float -1.100000e+01, float -1.000000e+01, float -9.000000e+00>> 0
        0x370e36483620: i64 = TargetConstantPool<<8 x float> <float -1.600000e+01, float -1.500000e+01, float -1.400000e+01, float -1.300000e+01, float -1.200000e+01, float -1.100000e+01, float -1.000000e+01, float -9.000000e+00>> 0
      0x370e3627fee0: i64 = undef
  0x370e35534a80: i64 = Constant<0>
In function: _ZN3hwy11N_AVX3_ZEN423TestDup128VecFromValuesclINS_10bfloat16_tENS0_4SimdIS3_Lm32ELi0EEEEEvT_T0_

Any ideas for a potential workaround?

The issue is that there is a compiler bug in some pre-release versions of Clang 18/19 with AVX512BF16 enabled, and a snippet that demonstrates the Clang 18/19 compiler bug with AVX512BF16 enabled can be found at https://godbolt.org/z/zPrETK633.

Any other fixes that need to be made to get the CI checks to succeed?

Feb 24 '24 00:02 johnplatts

Thank you for continuing to debug this :) Yes, unfortunately we still have an issue:

fatal error: error in backend: Cannot select: 0x340bb7ffb5b0: v32bf16 = insert_subvector undef:v32bf16, 0x340bbbfaeee0, Constant:i64<0>
  0x340bbc553930: v32bf16 = undef
  0x340bbbfaeee0: v8bf16 = llvm.x86.vcvtneps2bf16256 TargetConstant:i64<14331>, 0x340bbbfaee70
    0x340bbcca24d0: i64 = TargetConstant<14331>
    0x340bbbfaee70: v8f32,ch = load<(load (s256) from constant-pool)> 0x340bb2bd87e0, 0x340bbbfae9a0, undef:i64
      0x340bbbfae9a0: i64 = X86ISD::WrapperRIP TargetConstantPool:i64<<8 x float> <float -1.600000e+01, float -1.500000e+01, float -1.400000e+01, float -1.300000e+01, float -1.200000e+01, float -1.100000e+01, float -1.000000e+01, float -9.000000e+00>> 0
        0x340bb7ffb620: i64 = TargetConstantPool<<8 x float> <float -1.600000e+01, float -1.500000e+01, float -1.400000e+01, float -1.300000e+01, float -1.200000e+01, float -1.100000e+01, float -1.000000e+01, float -9.000000e+00>> 0
      0x340bbcd71ee0: i64 = undef
  0x340bb6542150: i64 = Constant<0>
In function: _ZN3hwy11N_AVX3_ZEN423TestDup128VecFromValuesclINS_10bfloat16_tENS0_4SimdIS3_Lm32ELi0EEEEEvT_T0_

The repro is shockingly simple, I wonder how that made it to release. Would you like to file an LLVM issue for it? Seems like a showstopper.

Feb 24 '24 03:02 jan-wassenberg

Thank you for continuing to debug this :) Yes, unfortunately we still have an issue:

fatal error: error in backend: Cannot select: 0x340bb7ffb5b0: v32bf16 = insert_subvector undef:v32bf16, 0x340bbbfaeee0, Constant:i64<0>
  0x340bbc553930: v32bf16 = undef
  0x340bbbfaeee0: v8bf16 = llvm.x86.vcvtneps2bf16256 TargetConstant:i64<14331>, 0x340bbbfaee70
    0x340bbcca24d0: i64 = TargetConstant<14331>
    0x340bbbfaee70: v8f32,ch = load<(load (s256) from constant-pool)> 0x340bb2bd87e0, 0x340bbbfae9a0, undef:i64
      0x340bbbfae9a0: i64 = X86ISD::WrapperRIP TargetConstantPool:i64<<8 x float> <float -1.600000e+01, float -1.500000e+01, float -1.400000e+01, float -1.300000e+01, float -1.200000e+01, float -1.100000e+01, float -1.000000e+01, float -9.000000e+00>> 0
        0x340bb7ffb620: i64 = TargetConstantPool<<8 x float> <float -1.600000e+01, float -1.500000e+01, float -1.400000e+01, float -1.300000e+01, float -1.200000e+01, float -1.100000e+01, float -1.000000e+01, float -9.000000e+00>> 0
      0x340bbcd71ee0: i64 = undef
  0x340bb6542150: i64 = Constant<0>
In function: _ZN3hwy11N_AVX3_ZEN423TestDup128VecFromValuesclINS_10bfloat16_tENS0_4SimdIS3_Lm32ELi0EEEEEvT_T0_

With what version of Clang is the above error happening with as the above compiler error does not occur with Clang 9.0.1, 10.0.0, 11.0.0, 12.0.0, 13.0.1, 14.0.0, 15.0.7, 16.0.6, or 17.0.6?

Feb 24 '24 11:02 johnplatts

Our internal clang is quite close to trunk. Unfortunately it does not come with an updated version number. Would it help to use our HWY_COMPILER_CLANG to deduce the major version number? That would probably only tell us >= 17.

Feb 25 '24 09:02 jan-wassenberg

Our internal clang is quite close to trunk. Unfortunately it does not come with an updated version number. Would it help to use our HWY_COMPILER_CLANG to deduce the major version number? That would probably only tell us >= 17.

What are the versions of the clang compiler (clang -v) that are still causing the CI to fail in this pull request?

I have tried disabling AVX512BF16 if HWY_COMPILER3_CLANG is 160000, 170000, 180000, or 190000 to work around AVX512BF16 codegen errors in pre-release Clang versions. I am also not encountering any compiler errors when I build Highway for the AVX3_ZEN4 or AVX3_SPR targets with the changes in this pull request with Clang 16.0.6 or Clang 17.0.6 on Ubuntu 22.04 using the clang-16 and clang-17 packages in the LLVM apt repository.

Feb 28 '24 19:02 johnplatts

What is still causing the CI checks to fail in this pull request?

Feb 29 '24 11:02 johnplatts

What are the versions of the clang compiler (clang -v) that are still causing the CI to fail in this pull request?

Unfortunately I do not have direct access to the builder machines that compile this code. They also deliberately prevent version checks by reporting version 9999, last I checked.

I admire your efforts in remote-debugging here, which can only be called heroic :) Your asm version did the trick, the CI failure is actually not us but an unrelated (flaky) test of an internal Highway user. Hopefully we'll get that straightened out by early next week, then we can merge :D

Feb 29 '24 11:02 jan-wassenberg

It turns out this failing test actually uses Highway and the bf16 rounding change seems to trigger the failure. Having a closer look..

Mar 20 '24 11:03 jan-wassenberg

I have patched the internal test so that it expects bf16 rounding. This is now under internal review.

Mar 20 '24 12:03 jan-wassenberg