Optimizations to FirstN for the lane size == 1, vector size == 512 case on 32-bit AVX-512 targets
Here is the current implementation of FirstN for 32-bit AVX-512 targets for the lane size == 1, vector size == 512 bits case: https://github.com/google/highway/blob/22e3d7276f4157d4a47586ba9fd91dd6303f441a/hwy/ops/x86_512-inl.h#L476-L480
Here is a better FirstN implementation of the above function for 32-bit AVX-512 targets:
#if HWY_COMPILER_MSVC >= 1920 || HWY_COMPILER_GCC >= 900 || HWY_COMPILER_CLANG || HWY_COMPILER_ICC
template <typename T, HWY_IF_LANE_SIZE(T, 1)>
HWY_INLINE Mask512<T> FirstN(size_t n) {
uint32_t loMask;
uint32_t hiMask;
uint32_t hiMaskOutLen;
#if HWY_COMPILER_GCC || HWY_COMPILER_CLANG
if(__builtin_constant_p(n >= 32) && n >= 32) {
if(__builtin_constant_p(n >= 64) && n >= 64)
hiMaskOutLen = 32u;
else
hiMaskOutLen = ((n <= 287) ? static_cast<uint32_t>(n) : 287u) - 32u;
loMask = hiMask = 0xFFFFFFFFu;
} else
#endif
{
const uint32_t maskOutLen = (n <= 255) ? static_cast<uint32_t>(n) : 255u;
loMask = _bzhi_u32(0xFFFFFFFFu, maskOutLen);
#if HWY_COMPILER_GCC || HWY_COMPILER_CLANG
if(__builtin_constant_p(maskOutLen <= 32) && maskOutLen <= 32)
return Mask512<T>{static_cast<__mmask64>(loMask)};
#endif
_addcarry_u32(_subborrow_u32(0, maskOutLen, 32u, &hiMaskOutLen),
0xFFFFFFFFu, 0u, &hiMask);
}
hiMask = _bzhi_u32(hiMask, hiMaskOutLen);
#if (HWY_COMPILER_GCC && !HWY_COMPILER_ICC) || HWY_COMPILER_CLANG
if(__builtin_constant_p((static_cast<uint64_t>(hiMask) << 32) | loMask))
#endif
return Mask512<T>{static_cast<__mmask64>((static_cast<uint64_t>(hiMask) << 32) | loMask)};
#if (HWY_COMPILER_GCC && !HWY_COMPILER_ICC) || HWY_COMPILER_CLANG
else
return Mask512<T>{_mm512_kunpackd(static_cast<__mmask64>(hiMask), static_cast<__mmask64>(loMask))};
#endif
}
#else
template <typename T, HWY_IF_LANE_SIZE(T, 1)>
HWY_INLINE Mask512<T> FirstN(size_t n) {
const uint64_t bits = n < 64 ? ((1ULL << n) - 1) : ~uint64_t{0};
return Mask512<T>{static_cast<__mmask64>(bits)};
}
#endif
The second implementation of FirstN for the lane size == 1, vector size == 512 bits case generates fewer instructions on AVX-512 targets than the current implementation at the -O2 optimization level.
The second implementation of FirstN should not be used if HWY_COMPILER_MSVC && HWY_COMPILER_MSVC < 1920 is true as x86 Visual C++ compiler versions earlier than 19.20 have a compiler bug that generates incorrect code.
The second implementation of FirstN also requires the _addcarry_u32 and _subborrow_u32 intrinsics that are available on Visual C++ 2015 or later, GCC 9.0 or later, ICC, or Clang 3.6 or later.
In the second implementation of FirstN for the lane size == 1, vector size == 512 bits case on AVX-512 targets:
-
hiMaskOutLen(which is equal tomaskOutLen - 32) is computed using the_subborrow_u32intrinsic as we want to compute the carry flag ofmaskOutLen - 32(which is 1 ifmaskOutLen < 32is true and 0 ifmaskOutLen >= 32is true) in addition tomaskOutLen - 32 -
hiMaskis initialized using the_addcarry_u32intrinsic, which will initializehiMaskto 0xFFFFFFFF ifmaskOutLen >= 32is true and will sethiMaskto 0 ifmaskOutLen < 32is true - We only care about the value of
hiMaskOutLenifmaskOutLen < 32is true ashiMaskwill be initialized to 0 ifmaskOutLen < 32is true -
hiMask = _bzhi_u32(hiMask, hiMaskOutLen)behaves as follows:- If
maskOutLen < 32is true,hiMask == 0will still be true - If
maskOutLen >= 32 && maskOutLen < 64is true, thenhiMaskwill now be equal to(1ULL << (maskOutLen - 32)) - 1 - If
maskOutLen >= 64is true, thenhiMask == 0xFFFFFFFFwill still be true
- If
-
__builtin_constant_pis used on GCC, Clang, and ICC to facilitate additional optimizations when optimizations are enabled