highway icon indicating copy to clipboard operation
highway copied to clipboard

Optimizations to FirstN for the lane size == 1, vector size == 512 case on 32-bit AVX-512 targets

Open johnplatts opened this issue 3 years ago • 0 comments

Here is the current implementation of FirstN for 32-bit AVX-512 targets for the lane size == 1, vector size == 512 bits case: https://github.com/google/highway/blob/22e3d7276f4157d4a47586ba9fd91dd6303f441a/hwy/ops/x86_512-inl.h#L476-L480

Here is a better FirstN implementation of the above function for 32-bit AVX-512 targets:

#if HWY_COMPILER_MSVC >= 1920 || HWY_COMPILER_GCC >= 900 || HWY_COMPILER_CLANG || HWY_COMPILER_ICC
template <typename T, HWY_IF_LANE_SIZE(T, 1)> 
HWY_INLINE Mask512<T> FirstN(size_t n) {
  uint32_t loMask;
  uint32_t hiMask;
  uint32_t hiMaskOutLen;
  #if HWY_COMPILER_GCC || HWY_COMPILER_CLANG
  if(__builtin_constant_p(n >= 32) && n >= 32) {
    if(__builtin_constant_p(n >= 64) && n >= 64)
      hiMaskOutLen = 32u;
    else
      hiMaskOutLen = ((n <= 287) ? static_cast<uint32_t>(n) : 287u) - 32u;
    
    loMask = hiMask = 0xFFFFFFFFu;
  } else
  #endif
  {
    const uint32_t maskOutLen = (n <= 255) ? static_cast<uint32_t>(n) : 255u;
    loMask = _bzhi_u32(0xFFFFFFFFu, maskOutLen);
    
    #if HWY_COMPILER_GCC || HWY_COMPILER_CLANG
    if(__builtin_constant_p(maskOutLen <= 32) && maskOutLen <= 32)
      return Mask512<T>{static_cast<__mmask64>(loMask)};
    #endif
    
    _addcarry_u32(_subborrow_u32(0, maskOutLen, 32u, &hiMaskOutLen),
      0xFFFFFFFFu, 0u, &hiMask);
  }
  hiMask = _bzhi_u32(hiMask, hiMaskOutLen);
  #if (HWY_COMPILER_GCC && !HWY_COMPILER_ICC) || HWY_COMPILER_CLANG
  if(__builtin_constant_p((static_cast<uint64_t>(hiMask) << 32) | loMask))
  #endif
   return Mask512<T>{static_cast<__mmask64>((static_cast<uint64_t>(hiMask) << 32) | loMask)};
  #if (HWY_COMPILER_GCC && !HWY_COMPILER_ICC) || HWY_COMPILER_CLANG
  else
    return Mask512<T>{_mm512_kunpackd(static_cast<__mmask64>(hiMask), static_cast<__mmask64>(loMask))};
  #endif
}
#else
 template <typename T, HWY_IF_LANE_SIZE(T, 1)> 
 HWY_INLINE Mask512<T> FirstN(size_t n) { 
   const uint64_t bits = n < 64 ? ((1ULL << n) - 1) : ~uint64_t{0}; 
   return Mask512<T>{static_cast<__mmask64>(bits)}; 
 } 
#endif

The second implementation of FirstN for the lane size == 1, vector size == 512 bits case generates fewer instructions on AVX-512 targets than the current implementation at the -O2 optimization level.

The second implementation of FirstN should not be used if HWY_COMPILER_MSVC && HWY_COMPILER_MSVC < 1920 is true as x86 Visual C++ compiler versions earlier than 19.20 have a compiler bug that generates incorrect code.

The second implementation of FirstN also requires the _addcarry_u32 and _subborrow_u32 intrinsics that are available on Visual C++ 2015 or later, GCC 9.0 or later, ICC, or Clang 3.6 or later.

In the second implementation of FirstN for the lane size == 1, vector size == 512 bits case on AVX-512 targets:

  • hiMaskOutLen (which is equal to maskOutLen - 32) is computed using the _subborrow_u32 intrinsic as we want to compute the carry flag of maskOutLen - 32 (which is 1 if maskOutLen < 32 is true and 0 if maskOutLen >= 32 is true) in addition to maskOutLen - 32
  • hiMask is initialized using the _addcarry_u32 intrinsic, which will initialize hiMask to 0xFFFFFFFF if maskOutLen >= 32 is true and will set hiMask to 0 if maskOutLen < 32 is true
  • We only care about the value of hiMaskOutLen if maskOutLen < 32 is true as hiMask will be initialized to 0 if maskOutLen < 32 is true
  • hiMask = _bzhi_u32(hiMask, hiMaskOutLen) behaves as follows:
    • If maskOutLen < 32 is true, hiMask == 0 will still be true
    • If maskOutLen >= 32 && maskOutLen < 64 is true, then hiMask will now be equal to (1ULL << (maskOutLen - 32)) - 1
    • If maskOutLen >= 64 is true, then hiMask == 0xFFFFFFFF will still be true
  • __builtin_constant_p is used on GCC, Clang, and ICC to facilitate additional optimizations when optimizations are enabled

johnplatts avatar Aug 25 '22 15:08 johnplatts