highway Add support for SaturatedAdd and SaturatedSub for int32_t, uint32_t, int64_t, and uint64_t vector types

Some SIMD instruction sets such as NEON, SVE, and RVV have instructions for saturated addition/subtraction of int32_t, uint32_t, int64_t, and uint64_t SIMD vectors.

AltiVec also has instructions for saturated addition/subtraction of int32_t and uint32_t SIMD vectors.

Here is how int32_t, uint32_t, int64_t, and uint64_t SaturatedAdd and SaturatedSub can be implemented for x86, WASM, and HWY_EMU128 targets:

template <size_t N>
HWY_API Vec128<uint32_t, N> SaturatedAdd(const Vec128<uint32_t, N> a,
                                         const Vec128<uint32_t, N> b) {
  return Add(a, Min(b, Not(a)));
}
template <size_t N>
HWY_API Vec128<uint64_t, N> SaturatedAdd(const Vec128<uint64_t, N> a,
                                         const Vec128<uint64_t, N> b) {
  return Add(a, Min(b, Not(a)));
}
template <size_t N>
HWY_API Vec128<uint32_t, N> SaturatedSub(const Vec128<uint32_t, N> a,
                                         const Vec128<uint32_t, N> b) {
  return Sub(Max(a, b), b);
}
template <size_t N>
HWY_API Vec128<uint64_t, N> SaturatedSub(const Vec128<uint64_t, N> a,
                                         const Vec128<uint64_t, N> b) {
  return Sub(Max(a, b), b);
}

template <size_t N>
HWY_API Vec128<int32_t, N> SaturatedAdd(const Vec128<int32_t, N> a,
                                        const Vec128<int32_t, N> b) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  
  const Vec128<int32_t, N> result = Add(a, b);
  const Vec128<int32_t, N> overflowMaskVect =
    BroadcastSignBit(And(Xor(a, result), Xor(b, result)));
  const Mask128<int32_t, N> overflowMask =
    MaskFromVec(overflowMaskVect);
  const Vec128<int32_t, N> overflowResult = BitCast(d, Add(
    ShiftRight<1>(BitCast(du, overflowMaskVect)),
    ShiftRight<31>(BitCast(du, a))));
  
  return IfThenElse(overflowMask, overflowResult, result);
}

template <size_t N>
HWY_API Vec128<int64_t, N> SaturatedAdd(const Vec128<int64_t, N> a,
                                        const Vec128<int64_t, N> b) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  
  const Vec128<int64_t, N> result = Add(a, b);
  const Vec128<int64_t, N> overflowMaskVect =
    BroadcastSignBit(And(Xor(a, result), Xor(b, result)));
  const Mask128<int64_t, N> overflowMask =
    MaskFromVec(overflowMaskVect);
  const Vec128<int64_t, N> overflowResult = BitCast(d, Add(
    ShiftRight<1>(BitCast(du, overflowMaskVect)),
    ShiftRight<63>(BitCast(du, a))));
  
  return IfThenElse(overflowMask, overflowResult, result);
}

template <size_t N>
HWY_API Vec128<int32_t, N> SaturatedSub(const Vec128<int32_t, N> a,
                                        const Vec128<int32_t, N> b) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  
  const Vec128<int32_t, N> result = Sub(a, b);
  const Vec128<int32_t, N> overflowMaskVect =
    BroadcastSignBit(And(Xor(a, b), Xor(a, result)));
  const Mask128<int32_t, N> overflowMask =
    MaskFromVec(overflowMaskVect);
  const Vec128<int32_t, N> overflowResult = BitCast(d, Add(
    ShiftRight<1>(BitCast(du, overflowMaskVect)),
    ShiftRight<31>(BitCast(du, a))));
  
  return IfThenElse(overflowMask, overflowResult, result);
}

template <size_t N>
HWY_API Vec128<int64_t, N> SaturatedSub(const Vec128<int64_t, N> a,
                                        const Vec128<int64_t, N> b) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  
  const Vec128<int64_t, N> result = Sub(a, b);
  const Vec128<int64_t, N> overflowMaskVect =
    BroadcastSignBit(And(Xor(a, b), Xor(a, result)));
  const Mask128<int64_t, N> overflowMask =
    MaskFromVec(overflowMaskVect);
  const Vec128<int64_t, N> overflowResult = BitCast(d, Add(
    ShiftRight<1>(BitCast(du, overflowMaskVect)),
    ShiftRight<63>(BitCast(du, a))));
  
  return IfThenElse(overflowMask, overflowResult, result);
}
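
For readers who want to check the identities used above without a Highway build, here is a minimal scalar sketch of the same arithmetic for 32-bit lanes. It is not part of Highway; the function names (SatAddU32, SatSubU32, SatAddI32, SatSubI32) are hypothetical, and the final conversion back to int32_t assumes the usual two's-complement wraparound (guaranteed from C++20, implementation-defined but universal in practice before that).

```cpp
#include <cassert>
#include <cstdint>

// Unsigned add: ~a is the headroom left above a (UINT32_MAX - a).
// Clamping b to that headroom means the addition can never wrap.
inline uint32_t SatAddU32(uint32_t a, uint32_t b) {
  const uint32_t headroom = ~a;
  return a + (b < headroom ? b : headroom);
}

// Unsigned sub: max(a, b) - b equals a - b when a >= b, and 0 otherwise.
inline uint32_t SatSubU32(uint32_t a, uint32_t b) {
  return (a > b ? a : b) - b;
}

// Signed add: overflow happened iff a and b have the same sign and the
// wrapped sum has the opposite sign, i.e. the sign bit of
// (a ^ sum) & (b ^ sum) is set.
inline int32_t SatAddI32(int32_t a, int32_t b) {
  const uint32_t ua = static_cast<uint32_t>(a);
  const uint32_t ub = static_cast<uint32_t>(b);
  const uint32_t sum = ua + ub;  // unsigned arithmetic wraps without UB
  const uint32_t overflow = ((ua ^ sum) & (ub ^ sum)) >> 31;
  // 0x7FFFFFFF + sign(a): INT32_MAX if a >= 0, bit pattern of INT32_MIN if a < 0.
  const uint32_t saturated = 0x7FFFFFFFu + (ua >> 31);
  return static_cast<int32_t>(overflow ? saturated : sum);
}

// Signed sub: overflow happened iff a and b have different signs and the
// wrapped difference has a different sign from a.
inline int32_t SatSubI32(int32_t a, int32_t b) {
  const uint32_t ua = static_cast<uint32_t>(a);
  const uint32_t ub = static_cast<uint32_t>(b);
  const uint32_t diff = ua - ub;  // unsigned arithmetic wraps without UB
  const uint32_t overflow = ((ua ^ ub) & (ua ^ diff)) >> 31;
  const uint32_t saturated = 0x7FFFFFFFu + (ua >> 31);
  return static_cast<int32_t>(overflow ? saturated : diff);
}
```

The 64-bit versions follow the same pattern with uint64_t/int64_t and a shift count of 63, mirroring the ShiftRight<63> in the vector code.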

NEON supports 32-bit and 64-bit saturated integral addition with the vqadd_s32, vqadd_s64, vqadd_u32, vqadd_u64, vqaddq_s32, vqaddq_s64, vqaddq_u32, vqaddq_u64 intrinsics. NEON also supports 32-bit and 64-bit saturated integral subtraction with the vqsub_s32, vqsub_s64, vqsub_u32, vqsub_u64, vqsubq_s32, vqsubq_s64, vqsubq_u32, vqsubq_u64 intrinsics.

Aug 25 '22 19:08 johnplatts

Integer adds (& subs) and muls with overflow checks would also be great. These can be implemented efficiently for some size*architecture combinations. LLVM has generic support for these intrinsics: https://godbolt.org/z/b5MoTWqs1 Unfortunately, the generated aarch64 code doesn't seem to use vectors wider than 128 bits. Passing -aarch64-sve-vector-bits-min=512 resulted in an error when used alongside -mcpu=a64fx.
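
To make the request concrete, here is a rough sketch of what overflow-checked adds and muls over arrays could look like. It uses the GCC/Clang builtins __builtin_add_overflow and __builtin_mul_overflow rather than the LLVM intrinsics from the godbolt link, and the function names CheckedAddArray and CheckedMulArray are hypothetical, not an existing Highway API. Whether a compiler vectorizes these loops depends on the target and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Adds n pairs of int32_t. out[i] receives the wrapped (two's-complement)
// result; the return value is true if any of the additions overflowed.
inline bool CheckedAddArray(const int32_t* a, const int32_t* b, int32_t* out,
                            size_t n) {
  bool any_overflow = false;
  for (size_t i = 0; i < n; ++i) {
    // The builtin returns true when the exact result does not fit in *out.
    any_overflow |= __builtin_add_overflow(a[i], b[i], &out[i]);
  }
  return any_overflow;
}

// Same idea for multiplication.
inline bool CheckedMulArray(const int32_t* a, const int32_t* b, int32_t* out,
                            size_t n) {
  bool any_overflow = false;
  for (size_t i = 0; i < n; ++i) {
    any_overflow |= __builtin_mul_overflow(a[i], b[i], &out[i]);
  }
  return any_overflow;
}
```

Accumulating a single flag across the loop, instead of branching per element, is one way to keep the loop body branch-free; a per-lane overflow mask would be the natural vector equivalent.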

Aug 25 '22 19:08 chriselrod

Sorry about the delayed reply, I was out on Thu/Fri.

Ah, that's a clever implementation, thanks for sharing :) We'd be happy to add those saturated adds if you or anyone else has a use case planned?

@chriselrod do you also have a use case for overflow-checked adds/muls? If so, I'd also be happy to work together with you towards a pull request for that.

Aug 29 '22 17:08 jan-wassenberg

> Ah, that's a clever implementation, thanks for sharing :) We'd be happy to add those saturated adds if you or anyone else has a use case planned?

A use case for 32-bit saturated addition/subtraction is to compute the saturated count of matching elements, saturated to either 2147483647 (the int32_t maximum) or 4294967295 (the uint32_t maximum).
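
A scalar sketch of that use case, assuming an unsigned 32-bit counter: each match adds 1 with a saturated add, so the count sticks at 4294967295 instead of wrapping to 0. The names SatIncrementU32 and SaturatedCountMatches are hypothetical, and the initial parameter exists only so that saturation can be demonstrated without billions of elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Saturated unsigned add, same identity as the vector code: a + min(b, ~a).
inline uint32_t SatIncrementU32(uint32_t a, uint32_t b) {
  const uint32_t headroom = ~a;
  return a + (b < headroom ? b : headroom);
}

// Counts elements of data[0, n) equal to needle, starting from `initial`
// and saturating at UINT32_MAX rather than wrapping around.
inline uint32_t SaturatedCountMatches(const uint8_t* data, size_t n,
                                      uint8_t needle, uint32_t initial) {
  uint32_t count = initial;
  for (size_t i = 0; i < n; ++i) {
    count = SatIncrementU32(count, data[i] == needle ? 1u : 0u);
  }
  return count;
}
```

In a vector version, each lane would keep its own counter and the comparison mask would supply the per-lane 0 or 1.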

Sep 14 '22 11:09 johnplatts

Thanks again for implementing these!

May 11 '23 08:05 jan-wassenberg
