Add support for SaturatedAdd and SaturatedSub for int32_t, uint32_t, int64_t, and uint64_t vector types
Some SIMD instruction sets such as NEON, SVE, and RVV have instructions for saturated addition/subtraction of int32_t, uint32_t, int64_t, and uint64_t SIMD vectors.
AltiVec also has instructions for saturated addition/subtraction of int32_t and uint32_t SIMD vectors.
Here is how int32_t, uint32_t, int64_t, and uint64_t SaturatedAdd and SaturatedSub can be implemented for x86, WASM, and HWY_EMU128 targets:
template <size_t N>
HWY_API Vec128<uint32_t, N> SaturatedAdd(const Vec128<uint32_t, N> a,
                                         const Vec128<uint32_t, N> b) {
  return Add(a, Min(b, Not(a)));
}

template <size_t N>
HWY_API Vec128<uint64_t, N> SaturatedAdd(const Vec128<uint64_t, N> a,
                                         const Vec128<uint64_t, N> b) {
  return Add(a, Min(b, Not(a)));
}

template <size_t N>
HWY_API Vec128<uint32_t, N> SaturatedSub(const Vec128<uint32_t, N> a,
                                         const Vec128<uint32_t, N> b) {
  return Sub(Max(a, b), b);
}

template <size_t N>
HWY_API Vec128<uint64_t, N> SaturatedSub(const Vec128<uint64_t, N> a,
                                         const Vec128<uint64_t, N> b) {
  return Sub(Max(a, b), b);
}

template <size_t N>
HWY_API Vec128<int32_t, N> SaturatedAdd(const Vec128<int32_t, N> a,
                                        const Vec128<int32_t, N> b) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  const Vec128<int32_t, N> result = Add(a, b);
  const Vec128<int32_t, N> overflowMaskVect =
      BroadcastSignBit(And(Xor(a, result), Xor(b, result)));
  const Mask128<int32_t, N> overflowMask =
      MaskFromVec(overflowMaskVect);
  const Vec128<int32_t, N> overflowResult = BitCast(d, Add(
      ShiftRight<1>(BitCast(du, overflowMaskVect)),
      ShiftRight<31>(BitCast(du, a))));
  return IfThenElse(overflowMask, overflowResult, result);
}

template <size_t N>
HWY_API Vec128<int64_t, N> SaturatedAdd(const Vec128<int64_t, N> a,
                                        const Vec128<int64_t, N> b) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  const Vec128<int64_t, N> result = Add(a, b);
  const Vec128<int64_t, N> overflowMaskVect =
      BroadcastSignBit(And(Xor(a, result), Xor(b, result)));
  const Mask128<int64_t, N> overflowMask =
      MaskFromVec(overflowMaskVect);
  const Vec128<int64_t, N> overflowResult = BitCast(d, Add(
      ShiftRight<1>(BitCast(du, overflowMaskVect)),
      ShiftRight<63>(BitCast(du, a))));
  return IfThenElse(overflowMask, overflowResult, result);
}

template <size_t N>
HWY_API Vec128<int32_t, N> SaturatedSub(const Vec128<int32_t, N> a,
                                        const Vec128<int32_t, N> b) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  const Vec128<int32_t, N> result = Sub(a, b);
  const Vec128<int32_t, N> overflowMaskVect =
      BroadcastSignBit(And(Xor(a, b), Xor(a, result)));
  const Mask128<int32_t, N> overflowMask =
      MaskFromVec(overflowMaskVect);
  const Vec128<int32_t, N> overflowResult = BitCast(d, Add(
      ShiftRight<1>(BitCast(du, overflowMaskVect)),
      ShiftRight<31>(BitCast(du, a))));
  return IfThenElse(overflowMask, overflowResult, result);
}

template <size_t N>
HWY_API Vec128<int64_t, N> SaturatedSub(const Vec128<int64_t, N> a,
                                        const Vec128<int64_t, N> b) {
  const DFromV<decltype(a)> d;
  const RebindToUnsigned<decltype(d)> du;
  const Vec128<int64_t, N> result = Sub(a, b);
  const Vec128<int64_t, N> overflowMaskVect =
      BroadcastSignBit(And(Xor(a, b), Xor(a, result)));
  const Mask128<int64_t, N> overflowMask =
      MaskFromVec(overflowMaskVect);
  const Vec128<int64_t, N> overflowResult = BitCast(d, Add(
      ShiftRight<1>(BitCast(du, overflowMaskVect)),
      ShiftRight<63>(BitCast(du, a))));
  return IfThenElse(overflowMask, overflowResult, result);
}
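To make the bit tricks in the vector code above easier to check, here is a minimal scalar sketch of the same logic for the 32-bit cases. The `Scalar*` function names are hypothetical and are not part of Highway; the signed versions do their arithmetic in `uint32_t` to avoid signed-overflow undefined behavior, and the final cast back to `int32_t` assumes two's complement wrapping (guaranteed from C++20, and the behavior of mainstream compilers before that).

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical scalar models of the vector code; not part of Highway.

// ~a equals UINT32_MAX - a, so min(b, ~a) is the largest addend that cannot wrap.
inline uint32_t ScalarSaturatedAddU32(uint32_t a, uint32_t b) {
  return a + std::min(b, static_cast<uint32_t>(~a));
}

// max(a, b) - b is a - b when a >= b, and 0 otherwise.
inline uint32_t ScalarSaturatedSubU32(uint32_t a, uint32_t b) {
  return std::max(a, b) - b;
}

inline int32_t ScalarSaturatedAddI32(int32_t a, int32_t b) {
  const uint32_t ua = static_cast<uint32_t>(a);
  const uint32_t ub = static_cast<uint32_t>(b);
  const uint32_t sum = ua + ub;  // wraps modulo 2^32
  // Overflow iff both inputs have a sign that differs from the sum's sign.
  const bool overflow = (((ua ^ sum) & (ub ^ sum)) >> 31) != 0;
  // 0x7FFFFFFF for a >= 0, 0x80000000 for a < 0.
  const uint32_t limit = 0x7FFFFFFFu + (ua >> 31);
  return static_cast<int32_t>(overflow ? limit : sum);
}

inline int32_t ScalarSaturatedSubI32(int32_t a, int32_t b) {
  const uint32_t ua = static_cast<uint32_t>(a);
  const uint32_t ub = static_cast<uint32_t>(b);
  const uint32_t diff = ua - ub;  // wraps modulo 2^32
  // Overflow iff a and b differ in sign and the difference's sign differs from a's.
  const bool overflow = (((ua ^ ub) & (ua ^ diff)) >> 31) != 0;
  const uint32_t limit = 0x7FFFFFFFu + (ua >> 31);
  return static_cast<int32_t>(overflow ? limit : diff);
}
```

In both signed cases the saturation limit depends only on the sign of `a`, which is why the vector code can build it from `ShiftRight<31>(BitCast(du, a))`.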
NEON supports 32-bit and 64-bit saturated integer addition with the vqadd_s32, vqadd_s64, vqadd_u32, vqadd_u64, vqaddq_s32, vqaddq_s64, vqaddq_u32, and vqaddq_u64 intrinsics. NEON also supports 32-bit and 64-bit saturated integer subtraction with the vqsub_s32, vqsub_s64, vqsub_u32, vqsub_u64, vqsubq_s32, vqsubq_s64, vqsubq_u32, and vqsubq_u64 intrinsics.
Integer adds (& subs) and muls with overflow checks would also be great. These can be implemented efficiently for some size*architecture combinations.
LLVM has generic support for these intrinsics:
https://godbolt.org/z/b5MoTWqs1
Unfortunately, the generated AArch64 code doesn't seem to use vectors wider than 128 bits. Adding -aarch64-sve-vector-bits-min=512 resulted in an error when used alongside -mcpu=a64fx.
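For readers unfamiliar with overflow-checked arithmetic, a rough scalar sketch of what checked adds and muls could look like is shown here. It assumes GCC or Clang, whose `__builtin_add_overflow` and `__builtin_mul_overflow` builtins store the wrapped result and return whether overflow occurred. The `CheckedAddI32` and `CheckedMulI32` helpers are hypothetical names, and this is not necessarily what the linked Godbolt example does.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (assumes GCC/Clang builtins): out[i] = a[i] + b[i] with
// wraparound; returns true if any element overflowed.
inline bool CheckedAddI32(const int32_t* a, const int32_t* b, int32_t* out,
                          size_t n) {
  bool any_overflow = false;
  for (size_t i = 0; i < n; ++i) {
    any_overflow |= __builtin_add_overflow(a[i], b[i], &out[i]);
  }
  return any_overflow;
}

// Hypothetical helper (assumes GCC/Clang builtins): out[i] = a[i] * b[i] with
// wraparound; returns true if any element overflowed.
inline bool CheckedMulI32(const int32_t* a, const int32_t* b, int32_t* out,
                          size_t n) {
  bool any_overflow = false;
  for (size_t i = 0; i < n; ++i) {
    any_overflow |= __builtin_mul_overflow(a[i], b[i], &out[i]);
  }
  return any_overflow;
}
```

Accumulating a single flag across the loop, rather than branching per element, keeps the loop body simple, which may help a compiler auto-vectorize it; whether that happens depends on the target and compiler flags.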
Sorry about the delayed reply, I was out on Thu/Fri.
Ah, that's a clever implementation, thanks for sharing :) We'd be happy to add those saturated adds if you or anyone else has a use case planned?
@chriselrod do you also have a use case for overflow-checked adds/muls? If so, I'd also be happy to work together with you towards a pull request for that.
A use case for 32-bit saturated addition/subtraction is computing a saturated count of matching elements, where the count saturates at either 2147483647 (the int32_t maximum) or 4294967295 (the uint32_t maximum).
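As an illustration of that use case, here is a minimal scalar sketch of a match counter that saturates at the uint32_t maximum instead of wrapping. `SaturatingCountAddU32` and `AccumulateMatchCount` are hypothetical names used only for this example; a vectorized version would presumably compare lanes and apply SaturatedAdd to per-lane counters.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar saturating add, same trick as the unsigned vector code.
inline uint32_t SaturatingCountAddU32(uint32_t a, uint32_t b) {
  return a + std::min(b, static_cast<uint32_t>(~a));
}

// Hypothetical helper: adds the number of elements in data[0, n) equal to key
// to the running count, clamping at UINT32_MAX rather than wrapping to zero.
inline uint32_t AccumulateMatchCount(uint32_t count, const int32_t* data,
                                     size_t n, int32_t key) {
  for (size_t i = 0; i < n; ++i) {
    count = SaturatingCountAddU32(count, data[i] == key ? 1u : 0u);
  }
  return count;
}
```

With a wrapping add, a counter already near the maximum would roll over to a small value and silently report too few matches; the saturating version stays pinned at the maximum.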
Thanks again for implementing these!