mldsa-native
Test out mulhrs vs add+shift in decompose
AVX2 doesn't have a rounding right-shift instruction (e.g. URSHR in Neon). Instead, it is simulated using "mulhrs with a power of 2" in decompose (here for example). While this only needs one instruction, it's likely that add+shift, despite being two instructions, is still faster. This should be benchmarked.
See https://github.com/pq-code-package/mldsa-native/pull/629#discussion_r2508945174 for more context.
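
For reference, the two variants can be compared on a scalar model of a single 16-bit lane. This is only a sketch and not code from the repository: the function names are hypothetical, the model follows the documented semantics of `_mm256_mulhrs_epi16` (`(((a * b) >> 14) + 1) >> 1`), and it assumes that `>>` on negative signed integers is an arithmetic shift, which is implementation-defined in C. It checks functional equivalence only and says nothing about which variant is faster.

```c
#include <stdint.h>

/* Scalar model of one 16-bit lane of _mm256_mulhrs_epi16 (vpmulhrsw).
 * Assumes arithmetic right shift for negative values. */
static int16_t mulhrs_lane(int16_t a, int16_t b)
{
  int32_t t = (int32_t)a * (int32_t)b;
  t = (t >> 14) + 1;
  return (int16_t)(t >> 1);
}

/* Rounding right shift by k (1 <= k <= 15) via mulhrs with 2^(15-k).
 * Hypothetical helper name, for illustration only. */
static int16_t round_shift_mulhrs(int16_t a, unsigned k)
{
  return mulhrs_lane(a, (int16_t)(1 << (15 - k)));
}

/* Rounding right shift by k (1 <= k <= 15) via add + arithmetic shift.
 * The sum is formed in 32 bits here; in a real 16-bit lane the add
 * can wrap, so the input range would have to be bounded. */
static int16_t round_shift_add(int16_t a, unsigned k)
{
  int32_t t = (int32_t)a + (1 << (k - 1));
  return (int16_t)(t >> k);
}

/* Exhaustively compare both variants over all inputs and shift amounts. */
static unsigned long count_mismatches(void)
{
  unsigned long mismatches = 0;
  for (unsigned k = 1; k <= 15; k++)
  {
    for (int32_t a = INT16_MIN; a <= INT16_MAX; a++)
    {
      if (round_shift_mulhrs((int16_t)a, k) != round_shift_add((int16_t)a, k))
      {
        mismatches++;
      }
    }
  }
  return mismatches;
}
```

With a vectorized add+shift, the 16-bit addition of the rounding constant must not overflow for the input range that decompose actually produces, so that bound would need to be checked alongside any benchmark.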