mldsa-native Low performance in mld

Due to bit interleaving, repeated calls to keccak_absorb have much lower performance if they are not 64-bit aligned chunks. This suggests that it might be worth buffering between calls. The only place that this is relevant in the current codebase is in mld_H

An alternative would be to extend the keccak state to 26 64-bit integers, where s[25] is a temporary buffer, however this introduces new edge cases.

I suggest that we modify mld_H to only call absorb in 8-byte chunks.

Sep 10 '25 10:09 bremoran

Nice idea, may cause a little pain in the CBMC proof, but I'm interested to see the performance. Could you elaborate on the bit interleaving performance issue you mentioned? I'm trying to understand exactly what causes the performance degradation when keccak_absorb calls aren't using 64-bit aligned chunks.

Looking at mld_H, it currently makes separate calls to shake256_absorb() for each input buffer. Is concatenating all the inputs into a single 64-bit aligned buffer and then making one shake256_absorb() the optimization you suggest here?

Sep 18 '25 23:09 jakemas

@jakemas Sure, first some background:

The issue is that 32-bit cores don't typically have the 64-bit rotate that's required by the keccak algorithm. Doing a 64-bit rotate using 32-bit instructions typically requires 6 instructions: 2 x (left shift, right shift, or). Doing the same on an Armv7-M core with the flexible second operand reduces this total to 4 instructions: 2x(left shift, or with right-shifted operand). But, if we use bit interleaving then the rotate instructions just disappear due to that flexible second operand.

Here's the XKCP code that shows the bit interleaving concept: https://github.com/XKCP/XKCP/blob/master/lib/low/KeccakP-1600/ref-32bits/KeccakP-1600-reference32BI.c Here's a reference in Armv7-M assembly: https://github.com/XKCP/XKCP/blob/master/lib/low/KeccakP-1600/ARM/KeccakP-1600-inplace-32bi-armv7m-le-armcc.s And here's the version that removes most rotation instructions by using rotated operands: https://github.com/slothy-optimizer/slothy/blob/main/examples/naive/armv7m/keccak/keccakf1600_adomnicai_m4.s

So this approach should cut the cost of one keccak round by (5+25 lanes) * 4 instructions = 120 cycles, and one keccak state permute by 2880 cycles. In practice, it's slightly worse than this.

This sounds great except that the process to enter bit interleaving costs 37 instructions per lane and exiting bit interleaving costs 36 instructions per lane.

So, every time we xor a partial lane into the state, it costs an extra 37 instructions. Every time we read less than a full lane out of the state, it costs an extra 36 instructions.

The idea here is that we always make sure to do a full lane xor.

Sep 19 '25 08:09 bremoran

Low performance in mld_H