regalloc2 icon indicating copy to clipboard operation
regalloc2 copied to clipboard

Replace scratch register with XOR swap

Open bwburnsides opened this issue 1 year ago • 2 comments

I'm getting up to snuff on compiler backend details for a personal project and the regalloc2 overview and the follow on article have been great.

I'm not going to pretend to completely understand the details here, but the MachineEnv requires a scratch register for what I gather is typically swaps, symbolically like this.

let x = 4;
let y = 5;

let swap = x;
x = y;
y = swap;

However there is a way to swap two values without introducing a third variable/register which is via an XOR swap:

x = y ^ x;
y = x ^ y;
x = y ^ x;

You shouldn't do this in high level code but should be fine in machine code of course. If an ISA allows integer operations to be applied to float registers then this can be used for both classes.

I'm nearly certain you know of this already, but I wanted to get it out there for the off chance, and if nothing else get details on why it wouldn't work.

bwburnsides avatar Dec 21 '24 00:12 bwburnsides

Thanks for the comment, @bwburnsides!

Yes, indeed, we're aware of this trick; it's not implemented in RA2 for a combination of reasons:

  • There is a layered approach to performing register moves and using a dedicated scratch is the last resort; and actually the scratch is optional:
    • First, if there are no cycles in a set of moves (a "parallel move"), we can shift values along linear move chains without any temporary space.
    • Then, if there is a move cycle but there is any register in the right register class free, we can use that register (and we opportunistically look for one).
    • Then, we can borrow a register that doesn't participate in a move cycle by spilling its contents to a spare spillslot (which we dynamically create on need).
    • Then, finally, we can use a dedicated scratch; this last case is quite rare.
  • The three-XOR trick obfuscates the data flow from the point of view of register renaming hardware: while it can track physical values through moves and wire up instruction dependencies appropriately in the out-of-order window, arithmetic properties of XORs are beyond the reach of the high-speed move-detection circuitry in renaming (as far as I know). This thus reduces performance (relative to a sequence of true moves) with false dependencies.
  • On x86, we could use xchg; this is a more promising route if we have a two-register move cycle with no spare space. We haven't pursued this.

So, tl;dr -- not much need, and dubious performance implications. You're welcome to experiment with this of course if you find a benchmark where the need appears!

cfallin avatar Dec 21 '24 01:12 cfallin

I thought very carefully about this a year ago or so and, while I've forgotten most of what I figured out, I remember concluding that swaps didn't feel worth adding to the parallel move algorithm in RA2.

A few notes to add to what Chris said, that I can reconstruct quickly, at least:

  • The x86 xchg instruction with a memory operand is implicitly locked, so if you have a cycle involving a stack location it's more expensive than it looks. I don't remember if that can actually happen in RA2 today.

  • According to Agner Fog's x86 instruction performance tables, xchg is implemented as three micro-ops and takes a couple of cycles. I really hoped it would just fiddle with the register renaming table, but no, it's actually using execution units for some reason?

  • In a cycle of N registers, we could use N-1 xchg instructions to drag a value all the way around the cycle while shifting the other values in the opposite direction. Using a scratch register instead involves N+1 mov instructions, plus a load and store if we have to create a scratch spillslot. But considering that mov instructions can retire four or more per cycle while xchg can only retire one per cycle, it seems likely that mov wins at all cycle lengths except when we have to spill to memory.

  • I seem to recall I couldn't find an equivalent for xchg on other architectures besides x86. But I could easily have missed something.

All that said, as Chris suggested, it would be really interesting if you try the experiment and let us know what happens!

jameysharp avatar Apr 10 '25 02:04 jameysharp