Investigate alternate stack- and register-allocation strategies in backend

Open MatthewFluet opened this issue 8 years ago • 0 comments

#217 updated the stack-allocation strategy in the RSSA to Machine translation:

Do not force Cont block arguments to stack (7ec42a1)
Improve Allocation.Stack.get (31b6e29)

Unfortunately, the improved Allocation.Stack.get has degraded the hamlet benchmark (but not others):

MLton0 -- /home/mtf/devel/mlton/builds/20171030.143531-ga81514e/bin/mlton -codegen amd64
MLton1 -- /home/mtf/devel/mlton/builds/20171030.143531-ga81514e/bin/mlton -codegen c
MLton2 -- /home/mtf/devel/mlton/builds/20171030.143531-ga81514e/bin/mlton -codegen llvm
MLton3 -- /home/mtf/devel/mlton/builds/20171101.181350-g7ec42a1/bin/mlton -codegen amd64
MLton4 -- /home/mtf/devel/mlton/builds/20171101.181350-g7ec42a1/bin/mlton -codegen c
MLton5 -- /home/mtf/devel/mlton/builds/20171101.181350-g7ec42a1/bin/mlton -codegen llvm
MLton6 -- /home/mtf/devel/mlton/builds/20171101.190843-g31b6e29/bin/mlton -codegen amd64
MLton7 -- /home/mtf/devel/mlton/builds/20171101.190843-g31b6e29/bin/mlton -codegen c
MLton8 -- /home/mtf/devel/mlton/builds/20171101.190843-g31b6e29/bin/mlton -codegen llvm
MLton9 -- /home/mtf/devel/mlton/builds/20171105.113517-gfd0d8e4/bin/mlton -codegen amd64
MLton10 -- /home/mtf/devel/mlton/builds/20171105.113517-gfd0d8e4/bin/mlton -codegen c
MLton11 -- /home/mtf/devel/mlton/builds/20171105.113517-gfd0d8e4/bin/mlton -codegen llvm
run time ratio
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11
hamlet      1.00   2.53   2.27   0.99   2.39   2.29   1.18   2.61   2.44   1.16    2.70    2.46
size
benchmark    MLton0    MLton1    MLton2    MLton3    MLton4    MLton5    MLton6    MLton7    MLton8    MLton9   MLton10   MLton11
hamlet    1,458,564 1,501,276 1,529,948 1,454,516 1,497,868 1,526,700 1,445,700 1,490,156 1,519,524 1,449,492 1,492,732 1,523,316
compile time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11
hamlet     12.82  23.79  40.39  11.96  23.34  41.01  12.11  23.10  40.84  12.66   23.58   41.44
run time
benchmark MLton0 MLton1 MLton2 MLton3 MLton4 MLton5 MLton6 MLton7 MLton8 MLton9 MLton10 MLton11
hamlet     33.70  85.28  76.34  33.37  80.64  77.29  39.73  87.84  82.11  39.26   90.87   82.93

(20171105.113517-gfd0d8e4 is 31b6e29 with 7ec42a1 reverted.)

The improved Allocation.Stack.get was not expected to have a negative effect on runtime; it should simply find different stack slots for variables and possibly decrease stack frame sizes.

One conjecture is that the improved Allocation.Stack.get causes variables that previously shared a stack slot to now have different stack slots. This could impact intraprocedural Goto transfers, where, previously, a formal and an actual argument were assigned the same stack slot (and the move of the actual to the formal was a noop), but, now, the formal and the actual are assigned different stack slots (and the move of the actual to the formal is an executed move).

In general, the current stack- and register-allocation strategy (i.e., the mapping of RSSA variables to Machine globals, stack-slots, or registers) in the RSSA to Machine translation is greedy. [Remember, in the Machine IR, "registers" are really local variables that are not live across a non-tail or runtime call. The hope is that they will map to hardware registers, but in the Machine IR, there can be an arbitrary number of registers.] The strategy is essentially to begin with the stack- and register-allocation dictated by the live-in of a block and then greedily assign stack-slots and registers to variables defined within the block. Allocations are not updated when variables go dead within a block (potentially freeing up stack-slots and/or registers). Nor is any attempt made to optimize the placement of variables to reduce moves.

For Machine IR registers, this is mostly acceptable, since they codegens will perform their own assignment of these Machine IR registers (remember, "local variables") to hardware registers and spill locations, presumably using coalescing and other heuristics.

For Machine IR stack-slots, though, this can be problematic, because the codegens will generally perform reads and writes to stack-slots as instructed; in particular, a stack-slot to stack-slot move can never be elided by the C and LLVM codegens (since they appear as reads and writes of memory) and aren't easily elided by the native x86/amd64 codegens (the -native-live-stack {false|true} option was meant to track the liveness of stack slots in the native codegens so as to allow a Machine IR stack slot to live entirely in hardware registers without requiring writing the value back to memory, but the management of liveness information in the native codeges is naive (and performs poorly) and the number of stack-allocated variables can be large, so it never performed well-enough on self-compiles to become the default).

In any case, one could investigate better allocation strategies for the RSSA to Machine translation. In particular, actually build an interference graph and recognize variable-to-variable moves and make assignments to avoid extraneous moves.

Nov 06 '17 16:11 MatthewFluet