Add Loop Unrolling and Interleaving as an Optimization
These optimizations improve throughput in small, simple loops by amortizing the cost of control-flow instructions and exposing more opportunities for out-of-order execution. Currently, the following loop in the new single-threaded runtime:
|v: vec[i32]| result(for(v, merger[i32,+], |b, i, e| merge(b, e)))
produces the following assembly:
LBB0_2:
vpaddd (%rdi), %ymm0, %ymm0
addq $32, %rdi
addq $-1, %rsi
jne LBB0_2
In comparison, an unrolled and interleaved loop would produce something like this (the below is produced by LLVM's auto-vectorizer on a simple C for loop, with an interleave count of 4):
LBB0_14:
vpaddd -480(%rax,%rbx,4), %ymm1, %ymm1
vpaddd -448(%rax,%rbx,4), %ymm2, %ymm2
vpaddd -416(%rax,%rbx,4), %ymm3, %ymm3
vpaddd -384(%rax,%rbx,4), %ymm4, %ymm4
vpaddd -352(%rax,%rbx,4), %ymm1, %ymm1
vpaddd -320(%rax,%rbx,4), %ymm2, %ymm2
vpaddd -288(%rax,%rbx,4), %ymm3, %ymm3
vpaddd -256(%rax,%rbx,4), %ymm4, %ymm4
vpaddd -224(%rax,%rbx,4), %ymm1, %ymm1
vpaddd -192(%rax,%rbx,4), %ymm2, %ymm2
vpaddd -160(%rax,%rbx,4), %ymm3, %ymm3
vpaddd -128(%rax,%rbx,4), %ymm4, %ymm4
vpaddd -96(%rax,%rbx,4), %ymm1, %ymm1
vpaddd -64(%rax,%rbx,4), %ymm2, %ymm2
vpaddd -32(%rax,%rbx,4), %ymm3, %ymm3
vpaddd (%rax,%rbx,4), %ymm4, %ymm4
subq $-128, %rbx
addq $4, %rcx
jne LBB0_14
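To make the transformation concrete, here is a hedged C sketch of the scalar analogue of what the assembly above does: four independent accumulators (corresponding to ymm1-ymm4) break the single serial dependency chain of the naive reduction, and the loop overhead is amortized over four elements per iteration. The function names and the tail-handling strategy are illustrative, not part of Weld's runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline reduction: a single accumulator forms a serial
 * dependency chain, so each add must wait for the previous one. */
static int sum_simple(const int *v, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Sketch of 4-way unrolling with interleaved accumulators:
 * the four chains are independent, so an out-of-order core can
 * execute them in parallel, and the increment/compare/branch
 * overhead is paid once per four elements. */
static int sum_unrolled4(const int *v, size_t n) {
    int a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    /* Scalar tail for lengths not divisible by 4. */
    int acc = a0 + a1 + a2 + a3;
    for (; i < n; i++)
        acc += v[i];
    return acc;
}
```

The compiled loop additionally vectorizes each accumulator into a ymm register, so one iteration of the assembly loop processes 4 registers x 8 i32 lanes x 4 unrolled steps.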