List of missing peephole optimizations
Is it possible doing some trivial logical simplifications for now? Or better wait souper with z3 solver for that? like simplify from:
(func $test (export "test") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const 10
i32.gt_s
get_local $p0
i32.const 10
i32.eq
i32.or)
to:
(func $test (export "test") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const 9
i32.gt_s)
or:
(func $test (export "test") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const 10
i32.ge_s)
DONE: from:
get_global $i
i32.const 0
i32.le_s
i32.eqz
if
to:
get_global $i
i32.const 0
i32.gt_s
if
Good idea, we should do these!
It's not hard to add them. The correct pass is OptimizeInstructions (src/passes/OptimizeInstructions.cpp). Do you want to do it?
Oh, great! I think better if you start some basics and may be after I add additional methods/rules for that pass =)
@MaxGraey Sounds good, I can get the stuff set up so that it's just a matter of adding the specific rules into that pattern. Take a look at https://github.com/WebAssembly/binaryen/commit/d3e5d4416f1bcb8fe3998db39351be766ba250aa - I added the first rule into combineOr there, and a test in $combine-or. The rest should be straightforward, let me know if you need any help.
(I realized I need to land a few other things before that commit, so I can't land it right now, but posting this now so you can get started if you want.)
Great! Are you plan merge this to master? Or I can just fork oi2 and continue work in separate branch?
btw I use one trick with range checking if all arguments is unsigned. Like:
(x >= C0) & (x <= C1)
where x, C0 and C1 are unsigned ints
(unsigned)x - C0 <= C1 - C0
Also integer sequence of logical expressions like:
(x == C) | (x == C + 1) | ... | (x == N)
could transform to:
(unsigned)(x - C) <= N - C
is it make sense implement this rule as well, wdyt?
@MaxGraey
Yes, for now fork oi2 and work on that, since I might not be able to land it today.
That unsigned trick looks valid to me. Maybe leave it for a separate PR afterwards though, so there isn't too much in one?
Also figure outing some useful rules for integer (sing/unsign) comparators:
x > x - 1
could always transform to:
x != MIN_VALUE
and
x + 1 > x
could always transform to:
x != MAX_VALUE
Also wdyt using for x != 0:
get_local $0
i32.eqz
i32.eqz
instead
get_local $0
i32.const 0
i32.ne
This reducing code by 1 byte, but generate worse machine code so this probably useful only for shrink code >= 2. Of course this not very likely code because if/else/select don't require this operation explicitly
Comment before last looks good to me.
Last comment also looks good - as you said in if or select etc. inputs we can flip the arms and avoid != 0 anyhow, but there are cases when it happens elsewhere, so i think it's a good rule to add.
@MaxGraey how are you getting along with these optimizations? Anything I can help with?
@kripken Sorry, currently I busy on work but I hope I will return to this on weekend
Just note. Found another possible improvments like transform this:
i32.const 1
i32.const 0
get_local $p0
select
or this:
get_local $p0
if $I0 (result i32)
i32.const 1
else
i32.const 0
end
to this (depending on $p0 type)
get_local $p0
i32.const 0
i32.ne
UPDATE
Implemented in #2869
More possible optimizations:
- from:
(func $t1 (export "t1") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const 0
i32.lt_s)
to:
(func $t1 (export "t1") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const 31
i32.shr_s)
- from:
(func $t2 (export "t2") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const -1
i32.ne)
to (but not always possible, probably valid only if result used as i1 in next steps):
(func $t2 (export "t2") (type $t0) (param $p0 i32) (result i32)
get_local $p0
i32.const -1
i32.xor)
one more peephole optimization:
~(1 << n) -> rotl(-2, n)
which looks like in wat as:
i32.const 1
get_local $p0
i32.shl
i32.const -1
i32.xor
to:
i32.const -2
get_local $p0
i32.rotl
Done!
(x >> C) != 0 =========> (unsigned)x > ((1 << C) - 1)
((unsigned)x >> C) > 0 ==> (unsigned)x > ((1 << C) - 1)
((signed)x >> C) > 0 ====> (signed)x > ((1 << C) - 1)
(x >> C) == 0 =========> (unsigned)x < ((1 << C) - 1)
(x >> C) < 0 ==========> (unsigned)x < ((1 << C) - 1)
((signed)x >> C) < 0 ====> (unsigned)x >> 31
~~((unsigned)x >> C) < 0 ==> 0~~
Such distributive optimizations for integers also missing currently:
x * y + x * z
x * y - x * z
to
x * (y + z)
x * (y - z)
Upd PR: #3132
~~(signed)x % С_pot == 0 ==> (x & (С_pot - 1)) == 0~~
~~(signed)x % С_pot != 0 ==> (x & (С_pot - 1)) != 0~~
x % C == C ==> 0
x % C != C ==> 1, if C != 0
(unsigned)x % C >= C,
(unsigned)x % C > C ==> 0, if C != 0
(signed)x % C >= C,
(signed)x % C > C ==> C < 0, if C != 0
(unsigned)x % C <= C,
(unsigned)x % C < C ==> 1, if C != 0
(signed)x % C <= C,
(signed)x % C < C ==> (C ^ -1) < 0, if C != 0
Pretty common patterns. But what is surprising is neither LLVM nor GCC can't optimize this:
(x > y) - (x < y) < 0 -> x < y
(x > y) - (x < y) <= 0 -> x <= y
(x > y) - (x < y) == 0 -> x == y
(x > y) - (x < y) > 0 -> x > y
(x > y) - (x < y) >= 0 -> x >= y
Need to think could we generalize this more
// add-sub integer patters
(x + y) - x -> y
(x - y) - x -> 0 - y
(x + y) - y -> x
(x - y) - y -> x - (y << 1)
x - (x + y) -> 0 - y
~~x - (x - y) -> y~~ (#3014)
y - (x + y) -> 0 - x
y - (x - y) -> (y << 1) - x
(x + y) + x -> (x << 1) + y
(x - y) + x -> (x << 1) - y
(x + y) + y -> (y << 1) + x
(x - y) + y -> x
however probably better add general Reassociate sub pass which also handle others associative ops like |, &, ^
// div-mul integer patters
((signed)x / y) * y -> x - ((signed)x % y)
((unsigned)x / y) * y -> x - ((unsigned)x % y), if pure(x) && pure(y) && y != C
~~-(x + C) -> -C - x~~
~~-(C - x) -> x - C~~
-1 - x -> x ^ -1
bool(x) ? -1 : 0 -> 0 - x
bool(x) ? 0 : -1 -> 0 - (x ^ 1)
where bool(x) is getBits(x) == 1
float point simplifications:
x > 0.0 ? 1.0 : -1.0 -> copysign(1.0, +x)
x >= 0.0 ? 1.0 : -1.0 -> copysign(1.0, +x)
x < 0.0 ? 1.0 : -1.0 -> copysign(1.0, -x)
x <= 0.0 ? 1.0 : -1.0 -> copysign(1.0, -x)
x > 0.0 ? -1.0 : 1.0 -> copysign(1.0, -x)
x >= 0.0 ? -1.0 : 1.0 -> copysign(1.0, -x)
x < 0.0 ? -1.0 : 1.0 -> copysign(1.0, +x)
x <= 0.0 ? -1.0 : 1.0 -> copysign(1.0, +x)
copysign(x, +C) -> abs(x)
copysign(x, -C) -> -abs(x)
x * copysign(1.0, x) -> abs(x)
x * copysign(1.0, -x) -> -abs(x)
~~abs(x) * abs(x) -> x * x~~ (#3013)
copysign(x, y) == 0 -> x == 0
copysign(copysign(x, y), z) -> copysign(x, z)
copysign(x, y) * copysign(x, y) -> x * x
~~sqrt(x) < C -> x >= 0 && x < C * C~~
x / C_pot -> x * (1 / C_pot) (#3018)
where x -> float point, C_pot power of two float point constant
~~x * 2.0 -> x + x~~ (#3016) done!
It seems this transform may reduce up to 7 bytes due to wasm doesn't use any variant encoding for floats cc @kripken
~~i32(i64(x)) -> x, where x::i32~~ done
~~i32(u64(x)) -> x, where x::i32~~ done
same as:
i64.extend_s/i32 ;; or i64.extend_u/i32
i32.wrap/i64
~~i64(u32(x)) -> x & 0xFFFFFFFF, where x::i64~~ not optimal
~~or~~
~~i32.wrap/i64
i64.extend_u/i32 ;; but not for extend_s/i32~~
x ? 0 : C_pot -> (!x) << log2(C_pot)
x ? C_pot : 0 -> (!!x) << log2(C_pot)
x * (y ? C1 : C2) -> y ? C1 * x : C2 * x, if C1, C2 <- -1 | 0 | 1
examples:
x * (y ? -1 : 1) -> y ? -x : x
x * (y ? 1 : 0) -> y ? x : 0
x * (y ? -1 : 0) -> y ? -x : 0
~x | x -> -1
~~~x ^ x -> -1~~
~x + x -> -1
~x + 1 -> 0 - x
~~-x - 1 -> ~x (x ^ -1)~~
-1 - x -> ~x (x ^ -1)
~~x < 0 ? -1 : 1 -> x >> 31 | 1~~ done!
x / C -> 0 if maxBits(x) < maxBits(C)
like:
(x & 3) / 4 -> 0
i32(f64.max(f64(x), f64(y)) -> select(x, y, x > y), if type(x & y) == i32 | u32
i32(f64.min(f64(x), f64(y)) -> select(x, y, x < y), if type(x & y) == i32 | u32
i32(f64.abs(f64(x))) -> select(-x, x, x < 0), if type(x) == i32
DONE
~~i32(f64(x)) -> x, if type(x) == i32 | u32~~
~~u32(f64(x)) -> x, if type(x) == i32 | u32~~
~~floor(f64(x)) -> f64(x), if type(x) == i32 | u32~~
~~ceil(f64(x)) -> f64(x), if type(x) == i32 | u32~~
~~trunc(f64(x)) -> f64(x), if type(x) == i32 | u32~~
~~nearest(f64(x)) -> f64(x), if type(x) == i32 | u32~~
-(i32(x) / C) -> i32(x) / -C, if x != u32, C != min_s
this can't be optimized nor div_s either div_u: ~~-x / C -> x / -C~~