rust icon indicating copy to clipboard operation
rust copied to clipboard

Unobservable vec, and arrays writes always always emit for size >= 3

Open vimirage opened this issue 3 years ago • 7 comments

While revealing that dead Vec<_> writes always emit, to someone else, I decided to look into this for arrays.

It seems that, for any [T; N], such that T is not a ZST or enum with a single possible value, and 3 <= N, then array writes are always emit. This applies to allowing emitting redundant memcpys, memclrs, memfills, and all related operations.

https://rust.godbolt.org/z/7vrT5xnxq

As for Vec<T>, as long as T isn't as described above, then unobservable writes and operations are usually emit. https://rust.godbolt.org/z/jhM75Mj5E

vimirage avatar Apr 28 '22 01:04 vimirage

For [T; N], this is because we don't add lifetime markers to by-move arguments passed indirectly: https://rust.godbolt.org/z/vjd78n38a.

I believe we should be adding a lifetime.end marker at the end of the function.

(Note that byval also fixes this, but it should be avoided because it changes the ABI in a way that limits optimization potential, by requiring the argument to be in a specific stack position.)


For Vec<T>, LLVM should be able to see that the stores are dead because it knows that __rust_dealloc frees memory.

It appears to be failing here because we conditionally call __rust_dealloc, and LLVM doesn't have enough information to optimize out the condition, so __rust_dealloc doesn't actually postdominate the store.

For the simplified case of Box<[T]>, see https://rust.godbolt.org/z/vnPM5o1K4. Note that Box<[u32]> has an extra:

%6 = shl i64 %1, 2
%7 = icmp eq i64 %6, 0

This is doing something like x.len() * size_of::<u32>() == 0, to decide whether or not to call __rust_dealloc. This check shouldn't be necessary, because we just checked x.len() == 0 above that, and we know the multiplication never wraps (but LLVM doesn't).

The store is optimized out for Box<[u8]>, because size_of::<u8> == 1 and the two conditions are trivially equivalent.

Despite this, it's still not optimized out for Vec<u8>, because again __rust_dealloc doesn't postdominate the store. In this case it seems like it's because the bounds check is based on x.len() and the dealloc check is based on x.capacity(), and LLVM doesn't know that x.len() <= x.capacity().

erikdesjardins avatar Apr 29 '22 04:04 erikdesjardins

Is there any way that LLVM could be informed x.len() <= x.capacity()? At a minimum, we could toss in an extra

unsafe {
 if !(x.len() <= x.capacity()) {
  std::hint::unreachable_unchecked();
 }
}

to Vec::{len, capacity} accessors? It could give LLVM the extra information necessary to optimize this code path? Otherwise, maybe an upstream issue could be made about introducing an attribute for marking members/variables/values/etc as always following a condition such as the above, so we could bind via a macro? Any suggestions that could be valuable for improving this?

vimirage avatar Jun 11 '22 23:06 vimirage

Yeah, that would likely work, or equivalently std::intrinsics::assume(x.len() <= x.capacity()). Might be worth opening a PR to run perf and make sure it's not too disruptive (assumes can interfere with other optimizations).

erikdesjardins avatar Jun 13 '22 22:06 erikdesjardins

This is doing something like x.len() * size_of::<u32>() == 0, to decide whether or not to call __rust_dealloc. This check shouldn't be necessary, because we just checked x.len() == 0 above that, and we know the multiplication never wraps (but LLVM doesn't).

I'm working on this, should be straightforward to use unchecked multiplication in that size calculation

erikdesjardins avatar Jun 13 '22 22:06 erikdesjardins

For the case of [T; N], from #98121 it seems that emitting lifetime.end is not viable.

A more viable option would be to add an attribute to upstream LLVM, like noreadonunwind/noreadonreturn discussed here: https://groups.google.com/g/llvm-dev/c/i0Z1FC51KVI.

However this is a bit less important now than when this issue was first opened, since MIR DSE is now able to remove dead stores in some cases, including the simple motivating case for [T; N]: https://rust.godbolt.org/z/MGe59WrxY.

This is still a problem when the argument is directly overwritten though (https://rust.godbolt.org/z/G3odYbq78), e.g.

pub fn example_f(mut x: String) {
    x = "foo".to_string(); // this still writes to memory in the caller's stack frame
}

erikdesjardins avatar Jun 17 '22 17:06 erikdesjardins

For the case of [T; N], from #98121 it seems that emitting lifetime.end is not viable.

A more viable option would be to add an attribute to upstream LLVM, like noreadonunwind/noreadonreturn discussed here: https://groups.google.com/g/llvm-dev/c/i0Z1FC51KVI.

Ah, is this the only thing still inhibiting optimization of more complicated cases/situations regarding dead code / dead stores? It does seem correct that the lifetime of the argument copy has ended within the function, yet I saw it mentioned in the PR that inlining caused complications with that?

This is still a problem when the argument is directly overwritten though (https://rust.godbolt.org/z/G3odYbq78), e.g.

pub fn example_f(mut x: String) {
    x = "foo".to_string(); // this still writes to memory in the caller's stack frame
}

Ouch, well.. this does impact arrays of Move-only types too! I'd certainly consider it to be in the same vein of issue, e.g.:

pub fn f(mut x: [Vec<u32>; 2]) {
    // three drops are emit: x[0], x[0], x[1], and an allocation for the new vector
    x[0] = vec![1, 2];
}

Hopefully the developers working on LLVM can eventually get a working implementation of what is described in that discussion, good luck.

vimirage avatar Jun 17 '22 20:06 vimirage

@rustbot label A-LLVM

vimirage avatar Jun 17 '22 22:06 vimirage

#121298 adds writable / dead_on_unwind (discussed above as noreadonunwind) to sret arguments.

But in order to improve this issue (i.e. to remove the stores to %x here: https://rust.godbolt.org/z/zexzc8oqa), we would also require dead_on_return. (Or on the rustc side we need to copy indirectly passed by-value arguments to an alloca at the entry of the function, but that might introduce other optimization problems.)

erikdesjardins avatar Feb 24 '24 19:02 erikdesjardins