zerocopy icon indicating copy to clipboard operation
zerocopy copied to clipboard

Prevent panics statically

Open joshlf opened this issue 2 years ago • 21 comments

Issues like #200 demonstrate that the compiler is sometimes not smart enough to optimize out panics that we can prove statically won't happen. It would be good if we could:

  • Refactor to only use operations which don't include panicking in their call graph
  • Validate this in CI

See also #325, #1125

Mentoring instructions

  • [ ] Easy steps
    • [ ] Create a new step in this CI job that runs Red Pen on zerocopy, only running the step under if: matrix.crate == 'zerocopy' && matrix.toolchain == 'nightly'
    • [ ] Annotate as many functions as possible with #[redpen::dont_panic]
  • [ ] Medium to hard (depending on the code): Refactor as much code as possible to remove all panics

joshlf avatar Jul 28 '23 18:07 joshlf

Here is a discussion on another project about having a no panic policy with some good information: https://github.com/orgs/tpm-rs/discussions/5

I think it could help the zerocopy library if it were interested in some automated CI checks to ensure no panics are present in "tier 1" tool chains.

jettr avatar Aug 03 '23 17:08 jettr

Awesome, thanks for that breadcrumb! I'd like to go even further than that, and ensure that panics aren't just optimized out, but never emitted in the first place. Do you know if that technique would work in conjunction with opt-level = 0, and would that suffice to guarantee that we're never emitting panics? I want to make sure we don't get into the situation in #200 in which panics are optimized out in a toy example but not in a larger application.

joshlf avatar Aug 07 '23 21:08 joshlf

I like the idea of using opt-level=0 for the "tier 1" tool chains that CI checks against to ensure there are no panics. I don't know of a way to ensure panics are never emitted with 100% certainty though. I think your idea is about as close as you can get with currently tooling.

jettr avatar Aug 07 '23 22:08 jettr

I'm starting to work on this. I suspect I'll run out of time to send a PR this week but I hope to send one next week.

jrvanwhy avatar Oct 12 '23 23:10 jrvanwhy

Awesome, thanks! I'll assign it to you. Feel free to comment here if you have any questions.

joshlf avatar Oct 12 '23 23:10 joshlf

It turns out that redpen does not catch every possible panic path. For example, it does not error on this program:

pub trait MaybePanic { fn maybe_panic(&self); }

pub struct WillPanic;
impl MaybePanic for WillPanic {
    fn maybe_panic(&self) { panic!(); }
}

#[redpen::dont_panic]
pub fn will_this_panic<M: MaybePanic>(m: M) {
    m.maybe_panic();
}

#[redpen::dont_panic]
pub fn main() {
    will_this_panic(WillPanic);
}

So we have a few tools at our disposal, all of which are incomplete:

  1. Linking-based checks: Build code that uses zerocopy in a way where we can look at the emitted symbols/links to determine whether a panic could have happened. no_panic is an example of this approach, but I think it could also be done using e.g. a custom #[panic_handler].
  2. Red Pen: use the redpen tool to find panics, marking everything that we don’t want to panic with #[redpen::dont_panic].
  3. Clippy lints: there are more clippy lints we can enable to find sources of accidental panics.

These all have their advantages and disadvantages:

Linking-based checks

Pro: Errors exactly when rustc emits a panic in generated code, which is the cost we care about. Con: Pretty much all of zerocopy is #[inline] or generic, so we can’t just check the zerocopy rlib. We’ll have to manually maintain an additional binary/library that invokes much of zerocopy’s functionality, which adds a lot of maintenance effort. I suspect this crate will be forgotten about frequently. Con: Sensitive to exact compiler versions, optimization levels. As previously mentioned, this could pass in zerocopy's CI when the same functionality will emit a panic in user code. Likely to break during nightly toolchain updates. Con: Ugly error messages, which may make it difficult to identify where the panic is.

Red Pen

Note: At the moment, redpen does not change its status code to indicate whether it found any issues. If we'd like to use redpen, I'll send a PR upstream to make it report its failures, and then we'll probably want to wait for a new release of redpen before we start using it.

Pro: Catches most panics. Pro: Clearly documents what can panic in the code (if it has #[redpen::dont_panic] then it can't panic... theoretically). Con: Relatively new and unproven. It requires a shim crate to be added to zerocopy's [dependencies]. That shim crate is simple, but may be concerning for some of zerocopy's users. Also, if it disappears we'll have to tear it out of zerocopy's codebase, and would need another approach to avoid panics anyway. Con: Awkward integration into the CI's toolchain test matrix, as it is pinned to its own toolchain. Con: It's one more tool for zerocopy contributors to install and run.

Clippy lints

With Rust 1.69 (what I happen to have on my system), the following Clippy lints point to code in zerocopy that can emit panic branches:

I also think there is merit in enabling missing_panics_doc, although at the moment it does not point out anything in zerocopy.

Pro: Well-supported tool that zerocopy already uses which emits well-formatted error messages. Con: Won't catch all panic causes. E.g. no clippy lint catches uses of Vec::swap_remove, which can panic.

Looking for feedback

We can choose to use any combination of linking checks, Red Pen, and Clippy to minimize panics. I would like input from zerocopy's maintainers on which approaches you would prefer. I can pursue any of them.

Here's my take:

  1. Linking-based checks are more toil than they are worth.
  2. I'm ambivalent on Red Pen. It is design to solve exactly the problem we have but has significant drawbacks.
  3. I think we should investigate the additional Clippy lints. I can send PRs that enable and address each lint individually, and we can see if each one is beneficial in practice.

Maybe I should enable and address the Clippy lints first, then give Red Pen a try and see how many panics it catches that Clippy doesn't? Then we can evaluate whether Red Pen is worth the effort.

jrvanwhy avatar Oct 19 '23 00:10 jrvanwhy

Holy cow, thank you for such a detailed writeup!!

Maybe I should enable and address the Clippy lints first, then give Red Pen a try and see how many panics it catches that Clippy doesn't? Then we can evaluate whether Red Pen is worth the effort.

This sounds like a good plan to me!

Linking-based checks

... Con: Pretty much all of zerocopy is #[inline] or generic, so we can’t just check the zerocopy rlib. We’ll have to manually maintain an additional binary/library that invokes much of zerocopy’s functionality, which adds a lot of maintenance effort. I suspect this crate will be forgotten about frequently.

Do you think it'd be possible to do this in zerocopy itself? A few thoughts:

  • If we mark a generic function as #[no_panic] and then instantiate it with a concrete type (e.g. by doing let _ = foo::<T>;), will that exercise the check?
  • Maybe we could just emit code in zerocopy itself (e.g. when cfg(test)) which takes the place of an external crate, and so requires less overhead
  • Is there any way we could write an automatic check that determines whether each function has been called in this checking code?

Con: Sensitive to exact compiler versions, optimization levels. As previously mentioned, this could pass in zerocopy's CI when the same functionality will emit a panic in user code. Likely to break during nightly toolchain updates.

My hope is that we can get rid of panics without relying on the optimizer, and so we can run this check with opt-level = 0. Hopefully that will also make exact compiler versions a non-issue.

Red Pen

Note: At the moment, redpen does not change its status code to indicate whether it found any issues. If we'd like to use redpen, I'll send a PR upstream to make it report its failures, and then we'll probably want to wait for a new release of redpen before we start using it.

+1

Con: Relatively new and unproven. It requires a shim crate to be added to zerocopy's [dependencies]. That shim crate is simple, but may be concerning for some of zerocopy's users. Also, if it disappears we'll have to tear it out of zerocopy's codebase, and would need another approach to avoid panics anyway.

I was thinking we'd just have this be a dev-dependency and only check it during cargo test in CI. Would that be a viable approach?

joshlf avatar Oct 19 '23 03:10 joshlf

Maybe I should enable and address the Clippy lints first, then give Red Pen a try and see how many panics it catches that Clippy doesn't? Then we can evaluate whether Red Pen is worth the effort.

This sounds like a good plan to me!

Alright, I'll start working on Clippy lint PRs. We can continue discussing the other options in parallel.

Linking-based checks

... Con: Pretty much all of zerocopy is #[inline] or generic, so we can’t just check the zerocopy rlib. We’ll have to manually maintain an additional binary/library that invokes much of zerocopy’s functionality, which adds a lot of maintenance effort. I suspect this crate will be forgotten about frequently.

Do you think it'd be possible to do this in zerocopy itself? A few thoughts:

  • If we mark a generic function as #[no_panic] and then instantiate it with a concrete type (e.g. by doing let _ = foo::<T>;), will that exercise the check?

With the no_panic crate, the check only happens when the object files are linked into a binary. We can use the test binary, so the panic check happens during cargo test. In that case, you can invoke a particular monomorphization of the function, which will check that monomorphization.

  • Maybe we could just emit code in zerocopy itself (e.g. when cfg(test)) which takes the place of an external crate, and so requires less overhead

Yes, using the test binary works (I don't know why I didn't think of that initially).

  • Is there any way we could write an automatic check that determines whether each function has been called in this checking code?

We could use a code coverage tool. If we tag a function as #[no_panic], and the coverage tool indicates that function was executed by cargo test, then I think we can be confident the no-panic check ran on that function.

Con: Sensitive to exact compiler versions, optimization levels. As previously mentioned, this could pass in zerocopy's CI when the same functionality will emit a panic in user code. Likely to break during nightly toolchain updates.

My hope is that we can get rid of panics without relying on the optimizer, and so we can run this check with opt-level = 0. Hopefully that will also make exact compiler versions a non-issue.

Sadly, with the latest stable compiler (1.73.0), even an empty function requires opt-level = 1 to pass:

// Only compiles with optimization
#[no_panic::no_panic]
#[test]
fn foo() {
}

Red Pen

Note: At the moment, redpen does not change its status code to indicate whether it found any issues. If we'd like to use redpen, I'll send a PR upstream to make it report its failures, and then we'll probably want to wait for a new release of redpen before we start using it.

+1

Con: Relatively new and unproven. It requires a shim crate to be added to zerocopy's [dependencies]. That shim crate is simple, but may be concerning for some of zerocopy's users. Also, if it disappears we'll have to tear it out of zerocopy's codebase, and would need another approach to avoid panics anyway.

I was thinking we'd just have this be a dev-dependency and only check it during cargo test in CI. Would that be a viable approach?

I suspect yes, but it's not obvious to me. redpen is its own binary, so it's unclear how to enable Rust's test harness when using it. It may or may not pay attention to command line flags and environment variables, so that may require some more work and an upstream contribution to enable.

jrvanwhy avatar Oct 24 '23 20:10 jrvanwhy

IIUC https://github.com/estebank/redpen/commit/887ea4fde5839b7d041d367f30009fe6d4161e39 should fix the specific example provided in https://github.com/google/zerocopy/issues/202#issuecomment-1769693893

kupiakos avatar Nov 30 '23 23:11 kupiakos

IIUC estebank/redpen@887ea4f should fix the specific example provided in #202 (comment)

Does this mean that redpen is now "complete" in the sense that it won't miss any panics?

joshlf avatar Dec 12 '23 01:12 joshlf

IIUC estebank/redpen@887ea4f should fix the specific example provided in #202 (comment)

Does this mean that redpen is now "complete" in the sense that it won't miss any panics?

I've been trying to answer this question, but I believe that I need an upstream bug fixed before I can answer this with confidence.

I'm seeing a bit of disconnect between this quote:

My hope is that we can get rid of panics without relying on the optimizer, and so we can run this check with opt-level = 0. Hopefully that will also make exact compiler versions a non-issue.

and this review comment:

Yeah, I'd like to be able to go back to panicking once we can prove that the panic is optimized out, or ideally to some other proven-with-the-type-system approach that gives us a guarantee that there will be no panic. The problem with ? is that it won't panic, but might introduce a new branch, and that's harder for us to detect. By contrast, if we can detect panics with something like redpen, then we can have higher confidence that the method is well-optimized.

I think the underlying question is: Do we want to avoid panics on a semantic (e.g. pre-optimization) level, or only avoid emitting panics in a optimized binary? We've touched on it when comparing tools but I haven't asked it directly.

Tools like Clippy and Redpen help us avoid panics on a semantic level. They will still point out panics that are "obviously" dead code. The upside of this is that if we eliminate panics on a semantic level, then we know that panics will not be emitted for zerocopy code regardless of how downstream projects use zerocopy (and what compiler settings they use). The downside is that sometimes it forces us to replace panics with "we hope this is never taken" branches, as I do in PR 544.

In contrast, linking-based checks allow us to avoid emitting panics in an optimized binary.

jrvanwhy avatar Jan 25 '24 23:01 jrvanwhy

I think a library like zerocopy should try to be semantically panic-free and validate through something independent of the optimizer like redpen. This is primarily because a library cannot reliably test every way it's used to detect if panics could possibly be generated, especially across compiler versions/configurations. If we can get redpen working I believe it's worth the work to guard against panics in something of limited scope like zerocopy.

However, part of my work is to progressively remove panics in a large embedded binary that was written without explicit concern for panics. There are some code constructs that would be extremely non-trivial to rewrite in a way that is both semantically panic-free and doesn't invoke unsafe. So, I really need linking-based checks as a last resort and as an initial guard before redpen is working. My plan is to mark everything that can be semantically panic-free with redpen, and then progressively used linking-based checks for the rest.

kupiakos avatar Feb 27 '24 19:02 kupiakos

Sorry, I definitely phrased my review comment badly.

A few points:

  • I agree with @kupiakos that semantic panic-freedom is ideal
  • I also agree that it often requires unsafe code in practice
  • In cases where unsafe is required, I haven't thought through how I want to handle that. We try to avoid unsafe at all costs, so introducing new unsafe just to take our code from "this panic gets optimized out in practice" to "this semantically doesn't panic" is a tough pill to swallow. I could be convinced, though, and depending on the use cases, I could also see some more general-purpose unsafe utilities that let us do this in a way that's reusable; it'd just take engineering work that we haven't had the cycles for so far
  • At an absolute minimum, a tool like redpen is useful to annotate all functions that don't currently panic so we can prevent regressions
  • It would also be good to use a "hack" like link-time assertions for the cases where we can't yet prevent panics semantically, although of course I consider this a last resort
  • The thing that worried me in my review comment was introducing a branch. On further reflection, a panic also introduces a branch, so I guess ? is strictly better? That said, maybe a panic coaxes the optimizer to try harder to prove that it's unreachable? I'd need to do more thinking/godbolt-ing.

I hope that provides some clarity. Let me know if you have any other questions or thoughts on how best to proceed.

joshlf avatar Mar 05 '24 18:03 joshlf

That said, maybe a panic coaxes the optimizer to try harder to prove that it's unreachable?

I spent some time looking into this and AFAICT the compiler doesn't try to optimize more to avoid panics. The cases I looked at were division by a non-constant (contains check-for-zero and branch to panic) and slice indexing.

For slice indexing, I basically had to remove all uses of it in favor of higher-level operations (iterators, split_at, etc.), or else get and get_mut. The optimizer does seem to consider the panic code paths "cold" pretty reliably.

briansmith avatar Mar 05 '24 18:03 briansmith

Ah okay cool that's good to know.

joshlf avatar Mar 05 '24 19:03 joshlf

We got bit by another panic path today, again from split_at. This diff, an unambiguous ergonomic improvement, introduced a new panic path detected via the linking method:

-    let Some(mut counts) = buf
-        .get_mut(..size_of::<ResetCounts>())
-        .and_then(Ref::<_, ResetCounts>::new)
-    else {
+    let Some(mut counts) = ResetCounts::mut_from_prefix(buf) else {

I don't have stats, but I've noticed split_at, split_at_mut, and copy_from_slice are usually the panic-causing culprits.

kupiakos avatar Mar 21 '24 22:03 kupiakos

We got bit by another panic path today, again from split_at. This diff, an unambiguous ergonomic improvement, introduced a new panic path detected via the linking method:

-    let Some(mut counts) = buf
-        .get_mut(..size_of::<ResetCounts>())
-        .and_then(Ref::<_, ResetCounts>::new)
-    else {
+    let Some(mut counts) = ResetCounts::mut_from_prefix(buf) else {

I don't have stats, but I've noticed split_at, split_at_mut, and copy_from_slice are usually the panic-causing culprits.

How are you detecting these panics? Do you have static tooling that you're using? We're considering landing #1071 to help reduce panic paths, and it'd be good to have a way to test to confirm that it's actually having the desired effect.

joshlf avatar Mar 27 '24 20:03 joshlf

How are you detecting these panics

@joshlf This was detected via the linking method, as mentioned. This is how we do it specifically:

  • Ensure that the #[panic_handler] function calls a #[no_mangle] function. Give that function a well-defined name like panic_is_possible. Hint extremely strongly to LLVM to never optimize or inline it with #[cold], #[inline(never)], and a body containing core::hint::black_box(()). I believe you can also make the #[panic_handler] itself directly the symbol being detected.
  • Build without stripping symbols - defer this to a later build step that invokes objcopy.
  • After build, check if the well-defined name is present in the non-stripped output binary. We currently use objdump on the object file and grep for the symbol.

This forbids panics in a single output binary. It does not function as designed for anything that isn't fully linked.

We also could've had the panic handler produce a linking failure by referencing an undefined symbol, but we determined that the developer experience was much worse than a tool explicitly saying "a panic was introduced in this binary, but it forbids them". This tooling is not public and is highly specific to our bespoke build system.

Do you have static tooling that you're using

I'm not entirely sure what you mean by "static tooling" here. It doesn't run the code to do the detection, so it's static in that way.

kupiakos avatar Mar 28 '24 18:03 kupiakos

Do you have a trick for working backwards from the #[no_mangle] function to the panic location?

jswrenn avatar Mar 28 '24 19:03 jswrenn

Do you have a trick for working backwards from the #[no_mangle] function to the panic location?

@jswrenn objdump -CSd <binary> | less, then search for the panic handler symbol - you'll find a jump somewhere. There are other binary analysis tools, but this is the quick and dirty way. If you've managed to eliminate the other panics from the binary and this is a new trigger, it's usually easy to find the specific function that does it. We don't currently have automatic tooling that points to a source location or anything like that.

It's highly sensitive to compiler and build changes; it's definitely non-ideal but better than re-introducing panic paths after we managed to remove all of them in a binary. It's not uncommon that the panic is in code you didn't directly touch due to the additional optimizer pressure introduced by the change.

kupiakos avatar Mar 30 '24 00:03 kupiakos

Related: https://github.com/google/zerocopy/issues/1125

joshlf avatar Apr 19 '24 18:04 joshlf