sentry-java

Improve grouping and stack trace linking for ANRs

Open markushi opened this issue 1 year ago • 3 comments

Description

Ideally we would have aggregated stacktraces for ANRs, allowing us to pinpoint the actual root cause of an ANR and improving our grouping on top.

We should look into this library as well: https://github.com/brendangregg/FlameGraph
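For context, the FlameGraph scripts consume "folded" stacks: one line per unique stack, frames joined by `;` from root to leaf, followed by a space and a sample count. Below is a minimal sketch of how sampled main thread stacktraces could be aggregated into that format. `FoldedStacks` and `fold` are hypothetical names for illustration and are not part of sentry-java.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration only; not part of sentry-java. */
final class FoldedStacks {

  /** Collapses sampled stacktraces into folded lines of the form "root;...;leaf count". */
  static List<String> fold(List<StackTraceElement[]> samples) {
    // TreeMap keeps the output order deterministic.
    Map<String, Integer> counts = new TreeMap<>();
    for (StackTraceElement[] sample : samples) {
      if (sample.length == 0) {
        continue; // nothing to aggregate for an empty sample
      }
      List<String> frames = new ArrayList<>();
      for (StackTraceElement frame : sample) {
        frames.add(frame.getClassName() + "." + frame.getMethodName());
      }
      // Java stacktraces list the innermost frame first; the folded format starts at the root.
      Collections.reverse(frames);
      counts.merge(String.join(";", frames), 1, Integer::sum);
    }
    List<String> lines = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      lines.add(entry.getKey() + " " + entry.getValue());
    }
    return lines;
  }
}
```

Identical stacks collapse into a single line with a higher count, so the widest bars in the resulting flame graph are the frames where the main thread spent most of the sampled time.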

markushi, Oct 31 '24 11:10

@Chog0 FYI, to keep you in the loop

markushi, Oct 31 '24 11:10

If we decide to implement flame charts in the issues UI, we should talk to the issues team.

romtsn, Feb 12 '25 14:02

Hey everyone, sorry for the long wait! We're still looking into this and I'm happy to share an update on the current progress. If you're interested, please take your time and continue with the lengthy read below - and let us know what you think about it! We're still looking into ways of providing all context to better understand and solve ANRs.

So far we have implemented a proof-of-concept and shipped it with a dogfooding app, which has a large enough reach to give us valuable insights into what kinds of ANRs we can expect to see out in the wild.

Outline of the PoC implementation

  • checks every 99ms if the main thread is still responsive
  • if the main thread is unresponsive for more than 1000ms, it starts capturing the main thread's stacktrace from a background thread every 99ms. This proved to have a negligible performance impact while providing good enough granularity for ANRs
  • if the main thread stays unresponsive for 4000ms or more, all collected stacktraces are analyzed and the culprit is reported as an ANR event to Sentry
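To make the timings above concrete, here is a rough sketch of the sampling logic as a small state machine. This is not the actual PoC code: `AnrSampler` and its methods are hypothetical names, and the responsiveness check itself is left to the caller (on Android this would typically be a `Runnable` posted to a `Handler` on the main `Looper`, which is an assumption about how one might wire it up). Timestamps are passed in explicitly so the logic can be exercised without real threads.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the timing logic; names are illustrative, not the PoC's. */
final class AnrSampler {
  static final long CHECK_INTERVAL_MS = 99;
  static final long SAMPLING_THRESHOLD_MS = 1000;
  static final long ANR_THRESHOLD_MS = 4000;

  private final List<StackTraceElement[]> samples = new ArrayList<>();
  private long lastResponsiveAtMs;
  private boolean reported;

  AnrSampler(long nowMs) {
    lastResponsiveAtMs = nowMs;
  }

  /** Call whenever the main thread proves it is alive again; drops any pending samples. */
  void onMainThreadResponded(long nowMs) {
    lastResponsiveAtMs = nowMs;
    reported = false;
    samples.clear();
  }

  /**
   * Call every CHECK_INTERVAL_MS from a background thread.
   * Returns the collected samples once the ANR threshold is reached, otherwise null.
   */
  List<StackTraceElement[]> onTick(long nowMs, Thread mainThread) {
    if (reported) {
      return null; // assumption: report each stall only once
    }
    long blockedMs = nowMs - lastResponsiveAtMs;
    if (blockedMs > SAMPLING_THRESHOLD_MS) {
      samples.add(mainThread.getStackTrace());
    }
    if (blockedMs >= ANR_THRESHOLD_MS) {
      reported = true;
      List<StackTraceElement[]> result = new ArrayList<>(samples);
      samples.clear();
      return result;
    }
    return null;
  }
}
```

With a 99ms tick, a stall that lasts just past the 4000ms threshold yields roughly 30 samples. Reporting each stall only once is a design choice of this sketch, not something the PoC is confirmed to do.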

The PoC was a partial success. It highlighted the culprit of certain ANRs better than before and thus reduced issue noise, but not to the level we had hoped for. Here's a list of the "problematic" cases we discovered, where even a flame graph does not provide enough context to solve the underlying ANR.

1. Single sample stacktraces

These are a bit mysterious, and can only be explained by the device being completely stuck. Even on a background thread we were unable to capture more than one stacktrace. It’s worth noting that in these situations not a single app frame was present within the stacktraces.

Proposed Solutions

  • Completely ignore (Officially called Mystery ANRs)
  • Aggregate into a single “Mystery ANR” issue, so at least it can be monitored and it wouldn’t create too much issue noise
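As a rough illustration of the second option, the check could look something like the sketch below. `MysteryAnrGrouping`, the `"mystery-anr"` fingerprint value and the `appPackage` parameter are all made up for this example. The `{{ default }}` value is Sentry's placeholder for keeping the default grouping.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch for illustration only; not part of sentry-java. */
final class MysteryAnrGrouping {

  /** True when at most one sample was captured and none of its frames belongs to the app. */
  static boolean isMysteryAnr(List<StackTraceElement[]> samples, String appPackage) {
    if (samples.size() > 1) {
      return false;
    }
    for (StackTraceElement[] sample : samples) {
      for (StackTraceElement frame : sample) {
        if (frame.getClassName().startsWith(appPackage + ".")) {
          return false; // an app frame is present, so the stack may still be actionable
        }
      }
    }
    return true;
  }

  /** Returns one shared fingerprint for mystery ANRs, otherwise the default grouping. */
  static List<String> fingerprintFor(List<StackTraceElement[]> samples, String appPackage) {
    return isMysteryAnr(samples, appPackage)
        ? Collections.singletonList("mystery-anr") // assumed value, chosen for this sketch
        : Collections.singletonList("{{ default }}");
  }
}
```

The returned list could then be set as the fingerprint of the ANR event before it is sent, so all of these end up in one issue that can still be monitored.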

2. Main thread congestion

In this case there’s no clear culprit, and the device just seems to be slow to respond, so tasks within the main thread are queuing up, causing issues with e.g. input dispatching.

Proposed Solutions

  • Completely ignore
  • Group into a single “Main thread congestion” issue
  • TBD?

3. Android OS level ANRs

In these situations, OS code seems to be the culprit for the ANR, e.g. accessibility services, the soft keyboard or some IPC Binder calls. These are mostly unactionable for a developer. A common example is a stack that ends in a super call, such as super.onPause() in an Activity.

Proposed Solutions

  • TBD?

markushi, Mar 24 '25 08:03