
Built-in monitoring & alerting for stuck / unprocessed envelopes

Open dalnoki opened this issue 3 months ago • 23 comments

Problem Statement

A customer encountered an issue where their application silently stopped sending events to Sentry. After local investigation, they discovered a number of .envelope files accumulating in the SDK’s local Sentry folder including very large files (~2GB) and very old files (dating back to February).

After manually clearing the folder, the Sentry SDK immediately resumed normal operation.

To prevent this kind of silent failure, they are requesting a built-in mechanism in the Sentry SDK that can detect envelope backlogs and surface an alert or warning to the Sentry dashboard (or at least application logs) so teams can act before ingestion silently stops.
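Until such a mechanism exists in the SDK, the check can live in the application itself. Below is a minimal standalone-Java sketch that scans a cache directory for stale `.envelope` files; the directory path, the two-week threshold, and the alerting hook are all hypothetical placeholders, not Sentry SDK API:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class EnvelopeBacklogCheck {

    /** Returns all .envelope files in dir last modified more than maxAge ago. */
    public static List<Path> findStale(Path dir, Duration maxAge) throws IOException {
        List<Path> stale = new ArrayList<>();
        Instant cutoff = Instant.now().minus(maxAge);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.envelope")) {
            for (Path file : files) {
                if (Files.getLastModifiedTime(file).toInstant().isBefore(cutoff)) {
                    stale.add(file);
                }
            }
        }
        return stale;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical path: use the same value you pass to cacheDirPath.
        Path cacheDir = Path.of("/var/lib/myapp/sentry");
        if (!Files.isDirectory(cacheDir)) {
            return; // nothing to check on this machine
        }
        List<Path> stale = findStale(cacheDir, Duration.ofDays(14));
        if (!stale.isEmpty()) {
            // Alert via a channel other than Sentry itself (logs, email, a
            // metrics counter), since Sentry may be exactly what is stuck.
            System.err.println("WARNING: " + stale.size() + " stale Sentry envelopes");
        }
    }
}
```

Run periodically (cron, a scheduled executor, etc.), this gives the early warning the issue asks for without waiting on SDK support.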

Solution Brainstorm

No response

dalnoki avatar Nov 27 '25 09:11 dalnoki

JAVA-254

linear[bot] avatar Nov 27 '25 09:11 linear[bot]

@dalnoki is this Android, JVM Desktop or something else?

adinauer avatar Dec 01 '25 09:12 adinauer

Hi, original reporter here.

This is on JVM 21, Temurin runtime. Desktop I guess? I mean: we're running it on a server 😁.

We're using Java modules io.sentry:sentry and io.sentry:sentry-logback both version 8.26.0.

On Nov. 6th we switched from 8.23.0 to 8.25.0.
On Nov. 18th we switched from 8.25.0 to 8.26.0.

The oldest unsent .envelope file in the system was from February, so none of these versions was able to clean up the backlog. We noticed that Sentry had stopped sending because we send a "startup warning" on every release in order to verify we didn't break anything. We release every week, and upon inspection the Sentry SDK stopped sending messages somewhere halfway through a release cycle.

Unfortunately, the engineer that worked on the production machine removed the envelopes instead of moving them for later inspection. So I can't tell you what was in the envelopes so you can reproduce.

Please let me know if we can be of any (further) assistance.

mhogerheijde avatar Dec 01 '25 10:12 mhogerheijde

Thanks for the additional details @mhogerheijde.

Just to make sure, you did configure cacheDirPath, correct?

Let me check if there's some cleanup we perform on Android but don't on JVM.

adinauer avatar Dec 01 '25 11:12 adinauer

@mhogerheijde another question, did you configure maxCacheItems? If so, what value?

adinauer avatar Dec 01 '25 11:12 adinauer

Thanks for the additional details @mhogerheijde.

Just to make sure, you did configure cacheDirPath, correct?

Let me check if there's some cleanup we perform on Android but don't on JVM.

Yes, we do set the cacheDirPath. That path is bind-mounted to the host that runs the Docker container.

I suspect docker containers restarting might have something to do with this maybe?

mhogerheijde avatar Dec 01 '25 16:12 mhogerheijde

@mhogerheijde another question, did you configure maxCacheItems? If so, what value?

We didn't set this parameter

mhogerheijde avatar Dec 01 '25 17:12 mhogerheijde

@mhogerheijde another question, did you configure maxCacheItems? If so, what value?

We didn't set this parameter

Correction: we do set this to 100

mhogerheijde avatar Dec 01 '25 17:12 mhogerheijde

@mhogerheijde another question, did you configure maxCacheItems? If so, what value?

We didn't set this parameter

Correction: we do set this to 100

Ah! And I just looked at the dir-listing we did when investigating: there were exactly 100 old .envelope files in the folder.

So the fact that Sentry stopped sending conforms to the configuration, but it would be nice if Sentry somehow recognised that its queue ran full and it is unable to send new messages, which is kind of a catch-22 🙃

mhogerheijde avatar Dec 01 '25 17:12 mhogerheijde
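For reference, the two settings discussed above are plain `SentryOptions` setters when initializing without a framework. A minimal sketch (the DSN and path are placeholders; 100 mirrors the reporter's configuration, not a recommendation):

```java
import io.sentry.Sentry;

public class SentryConfig {
  public static void main(String[] args) {
    Sentry.init(options -> {
      options.setDsn("https://examplePublicKey@o0.ingest.sentry.io/0"); // placeholder DSN
      // Directory where unsent envelopes are buffered on disk.
      options.setCacheDirPath("/var/lib/myapp/sentry"); // hypothetical path
      // Upper bound on buffered envelopes; once reached, older behaviour
      // like the one described in this thread can surface.
      options.setMaxCacheItems(100);
    });
  }
}
```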

Thanks for confirming, will try to reproduce.

adinauer avatar Dec 02 '25 09:12 adinauer

I just tested some things and I was unable to exactly reproduce the problem. While offline, the .envelope files are stored in the configured cacheDirPath. Once full (maxCacheItems), the SDK starts dropping old files and replaces them with new files. Once connection is restored, the SDK is able to send new events to Sentry successfully.

Files in cacheDirPath will remain there even after the connection is restored. There is an integration that sends them; it can be configured like this:

  @Bean
  public Sentry.OptionsConfiguration<SentryOptions> sentryOptionsConfiguration() {
    return options -> {
      options.addIntegration(
          new SendCachedEnvelopeFireAndForgetIntegration(
              new SendFireAndForgetEnvelopeSender(options::getCacheDirPath)));
    };
  }

Note: This is for Spring Boot, you may only need the addIntegration line.

This still only sends cached envelopes on startup of the application, so it may not be enough for your use case. You could also implement the IConnectionStatusProvider so that sending is triggered once the connection is restored.

Sorry about this being very inconvenient to use - it's rarely used in non-Android applications.

I was not able to reproduce the SDK not sending events anymore once the cache is full.

I'm also surprised to see 2GB of files with a limit of 100 files. That's ~20MB per file, which Sentry would fail to ingest. One thing that might help you out here is our new options:

  • https://docs.sentry.io/platforms/java/guides/spring-boot/configuration/options/#enableEventSizeLimiting
  • https://docs.sentry.io/platforms/java/guides/spring-boot/configuration/options/#onOversizedEvent

These should help reduce event size / allow you to drop what you don't need.

On second thought the size of the events might even be related to your described behaviour. Let me test this. In the meantime could you please try the new options and see if this solves your problem?

adinauer avatar Dec 02 '25 10:12 adinauer
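A sketch of how the two linked options might be wired up. Note: the setter names and callback shape below are assumptions inferred from the option names in the docs, not verified API; check the linked pages for the exact signatures before using this:

```java
Sentry.init(options -> {
  // Assumed setter for the documented enableEventSizeLimiting option.
  options.setEnableEventSizeLimiting(true);
  // Assumed callback shape for onOversizedEvent: shrink the event and
  // return it, or return null to drop it entirely.
  options.setOnOversizedEvent((event, hint) -> {
    event.setExtras(null); // example: strip bulky context before retrying
    return event;
  });
});
```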

Looks like oversized events are in fact the problem here.

This is likely related to https://github.com/getsentry/sentry-java/issues/4921

adinauer avatar Dec 02 '25 12:12 adinauer

I'm also surprised to see 2GB of files with a limit of 100 files. That's ~20MB per file, which Sentry would fail to ingest.

To clarify: not 2GiB total, multiple files were 2GiB each

mhogerheijde avatar Dec 03 '25 11:12 mhogerheijde

Looks like oversized events are in fact the problem here.

This is likely related to https://github.com/getsentry/sentry-java/issues/4921

Check! I will take a look at the suggestions you made above.

mhogerheijde avatar Dec 03 '25 11:12 mhogerheijde

To be pedantic: bugs will always exist, and there may even be issues with the queue outside the Sentry SDK. So my request was not solely about finding the root cause of stale envelopes, but rather about having some alerting when they exist.

We now just have a cronjob that lists the folder and sends an alert when envelopes older than 2 weeks exist.

mhogerheijde avatar Dec 03 '25 11:12 mhogerheijde

having some alerting when they exist.

This already exists in Sentry UI but is quite hidden away under Settings | Status & Usage. Are you looking for a hook in the SDK?

To clarify: not 2GiB total, multiple files were 2GiB each

Sentry will not be able to ingest envelopes this big so you should find out what part of the event is large and then drop those in the onOversizedEvent callback. This might mean dropping certain fields or substringing things like the message.

It's even better if you can avoid adding this much data to an event in the first place as having to handle 2GB events most likely slows down the SDK and your application.

adinauer avatar Dec 03 '25 12:12 adinauer
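The "substringing" suggested above needs no SDK support; a small standalone helper is enough. The 8 KB limit in the usage example is an arbitrary illustration, not a Sentry constant:

```java
public class MessageTruncator {

    /** Truncates msg to at most maxLen chars, appending a marker when cut. */
    public static String truncate(String msg, int maxLen) {
        if (msg == null || msg.length() <= maxLen) {
            return msg;
        }
        String marker = "… [truncated]";
        return msg.substring(0, Math.max(0, maxLen - marker.length())) + marker;
    }

    public static void main(String[] args) {
        // e.g. an exception message carrying a huge generated SQL query
        String longSql = "SELECT * FROM t WHERE id IN (" + "1,".repeat(100_000) + "1)";
        String bounded = truncate(longSql, 8 * 1024);
        System.out.println(bounded.length()); // never exceeds the limit
    }
}
```

Calling this on known-risky fields (exception messages, breadcrumb data) before they reach Sentry keeps envelopes small regardless of what the SDK does later.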

There's also https://github.com/getsentry/sentry/issues/103139 for some alerting in the Sentry product in case you would like to also comment there and describe what you'd expect Sentry to do.

adinauer avatar Dec 03 '25 13:12 adinauer

having some alerting when they exist.

This already exists in Sentry UI but is quite hidden away under Settings | Status & Usage. Are you looking for a hook in the SDK?

Correct me if I'm wrong, but this seems to show things that were blocked on the Sentry.io end? In our case the Sentry SDK was blocked from sending due to a full queue; can we detect that in the UI too?

To clarify: not 2GiB total, multiple files were 2GiB each

Sentry will not be able to ingest envelopes this big so you should find out what part of the event is large and then drop those in the onOversizedEvent callback. This might mean dropping certain fields or substringing things like the message.

It's even better if you can avoid adding this much data to an event in the first place as having to handle 2GB events most likely slows down the SDK and your application.

Good to know! I don't know why these were this large; that was not intentional on our part. We will have a look at that callback so it alerts us when this happens and we can investigate.

mhogerheijde avatar Dec 08 '25 09:12 mhogerheijde

We had a restart of our server today, and we are left with 13 unsent envelopes:

.../sentry/846e2159c1635d9c7012a115a1da70403d376b04$ ls -alsh
total 3,0M
4,0K drwxr-xr-x 3 root root 4,0K dec  8 09:16 .
4,0K drwxr-xr-x 3 root root 4,0K nov 24 09:56 ..
 24K -rw-r--r-- 1 root root  24K dec  5 08:14 2bc0884284de471382d3fba7e0fe1ad4.envelope
 24K -rw-r--r-- 1 root root  22K dec  5 08:35 53d8cc0aa3b64303b158e03c9281fb11.envelope
2,7M -rw-r--r-- 1 root root 2,7M dec  2 08:45 78209d18ebce4877969c4702015b64f6.envelope
 24K -rw-r--r-- 1 root root  22K dec  5 08:40 816b29f32111405ea534a292a39d2d09.envelope
 24K -rw-r--r-- 1 root root  22K dec  5 08:40 8687919cc7354ddabe56103c548a7749.envelope
 24K -rw-r--r-- 1 root root  22K dec  5 08:30 91bcfd4882a64757b31547297f1c01c0.envelope
 24K -rw-r--r-- 1 root root  22K dec  5 08:45 aeb9de3e40404a9885a0bf277eaf39f2.envelope
 24K -rw-r--r-- 1 root root  22K dec  5 08:45 bc795219f1ef498cbaba360756e23747.envelope
 24K -rw-r--r-- 1 root root  22K dec  5 08:30 d7f819003c6f47d2983a34d116d9b2b2.envelope
 24K -rw-r--r-- 1 root root  22K dec  5 08:24 ea6969f78e5343de8538ac672ef86732.envelope
 24K -rw-r--r-- 1 root root  22K dec  5 08:24 f1806642a71c4252a6a457c2013f680b.envelope
 24K -rw-r--r-- 1 root root  22K dec  5 08:19 f5cd63ab95a845248e7cf6319215cfd5.envelope
 24K -rw-r--r-- 1 root root  24K dec  5 08:14 f8c2deeb7a23430fa01c78a8c84c859c.envelope
4,0K drwxr-xr-x 2 root root 4,0K nov 24 09:56 outbox

Server time is UTC.

Our backend is currently up and running. Sentry is reporting, but these envelopes are still lingering in the queue. The current server time is:

$ date
ma  8 dec 2025  9:56:55 UTC

I can share the envelopes with you if you want, but since this is our production server, I'd rather not share those traces publicly.

mhogerheijde avatar Dec 08 '25 09:12 mhogerheijde

FYI, the envelopes all seem to be valid JSON to me (they parse as such), and a quick visual inspection didn't seem to indicate something wrong with these files.

mhogerheijde avatar Dec 10 '25 08:12 mhogerheijde

Thank you for the additional information, @mhogerheijde. My colleague @adinauer is currently OOO and will be back next Monday to continue looking into this. In the meantime you can send me a couple of those envelopes via email at [email protected] and I can quickly check whether they are valid.

lcian avatar Dec 10 '25 08:12 lcian

@lcian thanks for the update. There is currently no rush or urgency on this for us. We've changed our monitoring to also look at the cache size, and if it grows too much we get alerts.

I'll send the captured envelopes to you soon.

Just to make sure, you did configure cacheDirPath, correct?

We do set cacheDirPath. Inside the container it is set to /etc/hiber/sentry. That path is mounted onto the docker host filesystem, so the cache directory is persisted across containers

mhogerheijde avatar Dec 10 '25 14:12 mhogerheijde

@mhogerheijde I've reviewed one of your envelopes. The culprit (at least in that one) is an oversized value in exceptions.values: an exception message containing a very long SQL query. The result is that Relay rejects the envelope with status 400. Note that the envelope itself is not oversized, so I assume the SDK treats it differently from an oversized envelope and ends up keeping it in the cache. I'll leave it to @adinauer to determine the best path forward, as this seems related to the SDK truncation project we're working on.

lcian avatar Dec 11 '25 15:12 lcian