Very high disk activity caused by Redis since upgrading from 22.06.0 to 22.10.0

Open edhgoose opened this issue 3 years ago • 24 comments

Self-Hosted Version

22.10.0

CPU Architecture

x86_64

Docker Version

20.10.17, build 100c701

Docker Compose Version

1.29.2, build 5becea4c

Steps to Reproduce

We can't seem to reproduce the issue. It is happening on a semi-regular basis, suggesting it might be related to a cron job or something similar.

Expected Result

We have an AWS EC2 instance running the self hosted version of Sentry.

Since upgrading to 22.10 we have found that we get periodic, long spikes of intense disk read activity. During these periods the EC2 instance becomes unusable and crashes. We must reboot the EC2 instance to recover.

This is a graph of Read Ops and Write Ops combined: CleanShot 2022-11-02 at 20 21 35@2x

And just of Read Ops: CleanShot 2022-11-02 at 20 22 12@2x

The instance is unreachable via SSH, although Amazon says the instance is still up. ping is failing, and our AWS ALB health check reports unhealthy too.

We have upgraded the EBS volume from gp2 to gp3 to see if that would help, but so far no luck.

We have fed all our logs for Sentry into Cloudwatch but have not been able to spot an obvious candidate as to the cause for this problem. We would appreciate some guidance on where to look.

So far, some patterns and thoughts we have identified:

  • We use Sentry Relay, and have observed that we get a significant number of requests around the point the issue occurs. However, this appears to be a spike due to the problem, rather than causing the problem.
  • We do see errors in the logs occasionally about redis, like:

redis.exceptions.ResponseError: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.

However, these appear to happen after the instance has restarted and do not appear to be the cause.

  • We believe we're seeing errors like this one as a result of Sentry restarting/recovering, but we also believe these are red herrings (a quick check for this is sketched just after this list).
  • Our disk is a 500 GB EBS gp3 volume, with approximately 50% free disk space available.
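For anyone hitting the same MISCONF error, a quick way to check whether Redis is actually failing its background saves is to ask it directly. A minimal sketch, assuming the stock self-hosted compose file where the Redis service is named redis:

# Check the last background-save result and the relevant safety switch.
docker-compose exec redis redis-cli INFO persistence | grep -E 'rdb_last_bgsave_status|rdb_bgsave_in_progress'
docker-compose exec redis redis-cli CONFIG GET stop-writes-on-bgsave-error

If rdb_last_bgsave_status reports err, the MISCONF errors mean Redis genuinely cannot persist; if it reports ok, they are more likely transient fallout from the restarts.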

Actual Result

We would expect Sentry (or dependent tools) not to hit disk as intensively, and not to crash :)

Event ID

No response

edhgoose avatar Nov 02 '22 20:11 edhgoose

We left the disk churning overnight. The issue started around 19:58, and by 23:50 the disk read activity had returned to normal. When we woke up in the morning, the docker containers were in a partially healthy state. The sentry web app was up (which is good!) - but it was reporting an error.

docker-compose ps reported:

sentry-self-hosted_clickhouse_1                                 /entrypoint.sh                   Up (healthy)     8123/tcp, 9000/tcp, 9009/tcp
sentry-self-hosted_cron_1                                       /etc/sentry/entrypoint.sh  ...   Up               9000/tcp
sentry-self-hosted_geoipupdate_1                                /usr/bin/geoipupdate -d /s ...   Exit 1
sentry-self-hosted_ingest-consumer_1                            /etc/sentry/entrypoint.sh  ...   Exit 1
sentry-self-hosted_kafka_1                                      /etc/confluent/docker/run        Up (healthy)     9092/tcp
sentry-self-hosted_memcached_1                                  docker-entrypoint.sh memcached   Up (healthy)     11211/tcp
sentry-self-hosted_nginx_1                                      /docker-entrypoint.sh ngin ...   Up               0.0.0.0:9000->80/tcp,:::9000->80/tcp
sentry-self-hosted_post-process-forwarder-errors_1              /etc/sentry/entrypoint.sh  ...   Restarting
sentry-self-hosted_post-process-forwarder-transactions_1        /etc/sentry/entrypoint.sh  ...   Exit 1
sentry-self-hosted_postgres_1                                   /opt/sentry/postgres-entry ...   Up (healthy)     5432/tcp
sentry-self-hosted_redis_1                                      docker-entrypoint.sh redis ...   Up (unhealthy)   6379/tcp
sentry-self-hosted_relay_1                                      /bin/bash /docker-entrypoi ...   Up               3000/tcp
sentry-self-hosted_sentry-cleanup_1                             /entrypoint.sh 0 0 * * * g ...   Up               9000/tcp
sentry-self-hosted_smtp_1                                       docker-entrypoint.sh exim  ...   Up               25/tcp
sentry-self-hosted_snuba-api_1                                  ./docker_entrypoint.sh api       Up               1218/tcp
sentry-self-hosted_snuba-cleanup_1                              /entrypoint.sh */5 * * * * ...   Up               1218/tcp
sentry-self-hosted_snuba-consumer_1                             ./docker_entrypoint.sh con ...   Up               1218/tcp
sentry-self-hosted_snuba-outcomes-consumer_1                    ./docker_entrypoint.sh con ...   Exit 1
sentry-self-hosted_snuba-replacer_1                             ./docker_entrypoint.sh rep ...   Up               1218/tcp
sentry-self-hosted_snuba-sessions-consumer_1                    ./docker_entrypoint.sh con ...   Up               1218/tcp
sentry-self-hosted_snuba-subscription-consumer-events_1         ./docker_entrypoint.sh sub ...   Up               1218/tcp
sentry-self-hosted_snuba-subscription-consumer-transactions_1   ./docker_entrypoint.sh sub ...   Up               1218/tcp
sentry-self-hosted_snuba-transactions-cleanup_1                 /entrypoint.sh */5 * * * * ...   Up               1218/tcp
sentry-self-hosted_snuba-transactions-consumer_1                ./docker_entrypoint.sh con ...   Exit 1
sentry-self-hosted_subscription-consumer-events_1               /etc/sentry/entrypoint.sh  ...   Restarting
sentry-self-hosted_subscription-consumer-transactions_1         /etc/sentry/entrypoint.sh  ...   Exit 1
sentry-self-hosted_symbolicator-cleanup_1                       /entrypoint.sh 55 23 * * * ...   Up               3021/tcp
sentry-self-hosted_symbolicator_1                               /bin/bash /docker-entrypoi ...   Up               3021/tcp
sentry-self-hosted_web_1                                        /etc/sentry/entrypoint.sh  ...   Up (healthy)     9000/tcp
sentry-self-hosted_worker_1                                     /etc/sentry/entrypoint.sh  ...   Exit 1
sentry-self-hosted_zookeeper_1                                  /etc/confluent/docker/run        Up (healthy)     2181/tcp, 2888/tcp, 3888/tcp

Running docker-compose up sorted us out.

The total disk ops activity looked like this - so something was really aggressively reading from disk for a long period, and then finished. The tiny blip around 8am is when we rebooted Sentry.

CleanShot 2022-11-03 at 16 34 06@2x

We can't see anything obvious in our logs - and unfortunately no logs were written to AWS during that period.

We're now monitoring iotop to see if there are any clues in there.
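Roughly what we're running, as a sketch (assumes iotop is available on the host):

# Log only processes actually doing I/O, in batch mode with timestamps,
# so spikes can be matched up against the CloudWatch graphs later.
sudo iotop -o -b -t -q -d 1 -n 600 >> /var/log/iotop.log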

We see ClickHouse is the primary reader, and Redis is the primary writer (but Redis only writes every now and again - in big chunks of about 1 GB). ClickHouse doesn't actually look to be doing much - SHOW PROCESSLIST doesn't show anything obvious, and the table sizes don't look absurd:

┌─table─────────────────┬─size───────┬─────rows─┬─days─┬─avgDaySize─┐
│ outcomes_raw_local    │ 333.23 MiB │ 19437398 │    0 │ inf YiB    │
│ transactions_local    │ 1.74 GiB   │ 12518960 │    0 │ inf YiB    │
│ metric_log            │ 1.18 GiB   │ 11858191 │  138 │ 8.73 MiB   │
│ errors_local          │ 121.11 MiB │   398028 │    0 │ inf YiB    │
│ trace_log             │ 13.83 MiB  │   236079 │  138 │ 102.64 KiB │
│ sessions_raw_local    │ 260.48 KiB │    10931 │    0 │ inf YiB    │
│ outcomes_hourly_local │ 113.43 KiB │     7938 │    0 │ inf YiB    │
│ sessions_hourly_local │ 68.59 KiB  │      638 │    0 │ inf YiB    │
│ migrations_local      │ 2.30 KiB   │       97 │    0 │ inf YiB    │
└───────────────────────┴────────────┴──────────┴──────┴────────────┘
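For reference, the per-table sizes and row counts above can be pulled from ClickHouse's system.parts table with a query along these lines (just a sketch, using the clickhouse service from the compose file):

docker-compose exec clickhouse clickhouse-client --query "
  SELECT table,
         formatReadableSize(sum(bytes_on_disk)) AS size,
         sum(rows) AS rows
  FROM system.parts
  WHERE active
  GROUP BY table
  ORDER BY sum(bytes_on_disk) DESC"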

Postgres is definitely the largest user of the disk, so we will attempt to look at that next.

edhgoose avatar Nov 03 '22 16:11 edhgoose

I'm wondering if this has something to do with https://github.com/getsentry/self-hosted/issues/1787

hubertdeng123 avatar Nov 04 '22 04:11 hubertdeng123

We've been looking much more into this today.

While last night everything was fine, this morning we had the same problem around 3am - disk ops spiked, the service crashed, everything fell over.

We found - and we don't know if these are correlations or causations - that:

  • The worker service seems to be the source of our woes - something within it is causing problems.
  • We originally suspected the cleanup and cron jobs too, but we eliminated cleanup quite quickly. cron could still be a contender, but it looks like worker is the more likely one.
  • We identified that we didn't have logging turned on for the worker or the cron container - I think when we turned on logging we only included things we thought might affect the web, so we left those two out. We've now got logs for them both.

We also think mail might be correlated with the problem.

We're using AWS SES to send emails over SMTP. We noticed that there does appear to be a bit of a correlation between when these crashes occur and how many emails we're sending. We're disabling mail by switching the mail backend to console and seeing if that makes a difference.

It does seem a bit far-fetched, but we're going to try it out.

edhgoose avatar Nov 04 '22 18:11 edhgoose

For reference, here's our docker-compose.override.yml:

x-logging-config: &logging_config
  environment:
    SENTRY_LOG_LEVEL: DEBUG
  logging:
    driver: "awslogs"
    options:
      awslogs-region: "eu-west-1"
      awslogs-group: "sentry"
      tag: "{{.Name}}/{{.ID}}"
services:
  clickhouse:
    <<: *logging_config
  kafka:
    <<: *logging_config
  redis:
    <<: *logging_config
  postgres:
    <<: *logging_config
  zookeeper:
    <<: *logging_config
  snuba-api:
    <<: *logging_config
  snuba-outcomes-consumer:
    <<: *logging_config
  snuba-sessions-consumer:
    <<: *logging_config
  snuba-transactions-consumer:
    <<: *logging_config
  snuba-replacer:
    <<: *logging_config
  snuba-cleanup:
    <<: *logging_config
  snuba-transactions-cleanup:
    <<: *logging_config
  web:
    <<: *logging_config
    extra_hosts:
      - host.docker.internal:host-gateway
  cron:
    <<: *logging_config
    extra_hosts:
      - host.docker.internal:host-gateway
  worker:
    <<: *logging_config
    extra_hosts:
      - host.docker.internal:host-gateway
  ingest-consumer:
    <<: *logging_config
    extra_hosts:
      - host.docker.internal:host-gateway
  post-process-forwarder-transactions:
    <<: *logging_config
    extra_hosts:
      - host.docker.internal:host-gateway
  post-process-forwarder-errors:
    <<: *logging_config
    extra_hosts:
      - host.docker.internal:host-gateway
  subscription-consumer-events:
    <<: *logging_config
    extra_hosts:
      - host.docker.internal:host-gateway
  subscription-consumer-transactions:
    <<: *logging_config
    extra_hosts:
      - host.docker.internal:host-gateway
  sentry-cleanup:
    <<: *logging_config
    extra_hosts:
      - host.docker.internal:host-gateway
  nginx:
    <<: *logging_config
  relay:
    <<: *logging_config

edhgoose avatar Nov 04 '22 18:11 edhgoose

It's crashed again... which implies our mail theory isn't right. Although we have noticed it seems to be staying up and recovering rather than completely crashing.

We've also noticed that the oom-killer is getting called and seems to be a contributor to the crashes - which would make sense. And I've got a graph of memory usage that implies it's something about memory:

CleanShot 2022-11-05 at 13 28 25@2x
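For what it's worth, the oom-killer activity shows up in the kernel log on the host, which is one way to confirm which process is being killed. A sketch (the exact message format varies by kernel):

# Recent OOM-killer events with human-readable timestamps.
sudo dmesg -T | grep -iE 'out of memory|oom-killer|killed process'

# Or, on systemd hosts, search the kernel journal for the same messages.
sudo journalctl -k --since "24 hours ago" | grep -iE 'oom|killed process'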

So maybe the redis theory is right, given that Redis is the highest user of memory according to docker stats:

1cc5a59ce662   sentry-self-hosted_nginx_1                                      0.09%     4.355MiB / 15.44GiB   0.03%     1.19GB / 1.22GB   25.1GB / 0B       5
34904d3fed80   sentry-self-hosted_relay_1                                      0.76%     635.1MiB / 15.44GiB   4.02%     1.13GB / 6.25GB   91.6GB / 0B       46
956605a56494   sentry-self-hosted_post-process-forwarder-transactions_1        0.31%     117.3MiB / 15.44GiB   0.74%     2.97MB / 2.3MB    2.02GB / 0B       11
59408f368f5f   sentry-self-hosted_subscription-consumer-transactions_1         0.22%     116.1MiB / 15.44GiB   0.73%     2.85MB / 1.99MB   2.24GB / 0B       6
9f009fecf548   sentry-self-hosted_sentry-cleanup_1                             0.00%     600KiB / 15.44GiB     0.00%     989kB / 465kB     472MB / 30.7kB    1
1f597341a95b   sentry-self-hosted_post-process-forwarder-errors_1              0.42%     136.2MiB / 15.44GiB   0.86%     49.1MB / 1.72MB   183MB / 0B        13
13a1e52891ea   sentry-self-hosted_cron_1                                       0.00%     157.9MiB / 15.44GiB   1.00%     6.55MB / 57.4MB   3.76GB / 6.83MB   4
d9a9476857b5   sentry-self-hosted_web_1                                        0.01%     593.6MiB / 15.44GiB   3.75%     42MB / 84.9MB     19.1GB / 0B       34
83c61cdfc5e2   sentry-self-hosted_ingest-consumer_1                            0.26%     126.6MiB / 15.44GiB   0.80%     241MB / 255MB     35.4MB / 0B       7
457602f1ba25   sentry-self-hosted_subscription-consumer-events_1               0.24%     116.4MiB / 15.44GiB   0.74%     1.79MB / 1.22MB   126MB / 0B        6
531e85eb600f   sentry-self-hosted_worker_1                                     0.97%     645.9MiB / 15.44GiB   4.08%     436MB / 1.08GB    1.56GB / 0B       41
598795b34e5c   sentry-self-hosted_snuba-transactions-consumer_1                0.55%     85.22MiB / 15.44GiB   0.54%     347MB / 15.3MB    26.8MB / 0B       10
ee64b3406876   sentry-self-hosted_snuba-subscription-consumer-events_1         0.34%     77.25MiB / 15.44GiB   0.49%     2.24MB / 1.32MB   129MB / 0B        9
c06eb9e4b632   sentry-self-hosted_snuba-transactions-cleanup_1                 0.00%     480KiB / 15.44GiB     0.00%     680kB / 391kB     9.69GB / 32.3kB   1
c0a813c3c68e   sentry-self-hosted_snuba-outcomes-consumer_1                    0.53%     77.31MiB / 15.44GiB   0.49%     6.06MB / 4.34MB   27.7MB / 0B       10
19182d2461bc   sentry-self-hosted_snuba-sessions-consumer_1                    0.37%     78.49MiB / 15.44GiB   0.50%     56.8MB / 39.1MB   8.39GB / 0B       10
34cf34263bc8   sentry-self-hosted_snuba-api_1                                  0.04%     90.58MiB / 15.44GiB   0.57%     4.39MB / 12.9MB   12.6GB / 0B       6
0c4b90059d72   sentry-self-hosted_snuba-cleanup_1                              0.00%     608KiB / 15.44GiB     0.00%     680kB / 389kB     9.73GB / 30.2kB   1
1f9e6759a0e6   sentry-self-hosted_snuba-subscription-consumer-transactions_1   0.35%     76.2MiB / 15.44GiB    0.48%     58.6MB / 41.4MB   9.68GB / 0B       9
a6c3ffd5aeb4   sentry-self-hosted_snuba-replacer_1                             0.16%     76.8MiB / 15.44GiB    0.49%     2.82MB / 1.92MB   1.42GB / 0B       5
63415d42fa5a   sentry-self-hosted_snuba-consumer_1                             0.30%     79.82MiB / 15.44GiB   0.50%     8.03MB / 2.8MB    192MB / 0B        10
3a31087c771b   sentry-self-hosted_kafka_1                                      2.34%     811.3MiB / 15.44GiB   5.13%     13.7GB / 14.5GB   61.3GB / 290MB    75
e5dab8fbf5be   sentry-self-hosted_smtp_1                                       0.00%     1.102MiB / 15.44GiB   0.01%     327kB / 183B      1.24GB / 0B       2
e6471bb4d9b7   sentry-self-hosted_redis_1                                      0.26%     9.683GiB / 15.44GiB   62.69%    579MB / 254MB     2.44GB / 28.4GB   5
bdd519bc1ef0   sentry-self-hosted_symbolicator_1                               0.00%     6.738MiB / 15.44GiB   0.04%     327kB / 0B        7.7GB / 0B        14
02a7677bc98d   sentry-self-hosted_postgres_1                                   0.04%     174.6MiB / 15.44GiB   1.10%     2.48GB / 213MB    50.3GB / 11.5GB   12
3611090a51f6   sentry-self-hosted_clickhouse_1                                 5.59%     444.6MiB / 15.44GiB   2.81%     332MB / 55.1MB    129GB / 9.57MB    75
56f235ca22fa   sentry-self-hosted_zookeeper_1                                  0.12%     95.69MiB / 15.44GiB   0.61%     2.03MB / 1.13MB   21.7GB / 240kB    48
ea1ab29f6086   sentry-self-hosted_memcached_1                                  0.11%     42.38MiB / 15.44GiB   0.27%     242MB / 794MB     10.6GB / 0B       10
bf5c6f7e0026   sentry-self-hosted_symbolicator-cleanup_1                       0.00%     492KiB / 15.44GiB     0.00%     327kB / 0B        752MB / 31.7kB    1

I'm not sure what normal operation would be though - so I'm unsure if 9 GiB for Redis is a lot or not.
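One way to sanity-check where that memory is going is Redis' own accounting - a small sketch using stock redis-cli commands:

# High-level memory stats: dataset size, RSS, peak and fragmentation.
docker-compose exec redis redis-cli INFO memory | grep -E 'used_memory_human|used_memory_rss_human|used_memory_peak_human|mem_fragmentation_ratio'

# Redis' own opinion on whether memory usage looks sane.
docker-compose exec redis redis-cli MEMORY DOCTOR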

I've installed this monitoring script for redis and this one for docker to see if there are any clues in there.

edhgoose avatar Nov 05 '22 14:11 edhgoose

P.s. it might be useful to know: We're using a t3.xlarge AWS instance. That's 4 vCPUs and 16 GB of memory.

edhgoose avatar Nov 05 '22 14:11 edhgoose

Still continuing to crash. Not seeing anything especially obvious in our logs. oom-killer does seem like the thing responsible for killing the service.

We're trying a new redis config:

In docker-compose.override.yml:

  redis:
    <<: *logging_config
    command: redis-server /usr/local/etc/redis/redis.conf
    volumes:
      - ./redis.conf:/usr/local/etc/redis/redis.conf
    ports:
      - "6379:6379"

And redis.conf:

# Redis configuration file example.

# maxmemory 8589934592
maxmemory 4294967296

maxmemory-policy volatile-ttl

Since then, memory usage has been much lower (as expected), but it's too early to tell yet whether this has resolved the problem.
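To double-check the limit actually took effect inside the container, a quick sketch:

docker-compose exec redis redis-cli CONFIG GET maxmemory
docker-compose exec redis redis-cli CONFIG GET maxmemory-policy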

I've also used the following to try and identify how many different sorts of keys there are, to see if that gives a clue:

redis-cli --scan | cut -c-1 | sort | uniq -c

      4 _
  69416 c
      1 o
      1 r
  16579 t

I'm not very familiar with Redis though, so I don't know if this is useful or not.
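In case a more granular breakdown helps anyone, grouping by the part of the key before the first colon gives more readable buckets - a sketch, assuming the keys are ':'-delimited (adjust the delimiter if yours are not):

# Count keys by their prefix up to the first ':' to see which family dominates.
docker-compose exec -T redis redis-cli --scan | cut -d: -f1 | sort | uniq -c | sort -rn | head -20

# Sample the keyspace for the biggest keys per type.
docker-compose exec -T redis redis-cli --bigkeys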

edhgoose avatar Nov 06 '22 18:11 edhgoose

Since we've turned on the memory limit, we see the graph of evictions has gone up significantly (from basically 0 to loads).

CleanShot 2022-11-07 at 12 25 36@2x

And I think this one illustrates what's going on - memory was increasing and increasing and nothing was ever being evicted, leading to eventual crashes.

CleanShot 2022-11-07 at 12 27 10@2x

My current working theory is then:

  • Something was previously causing Redis memory usage to increase and increase.
  • When Redis got more and more full, it started hitting swap, I guess (leading to the increase in disk activity?).
  • Which eventually triggered the oom-killer to kill something, which may or may not break the Sentry system.

By limiting memory we're increasing the number of evictions, and freeing up memory.
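For anyone without the monitoring scripts, the same eviction behaviour can be watched directly from Redis' own counters - a small sketch:

# Print the cumulative eviction and expiry counters every 30 seconds.
while true; do
  docker-compose exec -T redis redis-cli INFO stats | grep -E 'evicted_keys|expired_keys'
  sleep 30
done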

We haven't seen a crash since that was applied, but it's still early days.

Which again points to https://github.com/getsentry/self-hosted/issues/1787 being the problem.

edhgoose avatar Nov 07 '22 12:11 edhgoose

Really great debugging, thanks for digging into this. It does seem that Redis either isn't able to write to disk or otherwise dispatch incoming data, so it is keeping everything in memory until you fill your memory... then you swap out, thrash the disk, things crash - an all-around bad time.

Thinking about how to debug redis... I guess you could strace? Might be too noisy. Also maybe consider checking to see when services crash/experience errors. It could be that the worker image crashes for some reason and then redis can't do anything...
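If you did want to try strace, something like this would at least narrow it down to the disk-related syscalls - just a sketch, assuming you can attach to the redis-server process from the host (it may still be very noisy):

# Summarize disk-related syscalls made by redis-server; Ctrl-C prints the tally.
sudo strace -f -p "$(pgrep -o -f redis-server)" -e trace=write,fsync,fdatasync,openat -c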

emmatyping avatar Nov 07 '22 17:11 emmatyping

I can confirm that since making this change, we have had no crashes. Limiting the amount of memory has solved the problem.

I don't think we're having any trouble writing to disk - we're seeing successful writes continually in the logs. And I haven't noticed anything broken in Sentry - everything seems ok.

Here are Redis' logs for the last 10 minutes, for instance:

-------------------------------------------------------------------------------------------------
|   timestamp   |                                    message                                    |
|---------------|-------------------------------------------------------------------------------|
| 1667936128087 | 1:M 08 Nov 2022 19:35:28.086 * 10000 changes in 60 seconds. Saving...         |
| 1667936128207 | 1:M 08 Nov 2022 19:35:28.207 * Background saving started by pid 25942         |
| 1667936143415 | 25942:C 08 Nov 2022 19:35:43.414 * DB saved on disk                           |
| 1667936143557 | 25942:C 08 Nov 2022 19:35:43.557 * RDB: 11 MB of memory used by copy-on-write |
| 1667936143741 | 1:M 08 Nov 2022 19:35:43.741 * Background saving terminated with success      |
| 1667936204086 | 1:M 08 Nov 2022 19:36:44.085 * 10000 changes in 60 seconds. Saving...         |
| 1667936204175 | 1:M 08 Nov 2022 19:36:44.175 * Background saving started by pid 25955         |
| 1667936218537 | 25955:C 08 Nov 2022 19:36:58.537 * DB saved on disk                           |
| 1667936218684 | 25955:C 08 Nov 2022 19:36:58.684 * RDB: 9 MB of memory used by copy-on-write  |
| 1667936218870 | 1:M 08 Nov 2022 19:36:58.870 * Background saving terminated with success      |
| 1667936279058 | 1:M 08 Nov 2022 19:37:59.058 * 10000 changes in 60 seconds. Saving...         |
| 1667936279235 | 1:M 08 Nov 2022 19:37:59.235 * Background saving started by pid 25974         |
| 1667936295144 | 25974:C 08 Nov 2022 19:38:15.144 * DB saved on disk                           |
| 1667936295307 | 25974:C 08 Nov 2022 19:38:15.307 * RDB: 11 MB of memory used by copy-on-write |
| 1667936295603 | 1:M 08 Nov 2022 19:38:15.603 * Background saving terminated with success      |
| 1667936356062 | 1:M 08 Nov 2022 19:39:16.062 * 10000 changes in 60 seconds. Saving...         |
| 1667936356149 | 1:M 08 Nov 2022 19:39:16.149 * Background saving started by pid 25989         |
| 1667936370636 | 25989:C 08 Nov 2022 19:39:30.636 * DB saved on disk                           |
| 1667936370821 | 25989:C 08 Nov 2022 19:39:30.821 * RDB: 12 MB of memory used by copy-on-write |
| 1667936371064 | 1:M 08 Nov 2022 19:39:31.064 * Background saving terminated with success      |
| 1667936432060 | 1:M 08 Nov 2022 19:40:32.059 * 10000 changes in 60 seconds. Saving...         |
| 1667936432305 | 1:M 08 Nov 2022 19:40:32.305 * Background saving started by pid 26008         |
| 1667936447504 | 26008:C 08 Nov 2022 19:40:47.504 * DB saved on disk                           |
| 1667936447649 | 26008:C 08 Nov 2022 19:40:47.649 * RDB: 10 MB of memory used by copy-on-write |
| 1667936447860 | 1:M 08 Nov 2022 19:40:47.860 * Background saving terminated with success      |
| 1667936508000 | 1:M 08 Nov 2022 19:41:48.000 * 10000 changes in 60 seconds. Saving...         |
| 1667936508142 | 1:M 08 Nov 2022 19:41:48.142 * Background saving started by pid 26021         |
| 1667936522108 | 26021:C 08 Nov 2022 19:42:02.108 * DB saved on disk                           |
| 1667936522311 | 26021:C 08 Nov 2022 19:42:02.311 * RDB: 12 MB of memory used by copy-on-write |
| 1667936522517 | 1:M 08 Nov 2022 19:42:02.517 * Background saving terminated with success      |
| 1667936583049 | 1:M 08 Nov 2022 19:43:03.049 * 10000 changes in 60 seconds. Saving...         |
| 1667936583163 | 1:M 08 Nov 2022 19:43:03.163 * Background saving started by pid 26040         |
| 1667936598269 | 26040:C 08 Nov 2022 19:43:18.269 * DB saved on disk                           |
| 1667936598413 | 26040:C 08 Nov 2022 19:43:18.413 * RDB: 12 MB of memory used by copy-on-write |
| 1667936598689 | 1:M 08 Nov 2022 19:43:18.689 * Background saving terminated with success      |
| 1667936659022 | 1:M 08 Nov 2022 19:44:19.022 * 10000 changes in 60 seconds. Saving...         |
| 1667936659171 | 1:M 08 Nov 2022 19:44:19.171 * Background saving started by pid 26053         |
-------------------------------------------------------------------------------------------------

@ethanhs - does anything stand out in these logs that suggests there are problems writing to disk? Anything you'd expect to see which isn't there?

Now that we've got eviction happening, it feels more and more like the problem is the keys growing rather than anything wrong with Redis specifically.

@hubertdeng123 is there anything you can think of that would help with debugging? The urgency is off for me, but while it's all fresh in my mind I'd be happy to try more things.

edhgoose avatar Nov 08 '22 19:11 edhgoose

Yeah, I was originally thinking it was due to the

MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk

error you mentioned above, but I'm pretty sure that's a red herring. I don't see anything out of the ordinary in those logs.

emmatyping avatar Nov 08 '22 20:11 emmatyping

Do we have people here who actually know the data flow of Sentry? I've tried to figure it out, but I can't find which component is supposed to read the entries from Redis and delete them afterwards, so that we could see what changed in that component between 22.09 and 22.10. The number of keys in Redis didn't grow on 22.09; the problem started right after upgrading to 22.10. It can't be Redis itself, as we have it configured outside of the Sentry compose file and Redis didn't change at all. We also can't use an eviction policy, as we have important keys in the same Redis cluster.

sander85 avatar Nov 08 '22 20:11 sander85

I am having the same issue on my instance: Redis memory grew until it filled all the available memory, and some consumers plus ClickHouse were using a lot of CPU.

I had those errors in my Sentry Internal project: image

Things are getting better after configuring maxmemory and maxmemory-policy volatile-ttl as hinted above. I assume Redis started to delete old keys; its memory usage went down and the consumers are now able to process the jobs correctly.

We updated to 22.10.0 a few days ago and this never happened before, so it is most probably related to this release.

renchap avatar Nov 15 '22 09:11 renchap

This issue has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you label it Status: Backlog or Status: In Progress, I will leave it alone ... forever!


"A weed is but an unloved flower." ― Ella Wheeler Wilcox 🥀

github-actions[bot] avatar Dec 07 '22 00:12 github-actions[bot]

I would like to keep this open

edhgoose avatar Dec 07 '22 08:12 edhgoose

backlogging this so it won't be closed

hubertdeng123 avatar Dec 07 '22 17:12 hubertdeng123

@edhgoose I am a bit stuck on configuring Amazon SES with my Sentry instance.

I have made the related changes based on this link.

Just elaborating to be clear:

  1. Created an .env.custom file with an entry for SENTRY_MAIL_HOST=email-smtp.ap-south-1.amazonaws.com, and I use this env file to start my docker ecosystem.

  2. In the sentry/config.yml I have these entries

mail.backend: 'smtp'
mail.host: 'email-smtp.ap-south-1.amazonaws.com'
mail.port: 587
mail.username: 'myuser'
mail.password: 'mypwd'
mail.use-tls: false
mail.use-ssl: false
mail.from: '[email protected]'

  3. In the sentry/sentry.conf.py file, I have commented out the sentry options.

Even with all this, I still get Connection Refused in the worker container's logs.

I'd appreciate any help or pointers on this one, please.

swateek avatar Jan 09 '23 04:01 swateek

@swateek I don't know how to help - I think you want to raise a new issue. Your issue isn't related to this one about disk activity.

edhgoose avatar Jan 10 '23 07:01 edhgoose

Just for everyone else's reference, I have raised an issue #1901

swateek avatar Jan 10 '23 14:01 swateek

Facing the same issue after upgrading from 22.06 to 23.2.

alex4506 avatar Mar 02 '23 08:03 alex4506

We're seeing the same issue.

@edhgoose do you think we should update the issue title given that the underlying cause seems to be Redis and not the disk writes?

GhostLyrics avatar Mar 08 '23 14:03 GhostLyrics

@edhgoose do you think we should update the issue title given that the underlying cause seems to be Redis and not the disk writes?

Sure. I've gone for a merge of the two.

edhgoose avatar Mar 09 '23 08:03 edhgoose

Based on my research, it looks like Sentry started putting event data into Redis with a 24-hour expiration, which leads these events to fill Redis memory very quickly (because one event can be 100-200 KB of stringified JSON) - https://github.com/getsentry/self-hosted/issues/1787#issuecomment-1454899562

Some process (maybe the workers) forgets to DEL/UNLINK this data after GETting it from Redis and processing it.
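A quick way to spot-check that theory is to sample a few of those keys and look at their remaining TTLs and sizes - a sketch; the 'c*' pattern is only a guess based on the key-prefix counts earlier in this thread:

# Sample 20 keys starting with 'c' and print remaining TTL (seconds) and size (bytes).
docker-compose exec -T redis redis-cli --scan --pattern 'c*' | head -20 | while read -r key; do
  ttl=$(docker-compose exec -T redis redis-cli TTL "$key")
  size=$(docker-compose exec -T redis redis-cli MEMORY USAGE "$key")
  echo "$key ttl=${ttl}s size=${size}B"
done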

le0pard avatar Mar 13 '23 09:03 le0pard

I would also like to know how to set the memory eviction policy in Redis - and to be sure it doesn't cause issues for Sentry.

klemen-df avatar Oct 08 '24 07:10 klemen-df

It looks like https://github.com/getsentry/self-hosted/pull/3427 resolves this issue. It might still be a good idea to specify a maxmemory in the redis.conf.

https://github.com/getsentry/self-hosted/blob/b439c67e17120c9fdd20aec5250a6357f78bd6c0/redis.conf#L3-L6

robkorv avatar Jan 13 '25 11:01 robkorv