
Prometheus Crash Looping on startup

ridersofrohan opened this issue 2 years ago • 5 comments

[Screenshot: Screen Shot 2023-04-17 at 11 16 03]

  • Related to https://github.com/prometheus/prometheus/issues/6934

I'm trying to change the resource limits, but since it is managed, I do not have the permissions to edit or modify anything in the gke-gmp-system namespace.

How can I reset the pod so that it wipes the WAL directory? There was an issue where it was scraping too many timeseries, which caused it to get into this crash-looping situation. I limited the number of timeseries, but now the collector is stuck and can't progress.

Any recommendations?

ridersofrohan avatar Apr 17 '23 15:04 ridersofrohan

Hey,

Unless I am wrong here, it seems the collector does NOT have any persistent storage at the moment (see https://github.com/GoogleCloudPlatform/prometheus-engine/blob/main/manifests/operator.yaml#L657), so I am surprised there IS anything to wipe that could cause a crash loop, unless you are using something custom here. Are you sure you are interpreting the root cause of the crash loop correctly? Maybe the high-cardinality scrape is still happening? What is the size of the node? (Unfortunately we have a static limit at the moment, so for extremely big nodes our limit might not be enough.)

If there were persistent storage (host dir or PV), the solutions would be:

  • Remove PV
  • Increase limit (which you tried)
  • Wipe the WAL manually. To do so, attach a debug container to the collector pod and delete the /prometheus/data/wal directory (or the whole /prometheus/data); a sketch follows below.
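
A minimal sketch of that last option, assuming ephemeral debug containers are available on the cluster, the Prometheus container in the collector pod is named prometheus, and the data lives under /prometheus/data as noted above (the pod name is a placeholder):

# Attach an ephemeral debug container that shares the prometheus
# container's process namespace (may need to run as root to read
# the target's /proc entries).
kubectl -n gke-gmp-system debug -it <collector-pod> \
    --image=busybox --target=prometheus -- sh

# Inside the debug shell: the debug container has its own filesystem,
# so reach the target's files via /proc/<pid>/root.
ps                                           # note the PID of the prometheus process
rm -rf /proc/<PID>/root/prometheus/data/wal  # remove the WAL only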

However, as I mentioned, something else might be happening, as there is no persistent storage for the collector as far as I know (: Hope that helps!

bwplotka avatar Apr 19 '23 20:04 bwplotka

@ridersofrohan - https://github.com/GoogleCloudPlatform/prometheus-engine/pull/469 may address the issue you are facing, but it is still under discussion. Just wanted to give you some visibility.

StevenYCChou avatar May 16 '23 15:05 StevenYCChou

I have a similar issue where I see collector pods crash looping on startup but then becoming stable. The logs suggest the config-reloader health check fails before the prometheus service has started, triggering a pod restart.

Logs:

config-reloader level=info ts=2023-07-20T06:19:59.058365954Z caller=main.go:82 msg="ensure ready-url is healthy"
config-reloader level=error ts=2023-07-20T06:19:59.560166348Z caller=main.go:91 msg="polling ready-url" err="Get \"http://localhost:19090/-/ready\": dial tcp 127.0.0.1:19090: connect: connection refused"
...
prometheus ts=2023-07-20T06:21:49.020Z caller=main.go:966 level=info msg="Server is ready to receive web requests."
...
config-reloader level=info ts=2023-07-20T06:21:55.501328058Z caller=main.go:95 msg="ready-url is healthy"
config-reloader level=info ts=2023-07-20T06:21:55.50151996Z caller=main.go:153 msg="Starting web server for metrics" listen=:19091

Related to https://github.com/GoogleCloudPlatform/prometheus-engine/issues/472
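
For anyone comparing against this, a sketch of how such restarts and per-container logs can be inspected. This is not the exact procedure used above; the namespace may be gmp-system or gke-gmp-system depending on the install, and the pod name is a placeholder:

# Restart counts show up in the RESTARTS column.
kubectl -n gke-gmp-system get pods

# Logs from the previous (crashed) instance of each container in the pod.
kubectl -n gke-gmp-system logs <collector-pod> -c config-reloader --previous
kubectl -n gke-gmp-system logs <collector-pod> -c prometheus --previous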

ego93 avatar Jul 20 '23 11:07 ego93

Hi @ego93 - yes this is a known issue with a tentative fix in https://github.com/GoogleCloudPlatform/prometheus-engine/pull/474

pintohutch avatar Jul 20 '23 12:07 pintohutch

The OOM due to WAL replay/potential WAL corruption was fixed in main (https://github.com/GoogleCloudPlatform/prometheus-engine/pull/947) and is being released as version 0.13.0. I will close this issue once that version is available on the latest GKE (or via manual install).

Note that it's still possible to trigger an OOM simply by adding an excessive number of targets/metrics to scrape on a single agent. The same version includes an optional VPA you can configure to avoid this if you have such cases.
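
For reference, a hedged sketch of enabling that, assuming the scaling.vpa.enabled field on the OperatorConfig resource described in the Managed Service for Prometheus docs, and the singleton config object in the gmp-public namespace; verify the exact field and namespace against the version you are running:

# Turn on vertical pod autoscaling for the collectors (assumed field name).
kubectl -n gmp-public patch operatorconfig/config --type merge \
    -p '{"scaling":{"vpa":{"enabled":true}}}'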

Feedback and reports about OOM problems are welcome, but in separate issues please 🤗 Thanks!

bwplotka avatar Jun 03 '24 10:06 bwplotka