
Increase connection limit for od2-prod/db and add connection monitoring

Open decimalator opened this issue 10 months ago • 0 comments

Descriptive summary

The od2-prod/web Pods held connections to od2-prod/db open until od2-prod/db ran out of available connection slots. As a result, the frontend returned 404 and 50x errors.

Diagnosing

Once every connection slot is in use, Postgres rejects each new connection attempt and logs a FATAL error:

2025-03-14 17:18:47.859 UTC [622] FATAL: sorry, too many clients already
2025-03-14 17:18:48.658 UTC [623] FATAL: sorry, too many clients already
2025-03-14 17:18:48.949 UTC [624] FATAL: sorry, too many clients already
2025-03-14 17:18:51.369 UTC [625] FATAL: sorry, too many clients already
2025-03-14 17:18:52.044 UTC [626] FATAL: sorry, too many clients already

Running netstat on the db-0 Pod shows how many connections are open and where they come from. The command below sorts them by client address (IP, then source port). Compare the client IPs with the IPs of the Pods in od2-prod to identify which Pods hold the connections.

kubectl -n od2-prod exec -it db-0 -- netstat -an | grep 5432 | sort -k 5
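To see at a glance which client holds the most connections, the netstat output can be collapsed into a per-IP count. This is a minimal sketch, assuming Linux-style netstat columns (local address in column 4, remote address in column 5, state in column 6). `count_db_clients` is a hypothetical helper name, not something that exists in this repo.

```shell
# Read netstat output on stdin and print "<count> <client-ip>" lines,
# busiest client first. Only established connections to local port 5432
# are counted; LISTEN sockets and other ports are ignored.
count_db_clients() {
  awk '$4 ~ /:5432$/ && $6 == "ESTABLISHED" {
         sub(/:[0-9]+$/, "", $5)   # strip the source port, keep the IP
         print $5
       }' \
    | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Example (no -it, so the piped output has no TTY carriage returns):
#   kubectl -n od2-prod exec db-0 -- netstat -an | count_db_clients
```

The resulting IPs can then be matched against `kubectl -n od2-prod get pods -o wide`.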

Remediation

While it won't solve the underlying cause, restarting the web/web-admin Deployments will force Rails to re-establish database connections and should bring them back to a baseline.

Both deployments are configured to use Rolling Updates, so these restarts won't cause any downtime.

kubectl -n od2-prod rollout restart deploy/web
kubectl -n od2-prod rollout restart deploy/web-admin
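To confirm that each rolling restart actually finished, the restart can be followed by `kubectl rollout status`, which blocks until the rollout completes or the timeout expires. This is a sketch; the `restart_and_wait` helper name and the 5-minute timeout are assumptions, not existing tooling.

```shell
# Restart each named Deployment in a namespace and wait for its rollout.
# Usage: restart_and_wait <namespace> <deployment>...
restart_and_wait() {
  ns="$1"
  shift
  for name in "$@"; do
    kubectl -n "$ns" rollout restart "deploy/$name" || return 1
    # Blocks until the rollout finishes; exits non-zero on timeout.
    kubectl -n "$ns" rollout status "deploy/$name" --timeout=5m || return 1
  done
}

# Example:
#   restart_and_wait od2-prod web web-admin
```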

Tasks

  • [ ] Increase the Postgres connection limit

  • [ ] Add PostgreSQL exporter monitoring for od2-prod/db

  • [ ] Add an alert to warn us when we're nearing the limit for connections
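Until the exporter and alert are in place, connection usage can be spot-checked by hand. This is a rough sketch, assuming `psql` is available in the db-0 container and can connect as the `postgres` user (both are assumptions about this deployment). `near_connection_limit` is a hypothetical helper and the 80% default threshold is arbitrary. Counting all rows in `pg_stat_activity` also includes background processes on recent Postgres versions, so the figure slightly overstates client usage.

```shell
# Succeeds (exit 0) when used connections are at or above pct% of the max.
# Usage: near_connection_limit <used> <max> [pct]   (pct defaults to 80)
near_connection_limit() {
  used="$1"
  max="$2"
  pct="${3:-80}"
  [ $((used * 100)) -ge $((max * pct)) ]
}

# Example (assumes psql works inside db-0 as the postgres user):
#   used=$(kubectl -n od2-prod exec db-0 -- psql -U postgres -Atc "SELECT count(*) FROM pg_stat_activity")
#   max=$(kubectl -n od2-prod exec db-0 -- psql -U postgres -Atc "SHOW max_connections")
#   near_connection_limit "$used" "$max" && echo "WARNING: $used of $max connections in use"
```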

decimalator, Mar 14 '25 18:03