Increase connection limit for od2-prod/db and add connection monitoring
Descriptive summary
We ran into an issue where the od2-prod/web Pods kept connections open to od2-prod/db until the database ran out of available connection slots. This resulted in 404 and 50x errors in the frontend.
Diagnosing
Postgres logs a FATAL error each time a client tries to connect while no connection slots are available:
2025-03-14 17:18:47.859 UTC [622] FATAL: sorry, too many clients already
2025-03-14 17:18:48.658 UTC [623] FATAL: sorry, too many clients already
2025-03-14 17:18:48.949 UTC [624] FATAL: sorry, too many clients already
2025-03-14 17:18:51.369 UTC [625] FATAL: sorry, too many clients already
2025-03-14 17:18:52.044 UTC [626] FATAL: sorry, too many clients already
Running netstat on the db-0 Pod shows how many connections are open and where they come from. The command below sorts them by client IP address and source port number. To find the source of the connections, compare the client IPs with the Pod IPs in od2-prod (for example, from kubectl -n od2-prod get pods -o wide).
kubectl -n od2-prod exec -it db-0 -- netstat -an | grep 5432 | sort -k 5
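To see at a glance which clients hold the most connections, the same netstat output can be summarised per client IP. This is a minimal sketch, assuming the netstat in the db-0 image prints the usual column layout (local address in column 4, foreign address in column 5, state in column 6); count_pg_clients is a hypothetical helper name, not something that exists in the repo.

```shell
# Hypothetical helper: reads `netstat -an` output on stdin and prints
# "<count> <client-ip>" for established connections to local port 5432,
# busiest client first.
count_pg_clients() {
  awk '$4 ~ /:5432$/ && $6 == "ESTABLISHED" {
         sub(/:[0-9]+$/, "", $5)   # strip the source port, keep the client IP
         print $5
       }' | sort | uniq -c | sort -rn
}

# Usage (run without -t so the piped output carries no TTY carriage returns):
# kubectl -n od2-prod exec db-0 -- netstat -an | count_pg_clients
```

Filtering on the local address column avoids counting outbound connections that merely contain "5432" somewhere in the line, which a plain grep would include.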
Remediation
While it won't solve the underlying cause, restarting the web/web-admin Deployments will force Rails to re-establish its database connections and should bring the connection count back to a baseline.
Both deployments are configured to use Rolling Updates, so these restarts won't cause any downtime.
kubectl -n od2-prod rollout restart deploy/web
kubectl -n od2-prod rollout restart deploy/web-admin
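If it helps to confirm that each restart has finished before re-checking the connection count, the restart can be paired with kubectl rollout status. A sketch only: restart_and_wait and the KUBECTL override are hypothetical names introduced here, and the 5 minute timeout is an arbitrary assumption.

```shell
# Hypothetical helper: restart a Deployment in od2-prod, then block until the
# rollout completes or the (assumed) 5 minute timeout expires.
# KUBECTL can be pointed at a stub for dry runs; it defaults to kubectl.
restart_and_wait() {
  deploy="$1"
  ${KUBECTL:-kubectl} -n od2-prod rollout restart "deploy/${deploy}" &&
    ${KUBECTL:-kubectl} -n od2-prod rollout status "deploy/${deploy}" --timeout=5m
}

# Usage:
# restart_and_wait web && restart_and_wait web-admin
```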
Tasks
- [ ] Increase the Postgres connection limit
- [ ] Add PostgreSQL exporter monitoring for od2-prod/db
- [ ] Add an alert to warn us when we're nearing the limit for connections
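Until exporter-based alerting is in place, the "nearing the limit" condition can be approximated by comparing the number of client backends with max_connections. This is a rough sketch, not the planned alert: conn_usage_pct and check_conn_usage are hypothetical helpers, the 80% default threshold is an assumption, and the commented psql calls assume PostgreSQL 10 or later (for the backend_type column) and a role that can read pg_stat_activity.

```shell
# Hypothetical helper: integer percentage of connection slots in use.
conn_usage_pct() {
  used="$1"; max="$2"
  echo $(( used * 100 / max ))
}

# Hypothetical check: succeeds while usage is below the threshold
# (third argument, assumed default 80), fails once it is reached.
check_conn_usage() {
  pct=$(conn_usage_pct "$1" "$2")
  [ "$pct" -lt "${3:-80}" ]
}

# Usage, assuming psql can already reach od2-prod/db:
# used=$(psql -At -c "SELECT count(*) FROM pg_stat_activity WHERE backend_type = 'client backend'")
# max=$(psql -At -c "SHOW max_connections")
# check_conn_usage "$used" "$max" || echo "od2-prod/db connection usage is at or above the threshold"
```

Alerting on a percentage rather than an absolute count means the check keeps working after the connection limit is raised.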