TLS error after Google automated update causes downtime
Bug Description
On two MySQL 5.7 instances in separate projects, an "Update" operation was performed automatically at 7am EST today. This coincided with an error we started receiving at the same time:
Reading data from project:region:db-name had error: remote error: tls: bad certificate
Restarting our Pods resolved the issue. I understand that adding Kubernetes probes to reboot the container would likely have limited this to under a minute of downtime, and I'm implementing them now.
However, I'm curious whether there's any way to avoid this altogether.
Example code (or command)
# none
Stacktrace
Reading data from project:region:db-name had error: remote error: tls: bad certificate
How to reproduce
- Google Automated Upgrade
Environment
- OS type and version: Debian
- Cloud SQL Proxy version (`./cloud_sql_proxy -version`): 1.27 (Docker image `latest`)
Related: https://github.com/GoogleCloudPlatform/cloudsql-proxy/issues/400.
Also https://github.com/GoogleCloudPlatform/cloudsql-proxy/issues/980.
Out of curiosity, are you using client-side connection pooling or something similar to reconnect bad connections? We merged https://github.com/GoogleCloudPlatform/cloudsql-proxy/pull/817 thinking it would fix this problem.
@enocom I'm not 100% sure how to check that, but judging by the managing connections documentation, I think we aren't. We're using Ruby (Rails) and have the following setup:
```yaml
production:
  adapter: <%= ENV["DB_ADAPTER"] || 'mysql2' %>
  encoding: utf8
  host: <%= ENV['MYSQL_HOST'] %>
  username: <%= ENV['MYSQL_USERNAME'] %>
  password: <%= ENV['MYSQL_PASSWORD'] %>
  database: <%= ENV['MYSQL_DATABASE'] %>
  port: <%= ENV['MYSQL_PORT'] %>
  collation: utf8_general_ci
  pool: 10
```
This doesn't have the socket setting that the doc uses. The pool setting is also 2x higher than in the documentation; I'm not sure why someone set it to 10, and all I can see in our git history is that it was bumped from 2.
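If I'm reading the managing connections doc right, the socket-based variant would look roughly like the sketch below. This is just a sketch: the `/cloudsql/...` path is the conventional one from the docs, not something we actually have mounted in our setup.

```yaml
production:
  adapter: <%= ENV["DB_ADAPTER"] || 'mysql2' %>
  encoding: utf8
  # Connect over the proxy's Unix socket instead of TCP host/port.
  # The path below is the conventional Cloud SQL location (assumption).
  socket: "/cloudsql/<PROJECT>:<REGION>:<INSTANCE>"
  username: <%= ENV['MYSQL_USERNAME'] %>
  password: <%= ENV['MYSQL_PASSWORD'] %>
  database: <%= ENV['MYSQL_DATABASE'] %>
  collation: utf8_general_ci
  pool: 5
```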
It's been a while since I've worked with ActiveRecord and Rails, so I'm not certain, but it looks like other people have had similar problems. See https://github.com/rails/rails/issues/38324, for example, and a proposed workaround that goes hand in hand with Kubernetes probes.
I suspect what's happening here is that the connections in the pool (size 10) all go stale during the update operation. Rails doesn't reset any of the connections and continues trying to use bad connections. I imagine you're seeing the "bad certificate" error repeatedly in the proxy logs. Is that correct?
Since this appears to be an issue with how Rails handles connection pooling, I recommend adding a liveness healthcheck as you suggest. In the meantime, I'm going to downgrade this to a P2 and investigate further.
Appreciate the help! I added the probes for the proxy sidecar, and we already have similar probes on the Rails workload, which were failing and restarting the Rails container when this happened.
We are seeing the "bad certificate" error repeatedly as well.
The additional probes should be sufficient for us for now, though.
Here's what I tried:
- Started a simple Rails 6.1.4.1 app with a Cloud SQL MySQL database with RAILS_ENV set to production
- Confirmed that my database connection worked with the proxy running locally
- Stopped my database and confirmed the database connection was broken
- Started the database and confirmed I could connect with the `mysql` CLI and also the app without restarting the proxy.
In addition to stopping and starting the database, I also tried rotating the server CA cert and triggering a failover. In all cases, the app and the CLI managed to get a new connection without error. So it seems that Rails isn't holding on to stale connections. Likewise, at least in these cases, the proxy recovers without issue.
In https://github.com/GoogleCloudPlatform/cloudsql-proxy/pull/995, I added support for a liveness probe that checks the validity of all cached connections. If the probe fails, Kubernetes will restart the pod. Without steps to reproduce, I think it's best to add a liveness probe to cover this case and close this issue for now.
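For anyone wiring this up, here's a rough sketch of what the sidecar plus liveness probe could look like. The flag names, port, and path below reflect what that PR adds (`-use_http_health_check`, port 8090, `/liveness`), but please verify them against the release you're actually running:

```yaml
# Sketch only: verify the flag name, port, and endpoint against the proxy
# release you deploy; they're taken from that PR, not from a tested manifest.
containers:
  - name: cloud-sql-proxy
    image: gcr.io/cloudsql-docker/gce-proxy:1.28.0  # any release with the health check support
    command:
      - "/cloud_sql_proxy"
      - "-instances=project:region:db-name=tcp:3306"
      - "-use_http_health_check"    # serves /startup, /readiness, /liveness
      - "-health_check_port=8090"
    livenessProbe:
      httpGet:
        path: /liveness
        port: 8090
      periodSeconds: 10
      failureThreshold: 3
```

With this in place, if the cached connections go bad after a maintenance event, the liveness check fails and Kubernetes restarts the sidecar instead of leaving it serving "bad certificate" errors.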
If you are able to reproduce this, please re-open this and I'll dig in further.
On second thought, I'm a little disturbed that this error has popped up in the latest version. I'm going to re-open this.
I just saw this happen today as well, running gcr.io/cloudsql-docker/gce-proxy:1.26.0-buster@sha256:9c9ad91f5a353d71d0815cf9206ee320af7227919005972090f59a07405d2566.
Our application is written in Go.
The database (MySQL) was upgraded.
We saw the same errors in one of our ~60 pods and they lasted 7 hours before recovering on their own.
We suspect this issue will be resolved as part of the v2 release. I'm going to leave this open in case anyone else has additional reports.
Hi @vikstrous2 and @fbongiovanni29 - are you folks still seeing this issue? If so, can you try to provide some logs of when the proxy does this?
Our understanding is that the proxy should be performing a new refresh and the dial should work after that. You shouldn't be seeing any prolonged instances of this error in recent versions.
Now that v2 is in preview, I'll test to see if we can close this out.
This is what I tried with both the v1 and the v2 proxies:
- Start up the proxy pointing at my Postgres instance
- Connect an app that reads the time from the database using database/sql (a connection-pooled interface); a minimal sketch of that app follows this list
- Curl the app's endpoint every 1s
- Reset the instance's SSL configuration (docs).
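For reference, the app in step 2 was nothing fancy; a minimal version of it looks roughly like this (the pgx driver, DSN, and ports are stand-ins for whatever you point at the proxy):

```go
// Minimal sketch of the test app: serve the database's current time over HTTP,
// relying on database/sql's built-in connection pool. The pgx stdlib driver,
// credentials, and ports are illustrative assumptions.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"net/http"
	"time"

	_ "github.com/jackc/pgx/v4/stdlib" // Postgres driver registered as "pgx"
)

func main() {
	// The proxy is assumed to be listening locally on 5432.
	db, err := sql.Open("pgx", "host=127.0.0.1 port=5432 user=postgres password=secret dbname=postgres sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	db.SetMaxOpenConns(10)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		var now time.Time
		// Each request borrows a pooled connection; if that connection has gone
		// bad, the error surfaces here and the pool dials a fresh one next time.
		if err := db.QueryRowContext(r.Context(), "SELECT now()").Scan(&now); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, now.Format(time.RFC3339))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

With that running, curling localhost:8080 once per second (step 3) makes it easy to watch whether the pool recovers after the SSL reset.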
This is what I saw in both versions:
- The proxy loses a connection to the instance with a connection reset by peer error
- The app loses the connection
- The app's connection pool creates a new connection
- The proxy connects successfully with a valid SSL configuration.
This seems consistent with my earlier comment, so I'm going to close this and assume the issue has been fixed. Note: the v2 proxy is still in public preview (with a GA launch planned for early next year), but it's worth trying out as it's a more stable version of the v1 proxy.