TLS error after Google automated update causes downtime

Open fbongiovanni29 opened this issue 4 years ago • 11 comments

Bug Description

On two MySQL 5.7 instances in separate projects, an "Update" operation was performed automatically at 7 AM EST today. It coincided with an error we started receiving at the same time:

Reading data from project:region:db-name had error: remote error: tls: bad certificate

Restarting our Pods resolved the issue. I understand that adding Kubernetes probes to restart the container would likely have kept this to under a minute of downtime, and I am implementing them now.
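For anyone curious, the probe I'm adding to the proxy sidecar looks roughly like this (a minimal sketch with assumed port and thresholds; note that a plain TCP check only confirms the proxy's listener is up, not that TLS to the instance still works, so it may not catch this exact failure):

# Liveness probe on the proxy sidecar (hypothetical values)
livenessProbe:
  tcpSocket:
    port: 3306        # local port cloud_sql_proxy listens on
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3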

However, I'm curious whether there's any way to avoid this altogether.

Example code (or command)

# none

Stacktrace

Reading data from project:region:db-name had error: remote error: tls: bad certificate

How to reproduce

  1. A Google automated update is performed on the instance

Environment

  1. OS type and version: Debian
  2. Cloud SQL Proxy version (./cloud_sql_proxy -version): 1.27 (Docker image latest)

fbongiovanni29 avatar Nov 09 '21 19:11 fbongiovanni29

Related: https://github.com/GoogleCloudPlatform/cloudsql-proxy/issues/400.

enocom avatar Nov 09 '21 19:11 enocom

Also https://github.com/GoogleCloudPlatform/cloudsql-proxy/issues/980.

enocom avatar Nov 09 '21 19:11 enocom

Out of curiosity, are you using client-side connection pooling or something similar to reconnect bad connections? We merged https://github.com/GoogleCloudPlatform/cloudsql-proxy/pull/817 expecting it to fix this problem.

enocom avatar Nov 09 '21 19:11 enocom

@enocom I'm not 100% sure how to check that, but judging from the managing connections documentation, I think we aren't. We're using Ruby (Rails) and have the following setup:

production:
  adapter:  <%= ENV["DB_ADAPTER"] || 'mysql2' %>
  encoding: utf8
  host:     <%= ENV['MYSQL_HOST'] %>
  username: <%= ENV['MYSQL_USERNAME'] %>
  password: <%= ENV['MYSQL_PASSWORD'] %>
  database: <%= ENV['MYSQL_DATABASE'] %>
  port:     <%= ENV['MYSQL_PORT'] %>
  collation: utf8_general_ci
  pool: 10

This doesn't set the socket option the way the doc does. The pool setting is also twice as high as in the documentation; I'm not sure why someone set it to 10. All I can see in our git history is that it was bumped from 2.
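For comparison, the socket-based variant from that doc looks roughly like this (a sketch from memory; the instance connection name is a placeholder):

production:
  adapter: mysql2
  encoding: utf8
  username: <%= ENV['MYSQL_USERNAME'] %>
  password: <%= ENV['MYSQL_PASSWORD'] %>
  database: <%= ENV['MYSQL_DATABASE'] %>
  socket: "/cloudsql/project:region:db-name"  # Unix socket created by the proxy
  pool: 5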

fbongiovanni29 avatar Nov 09 '21 19:11 fbongiovanni29

It's been a while since I've worked with ActiveRecord and Rails, so I'm not certain, but it looks like other people have had similar problems. See https://github.com/rails/rails/issues/38324 for an example and a proposed workaround, which goes hand-in-hand with Kubernetes probes.

I suspect what's happening here is that all the connections in the pool (size 10) go stale during the update operation. Rails doesn't reset any of them and keeps trying to use the bad connections. I imagine you're seeing the "bad certificate" error repeatedly in the proxy logs. Is that correct?

Since this appears to be an issue with how Rails handles connection pooling, I recommend adding a liveness healthcheck as you suggest. In the meantime, I'm going to downgrade this to a P2 and investigate further.

enocom avatar Nov 09 '21 22:11 enocom

Appreciate the help! I added the probes for the proxy sidecar, and we already had similar probes from the Rails workaround, which were failing and restarting the Rails container when this happened.

We are seeing the "bad certificate" error repeatedly as well.

The additional probes should be sufficient for us for now, though.

fbongiovanni29 avatar Nov 10 '21 15:11 fbongiovanni29

Here's what I tried:

  1. Started a simple Rails 6.1.4.1 app with a Cloud SQL MySQL database with RAILS_ENV set to production
  2. Confirmed that my database connection worked with the proxy running locally
  3. Stopped my database and confirmed the database connection was broken
  4. Started the database and confirmed I could connect with the mysql CLI and also the app without restarting the proxy.

In addition to stopping and starting the database, I also tried rotating the server CA cert and triggering a failover. In all cases, the app and the CLI managed to get a new connection without error. So it seems that Rails isn't holding on to stale connections. Likewise, at least in these cases, the proxy recovers without issue.

In https://github.com/GoogleCloudPlatform/cloudsql-proxy/pull/995, I added support for a liveness probe that checks the validity of all cached connections. If the probe fails, Kubernetes will restart the pod. Without steps to reproduce, I think it's best to add a liveness probe to cover this case and close this issue for now.
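A minimal sketch of wiring that up on the sidecar (the flag, port, and path here are as added in that PR; double-check the README for the release you're running, and treat the image tag as a placeholder):

containers:
  - name: cloud-sql-proxy
    image: gcr.io/cloudsql-docker/gce-proxy:<release-with-995>  # placeholder tag
    command:
      - "/cloud_sql_proxy"
      - "-instances=project:region:db-name=tcp:3306"
      - "-use_http_health_check"  # serves /startup, /readiness, /liveness on :8090
    livenessProbe:
      httpGet:
        path: /liveness
        port: 8090
      periodSeconds: 10
      failureThreshold: 3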

If you are able to reproduce this, please re-open this and I'll dig in further.

enocom avatar Nov 10 '21 22:11 enocom

On second thought, I'm a little disturbed that this error has popped up in the latest version. I'm going to re-open this.

enocom avatar Nov 11 '21 02:11 enocom

I just saw this happen today as well, running gcr.io/cloudsql-docker/gce-proxy:1.26.0-buster@sha256:9c9ad91f5a353d71d0815cf9206ee320af7227919005972090f59a07405d2566. Our application is written in Go. The database (MySQL) was upgraded. We saw the same errors in one of our ~60 pods, and they lasted seven hours before recovering on their own.

vikstrous2 avatar Nov 11 '21 15:11 vikstrous2

We suspect this issue will be resolved as part of the v2 release. I'm going to leave this open in case anyone else has additional reports.

enocom avatar Dec 02 '21 17:12 enocom

Hi @vikstrous2 and @fbongiovanni29 - are you folks still seeing this issue? If so, can you provide some logs from when the proxy does this?

Our understanding is that the proxy should perform a new refresh, and the dial should work after that. You shouldn't be seeing any prolonged instances of this error in recent versions.

kurtisvg avatar Mar 07 '22 17:03 kurtisvg

Now that v2 is in preview, I'll test to see if we can close this out.

enocom avatar Oct 11 '22 16:10 enocom

This is what I tried with both the v1 and the v2 proxies:

  1. Start up the proxy pointing at my Postgres instance
  2. Connect an app that reads the time from the database using database/sql (a connection-pooled interface)
  3. Curl the app's endpoint every 1s
  4. Reset the instance's SSL configuration (e.g., with gcloud sql instances reset-ssl-config).

This is what I saw in both versions:

  1. The proxy loses a connection to the instance with a "connection reset by peer" error
  2. The app loses the connection
  3. The app's connection pool creates a new connection
  4. The proxy connects successfully with a valid SSL configuration.

This seems consistent with my earlier comment, so I'm going to close this and assume the issue has been fixed. Note: the v2 proxy is still in public preview (with a GA launch planned for early next year), but it's worth trying out, as it's more stable than the v1 proxy.
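If you want to try it, the v2 invocation differs from v1; here's a rough sidecar sketch (new image repo and flag syntax as of the preview; check the v2 README, and the tag is a placeholder):

  - name: cloud-sql-proxy
    image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.0.0-preview  # placeholder tag
    args:
      - "--port=3306"             # local port to listen on
      - "project:region:db-name"  # instance connection name is now a positional arg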

enocom avatar Oct 24 '22 18:10 enocom