cloud-sql-proxy Connecting to cloud-sql using private-ip sometimes fails with a TLS handshake timeout

Bug Description

I am running v2.9.0/cloud-sql-proxy.linux.amd64 on a GCE n1-highmem-4 instance that is started through dataflow. The binary runs inside a u22 base image container.

As part of the container entry point, I run the following script:

CLOUD_SQL_PROXY_INSTANCES="<db-instance-name>"

/bin/cloud_sql_proxy --private-ip -u /cloudsql ${CLOUD_SQL_PROXY_INSTANCES} &

# Wait for all instances
for instance in ${CLOUD_SQL_PROXY_INSTANCES//,/ } ; do
    while ! pg_isready -h "/cloudsql/${instance}" ; do
        echo "Waiting for instance ${instance} to be online"
        sleep 1
    done
done

Occasionally, I will get the following error:

2024/04/21 01:34:43 [$DB-STRING] could not resolve instance version: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

2024/04/21 01:34:43 The proxy has encountered a terminal error: unable to start: [$DB-STRING] Unable to mount socket: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

And then the script gets stuck in an infinite loop, because cloud-sql-proxy quits instead of trying to connect again on the next attempt. Some of it is my fault, I should be using a process manager, but the timeout is unexpected.

The gce instance should not be throttling, and it runs in the same region as the cloud-sql instance. I do not know how to check if the other side of the auth-proxy is having issues. A lot of other instances are also connecting to is (mostly GAE), and I also see these similar issues on them occasionally.

PS: This is a new issue as a follow up for this comment I posted in a different issue: https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2081#issuecomment-2070406583

Steps to reproduce?

No easy reproduction steps, since this happens occasionally (~a few times a week).

Environment

OS type and version: Ubuntu 22
Cloud SQL Proxy version (./cloud-sql-proxy --version): v2.9.0
Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME): /bin/cloud_sql_proxy --private-ip -u /cloudsql $INSTANCE &

May 01 '24 21:05 akshetpandey

Thanks @akshetpandey.

The root problem seems to be this:

Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

The SQL Admin API call isn't responding and the Proxy dies. We recently added retry support for 50x responses here: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/pull/781. I wonder if we should extend that to include more generic TLS errors.

May 03 '24 18:05 enocom

I ended up changing my script to check if the pid is still alive and to restart it if it is not. An internal retry will definitely address the issue too.

I do want to add that something seems fishy here. Not sure if its the container, dataflow, sql admin, network routing, dns, or something else, but the error happens way too frequently.

I don't have concrete data but the failure rate I am seeing implies that sqladmin.googleapis.com has an uptime <99% in this particular situation/setup.

Do note that this isn't the first request made in the flow. My script successfully hits the metadata server first and then this fails.

May 03 '24 22:05 akshetpandey

What kind of CPU usage do you have on this instance? Wondering if this is a client error.

May 06 '24 16:05 enocom

n1-highmem-4, cpu usage at that point is pretty low.

May 06 '24 16:05 akshetpandey

Screenshot 2024-05-06 at 1 48 12 PM Screenshot 2024-05-06 at 1 49 47 PM

Cpu usage is down to under 10% when we start cloud-sql-proxy. Let me know if you want more logs. I am not sure what is available, but it will be all gone in 20 days.

May 06 '24 20:05 akshetpandey

How many instances are you connecting to in your script?

May 08 '24 15:05 enocom

Just the 1

May 11 '24 19:05 akshetpandey

I know we have seen some interesting behavior with dataflow in the past...

I'll try and see if I can get a long-running dataflow setup spun up to see if I can get a repro of the error.

Otherwise we can take a look at extending retries to TLS handshake errors as well.

Aug 08 '24 13:08 jackwotherspoon

Do note, I haven't run into this issue since upgrading to 2.11.3 in may.

Aug 09 '24 06:08 akshetpandey

Do note, I haven't run into this issue since upgrading to 2.11.3 in may.

In that case I will close this out for the time being. If I see anything come up in my test or if you see the issue again please re-open this and we will take another look 😄

Aug 09 '24 15:08 jackwotherspoon