Connecting to cloud-sql using private-ip sometimes fails with a TLS handshake timeout
Bug Description
I am running v2.9.0/cloud-sql-proxy.linux.amd64 on a GCE n1-highmem-4 instance that is started through dataflow. The binary runs inside a u22 base image container.
As part of the container entry point, I run the following script:
CLOUD_SQL_PROXY_INSTANCES="<db-instance-name>"
/bin/cloud_sql_proxy --private-ip -u /cloudsql ${CLOUD_SQL_PROXY_INSTANCES} &
# Wait for all instances
for instance in ${CLOUD_SQL_PROXY_INSTANCES//,/ } ; do
while ! pg_isready -h "/cloudsql/${instance}" ; do
echo "Waiting for instance ${instance} to be online"
sleep 1
done
done
Occasionally, I will get the following error:
2024/04/21 01:34:43 [$DB-STRING] could not resolve instance version: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout
2024/04/21 01:34:43 The proxy has encountered a terminal error: unable to start: [$DB-STRING] Unable to mount socket: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout
And then the script gets stuck in an infinite loop, because cloud-sql-proxy quits instead of trying to connect again on the next attempt. Some of it is my fault, I should be using a process manager, but the timeout is unexpected.
The gce instance should not be throttling, and it runs in the same region as the cloud-sql instance. I do not know how to check if the other side of the auth-proxy is having issues. A lot of other instances are also connecting to is (mostly GAE), and I also see these similar issues on them occasionally.
PS: This is a new issue as a follow up for this comment I posted in a different issue: https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2081#issuecomment-2070406583
Steps to reproduce?
No easy reproduction steps, since this happens occasionally (~a few times a week).
Environment
- OS type and version: Ubuntu 22
- Cloud SQL Proxy version (
./cloud-sql-proxy --version):v2.9.0 - Proxy invocation command (for example,
./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME):/bin/cloud_sql_proxy --private-ip -u /cloudsql $INSTANCE &
Thanks @akshetpandey.
The root problem seems to be this:
Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout
The SQL Admin API call isn't responding and the Proxy dies. We recently added retry support for 50x responses here: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/pull/781. I wonder if we should extend that to include more generic TLS errors.
I ended up changing my script to check if the pid is still alive and to restart it if it is not. An internal retry will definitely address the issue too.
I do want to add that something seems fishy here. Not sure if its the container, dataflow, sql admin, network routing, dns, or something else, but the error happens way too frequently.
I don't have concrete data but the failure rate I am seeing implies that sqladmin.googleapis.com has an uptime <99% in this particular situation/setup.
Do note that this isn't the first request made in the flow. My script successfully hits the metadata server first and then this fails.
What kind of CPU usage do you have on this instance? Wondering if this is a client error.
n1-highmem-4, cpu usage at that point is pretty low.
Cpu usage is down to under 10% when we start cloud-sql-proxy. Let me know if you want more logs. I am not sure what is available, but it will be all gone in 20 days.
How many instances are you connecting to in your script?
Just the 1
I know we have seen some interesting behavior with dataflow in the past...
I'll try and see if I can get a long-running dataflow setup spun up to see if I can get a repro of the error.
Otherwise we can take a look at extending retries to TLS handshake errors as well.
Do note, I haven't run into this issue since upgrading to 2.11.3 in may.
Do note, I haven't run into this issue since upgrading to 2.11.3 in may.
In that case I will close this out for the time being. If I see anything come up in my test or if you see the issue again please re-open this and we will take another look 😄