Intermittent "connection aborted - error reading from instance" errors with auth proxy as a sidecar on Cloud Run
Bug Description
I have a Cloud Run service running with a Cloud SQL Auth Proxy sidecar to connect to a set of Cloud SQL instances (currently 5 of them). Several instances of the service can coexist at any given time. Sometimes, and with increasing frequency (it used to be once a month or so, and recently it has been several times a week), all the connections to Cloud SQL in one instance error out with the following error logs:
'[project_id:europe-west1:instance_2] connection aborted - error reading from instance: read tcp 169.254.8.1:60699->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_2] IO Error on Read or Write: read tcp 169.254.8.1:60699->{instance_ip}:3307: read: connection reset by peer
It always happens on all connected instances at the same time, for one given instance of the proxy. As far as we have been able to observe, there is no visible correlation between this issue occurring and any sort of high load on the Cloud Run service or on the databases it connects to.
Example code (or command)
Intermittent error that does not seem related to any particular lines of code (see below for proxy options).
Stacktrace
'[project_id:europe-west1:instance_2] connection aborted - error reading from instance: read tcp 169.254.8.1:60699->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_2] IO Error on Read or Write: read tcp 169.254.8.1:60699->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_4] connection aborted - error reading from instance: read tcp 169.254.8.1:58109->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_4] IO Error on Read or Write: read tcp 169.254.8.1:58109->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_3] connection aborted - error reading from instance: read tcp 169.254.8.1:33878->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_3] IO Error on Read or Write: read tcp 169.254.8.1:33878->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_1] connection aborted - error reading from instance: read tcp 169.254.8.1:44952->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_1] IO Error on Read or Write: read tcp 169.254.8.1:44952->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_4] connection aborted - error reading from instance: read tcp 169.254.8.1:29766->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_3] connection aborted - error reading from instance: read tcp 169.254.8.1:60901->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_1] connection aborted - error reading from instance: read tcp 169.254.8.1:53263->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_4] IO Error on Read or Write: read tcp 169.254.8.1:29766->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_3] IO Error on Read or Write: read tcp 169.254.8.1:60901->{instance_ip}:3307: read: connection reset by peer
'[project_id:europe-west1:instance_1] IO Error on Read or Write: read tcp 169.254.8.1:53263->{instance_ip}:3307: read: connection reset by peer
Steps to reproduce?
I don't really trigger the bug, it just happens sometimes. The frequency seems to be increasing recently.
Environment
- OS type and version: Docker container on Cloud Run
- The sidecar container so far had 500m vCPU allocated (half a vCPU) - I changed it to 1 full vCPU today, waiting to see if the issue occurs again.
- Cloud SQL Proxy version : gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.16.0
- Proxy invocation command :
args = [
"--unix-socket=/cloudsql",
"--structured-logs",
"--health-check",
"--http-address=0.0.0.0",
"--max-sigterm-delay=10s", // wait 10sec max before closing all connections when the container receives SIGTERM. Should be longer than the condition applied in the client code, if any.
"--debug-logs",
"--lazy-refresh",
]
"--lazy-refresh" was added recently to see if it fixes the issue, to no avail.
Additional Details
No response
Hi, Just wanted to post a quick update on this: our latest attempted fix, allocating 1 full vCPU to the sidecar container, did not resolve the issue, which happened again for us tonight.
Hi @Pascal-Delange,
Is this issue causing any interruptions in active database connections from the application, and causing errors in the application?
Hi @kgala2, Sorry, I didn't get the reply notification. Yes, this is frequently interrupting active connections, resulting in requests erroring out.
Hi @Pascal-Delange,
When you see "connection aborted - error reading from instance", it means the Cloud SQL Proxy was able to establish a connection, but the connection was then dropped because of a network failure. We recommend either:
- Handling the network failure directly within your application and re-establishing the connection when an error occurs.
- Raising a support case and linking this GitHub issue. This will give us access to your instance logs to investigate further.
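As a rough illustration of the first option, here is a minimal Python sketch of a reconnect-and-retry wrapper. The `run_with_reconnect` helper and its `connect` and `operation` parameters are hypothetical placeholders for the application's own driver calls, not part of the proxy or of any database library. It assumes the dropped connection surfaces as a built-in `ConnectionError`; real drivers usually raise their own exception classes, so the `except` clause would need adapting.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_reconnect(
    connect: Callable[[], object],
    operation: Callable[[object], T],
    max_attempts: int = 3,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(conn); on a connection-level error, reconnect and retry.

    `connect` and `operation` are hypothetical callables supplied by the
    application (for example, opening a connection through the proxy's Unix
    socket and running one query).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            # Open a fresh connection on every attempt so that a connection
            # reset by the peer is never reused.
            conn = connect()
            return operation(conn)
        except ConnectionError as exc:
            # Assumption: the network failure is raised as ConnectionError
            # (ConnectionResetError is one of its subclasses).
            last_error = exc
            if attempt < max_attempts:
                sleep(delay_seconds)
    # All attempts failed: surface the last connection error to the caller.
    raise last_error
```

Only operations that are safe to repeat should be wrapped this way, since a write may have been applied on the server before the connection was reset.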
Hi, Yes, that's what I thought, since it seems to drop all active connections to different instances at the same time. Handling the error is tricky because in the application layer it's hard to distinguish these errors from the one we would get if we simply failed to create a new connection. In any case, if there is a pre-existing network error, I'm afraid the stampede from all connections retrying would not be helpful?
If you think it's indeed caused by the network infrastructure, I'll open a ticket the next time this happens.
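One common way to soften the retry stampede mentioned above is exponential backoff with full jitter, where each client waits a random time below a growing ceiling so that retries spread out instead of arriving together. The sketch below is a general technique, not something the proxy provides; `backoff_delays` and its parameter values are hypothetical and would need tuning for a real service.

```python
import random
from typing import Callable, Iterator


def backoff_delays(
    attempts: int,
    base: float = 0.2,
    cap: float = 10.0,
    rng: Callable[[], float] = random.random,
) -> Iterator[float]:
    """Yield one randomized delay (in seconds) per retry attempt.

    The ceiling doubles on each attempt, starting at `base` and limited to
    `cap`. The actual delay is drawn uniformly between 0 and that ceiling
    ("full jitter"), so clients that failed at the same moment do not all
    retry at the same moment.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


if __name__ == "__main__":
    # Print an example schedule for five retries; values differ on every run.
    for delay in backoff_delays(5):
        print(f"wait {delay:.3f}s before the next reconnect attempt")
```

This does not help distinguish a dropped connection from a failed new connection, but it limits how hard all clients hit the databases while the network recovers.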
I'm going to close this for now. Please feel free to reopen it if you have more to add.