
"failed to connect to instance: failed to get instance: Refresh error: failed to get instance metadata..." intermittent connection errors

Open aronsoderling opened this issue 1 year ago • 7 comments

Bug Description

Running cloud-sql-proxy as a sidecar container in GKE with CSQL_PROXY_AUTO_IAM_AUTHN=true and CSQL_PROXY_CREDENTIALS_FILE=PATH_TO_CREDENTIALS intermittently fails to connect to the database instance with the following error message:

failed to connect to instance: failed to get instance: Refresh error: failed to get instance metadata (connection name = "INSTANCE_CONNECTION_NAME"): googleapi: Error 401: Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.

With ACCESS_TOKEN_EXPIRED as reason in the error details (see stacktrace).

Subsequent connection attempts succeed.

This doesn't happen very often; we're seeing roughly 1 error per 10,000 requests in our system over a month, though I can't say how many connection attempts those requests correspond to. We've been seeing these errors randomly for quite some time, at least for the past three months.
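Since subsequent attempts succeed, the failure is transient, so an application-side retry with backoff can mask it while the root cause is investigated. A minimal sketch (not part of the proxy itself; the helper name, attempt count, and wrapped command are illustrative):

```shell
# retry_with_backoff: run a command up to N times, doubling the wait
# between attempts. Returns 0 on the first success, 1 if all attempts fail.
retry_with_backoff() {
  local attempts=$1; shift
  local delay=1
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
  done
  return 1
}

# Illustrative usage: wrap whatever opens the database connection
# through the proxy, e.g. a psql health query.
# retry_with_backoff 3 psql "host=127.0.0.1 port=5432" -c 'SELECT 1'
```

This only papers over the intermittent 401 at connection time; it does not fix the underlying token problem.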

Example code (or command)

Configuration looks like this:

CSQL_PROXY_PORT=5432
CSQL_PROXY_AUTO_IAM_AUTHN=true
CSQL_PROXY_CREDENTIALS_FILE=PATH_TO_CREDENTIALS
CSQL_PROXY_QUIET=true
CSQL_PROXY_HEALTH_CHECK=true
CSQL_PROXY_HTTP_PORT=9801
CSQL_PROXY_HTTP_ADDRESS=0.0.0.0
CSQL_PROXY_EXIT_ZERO_ON_SIGTERM=true
CSQL_PROXY_STRUCTURED_LOGS=true
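For reference, each `CSQL_PROXY_*` environment variable maps to the v2 proxy flag of the same name, so the configuration above should be equivalent to the following invocation (a sketch; `PATH_TO_CREDENTIALS` and `INSTANCE_CONNECTION_NAME` are placeholders from the report):

```shell
# Flag-based equivalent of the env-var configuration above
# (mapping assumed: CSQL_PROXY_<NAME> -> --<name>).
cloud-sql-proxy \
  --port 5432 \
  --auto-iam-authn \
  --credentials-file PATH_TO_CREDENTIALS \
  --quiet \
  --health-check \
  --http-port 9801 \
  --http-address 0.0.0.0 \
  --exit-zero-on-sigterm \
  --structured-logs \
  INSTANCE_CONNECTION_NAME
```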

Stacktrace

failed to connect to instance: failed to get instance: Refresh error: failed to get instance metadata (connection name = "INSTANCE_CONNECTION_NAME"): googleapi: Error 401: Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.

Details:
[
  {
    "@type": "type.googleapis.com/google.rpc.ErrorInfo",
    "domain": "googleapis.com",
    "metadata": {
      "method": "google.cloud.sql.v1beta4.SqlConnectService.GetConnectSettings",
      "service": "sqladmin.googleapis.com"
    },
    "reason": "ACCESS_TOKEN_EXPIRED"
  }
]
{}
More details:
Reason: authError, Message: Invalid Credentials

Steps to reproduce?

I haven't been able to reproduce this locally.

Environment

  1. Docker
  2. gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.13.0
  3. cloud-sql-proxy INSTANCE_CONNECTION_NAME

Additional Details

I found this issue https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2212 which sounds similar.

aronsoderling avatar Jan 15 '25 15:01 aronsoderling

Hi @aronsoderling,

Thanks for the report. This is a very strange intermittent issue, but it fits with a pattern of issues we have noticed. When the proxy is configured to use IAM Auth, it needs to request an extra auth token. We have observed that in GKE, sometimes the Cloud SQL Auth Proxy (and the Cloud SQL Java Connector) will attempt to use an invalid auth token. We have been unable to reliably reproduce this either.

We are attempting to find the root cause. I am going to leave this issue open in case other members of the community experience something similar.

hessjcg avatar Jan 16 '25 02:01 hessjcg

Thanks for acknowledging this, @hessjcg. Let me know if there's any way I can help out.

aronsoderling avatar Jan 16 '25 08:01 aronsoderling

@aronsoderling Couple small updates from our end.

I just wanted to point out that in last week's release, v2.14.3, we bumped the Cloud SQL Go Connector dependency to a new version that switched from the older oauth2 Go package to the newer auth package.

I'm somewhat optimistic that this may help reduce some of these auth-token-related intermittent issues.

I will also reach out internally to the GKE metadata server team about opening a bug with them. It may very well be that the metadata server is returning bad tokens to the proxy, which would explain the couple of similar GKE cases we have seen.

Will keep you posted on further updates 😄

jackwotherspoon avatar Jan 20 '25 19:01 jackwotherspoon

> I just wanted to point out that in last week's release, v2.14.3, we bumped the Cloud SQL Go Connector dependency to a new version that switched from the older oauth2 Go package to the newer auth package.
>
> I'm somewhat optimistic that this may help reduce some of these auth-token-related intermittent issues.

Nice! I'll go ahead and update to that version on our end and hopefully that solves the issue 🤞

aronsoderling avatar Jan 22 '25 14:01 aronsoderling

Hi @jackwotherspoon and @hessjcg,

We're experiencing frequent errors with the cloud-sql-proxy sidecar. These error logs appear on both the 2.6.1 and the suggested 2.14.3 images:

Authorizing with Application Default Credentials

Error starting proxy: [PROJECT:REGION:INSTANCE] Unable to mount socket: failed to get instance: 
Refresh error: failed to get instance metadata (connection name = "PROJECT:REGION:INSTANCE"): 
Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/PROJECT/instances/INSTANCE/connectSettings?alt=json&prettyPrint=false": 
credentials: cannot fetch token: Get "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https://www.googleapis.com/auth/sqlservice.admin": 
dial tcp 169.254.169.254:80: connect: connection refused

The proxy has encountered a terminal error: unable to start: [PROJECT:REGION:INSTANCE] Unable to mount socket: failed to get instance: 
Refresh error: failed to get instance metadata (connection name = "PROJECT:REGION:INSTANCE"): 
Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/PROJECT/instances/INSTANCE/connectSettings?alt=json&prettyPrint=false": 
credentials: cannot fetch token: Get "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https://www.googleapis.com/auth/sqlservice.admin": 
dial tcp 169.254.169.254:80: connect: connection refused

One item of note: on standard k8s Deployments the issue occurs very rarely, but it is extremely prevalent on the ephemeral Job pods created by our k8s CronJob spec.

Please advise if you have any suggestions or insight. Thank you!
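A `connection refused` from 169.254.169.254 suggests the pod's metadata server agent was not reachable when the proxy started, which can happen when a Job pod lands on a freshly provisioned node. One way to confirm (a sketch only; the function name is illustrative and the URL is copied verbatim from the error above) is to probe the same token endpoint the proxy uses before starting it:

```shell
# check_metadata_token: probe the metadata server token endpoint that
# appears in the error log. The "Metadata-Flavor: Google" header is
# required by the GCE/GKE metadata server; -f makes curl fail on
# HTTP errors, -s suppresses progress output.
check_metadata_token() {
  curl -sf -H "Metadata-Flavor: Google" \
    "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https://www.googleapis.com/auth/sqlservice.admin" \
    > /dev/null
}

# Illustrative gate for a Job entrypoint: wait until the metadata
# server answers instead of letting the proxy fail fast on startup.
# until check_metadata_token; do sleep 1; done
```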

aprettyloner avatar Apr 30 '25 19:04 aprettyloner

Here's another report of a similar connection issue: #2450

hessjcg avatar May 27 '25 15:05 hessjcg

Hi again!

Completely forgot to update this issue, but good news: we haven't seen the error again since updating cloud-sql-proxy. I think the 2.14.3 update did the trick, and we've been running v2.15.1 for the past couple of months without seeing it.

aronsoderling avatar May 28 '25 04:05 aronsoderling

Hi everyone 👋

I ran into the same issue on version 2.15.3.

It might be a coincidence, but none of our other application pods (non-kube-system pods) could start on the node where the proxy was failing to authenticate.

Logs showed this error:

Failed to create pod sandbox: rpc error: code = Unknown desc = 
  failed to create containerd task: failed to create shim task: 
  OCI runtime create failed: runc create failed: unable to start 
  container process: unable to init seccomp: error loading 
  seccomp filter into kernel: error loading seccomp filter: 
  errno 524: unknown

Not sure if both errors (failed to get instance metadata and loading seccomp filter) are symptoms of a single problem or completely separate things.

We are running on GKE Autopilot 1.32.4-gke.1415000.

mtumalewicz avatar Jul 01 '25 08:07 mtumalewicz

@mtumalewicz, the "loading seccomp filter" error is a separate Kubernetes problem, unrelated to the proxy.

@aronsoderling, great to hear that this is now working. I'm going to close this issue.

hessjcg avatar Aug 12 '25 17:08 hessjcg