"failed to connect to instance: failed to get instance: Refresh error: failed to get instance metadata..." intermittent connection errors
Bug Description
Running cloud-sql-proxy as a sidecar container in GKE with CSQL_PROXY_AUTO_IAM_AUTHN=true and CSQL_PROXY_CREDENTIALS_FILE=PATH_TO_CREDENTIALS intermittently fails to connect to the database instance with the following error message:
failed to connect to instance: failed to get instance: Refresh error: failed to get instance metadata (connection name = "INSTANCE_CONNECTION_NAME"): googleapi: Error 401: Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.
The error details give ACCESS_TOKEN_EXPIRED as the reason (see stacktrace).
Subsequent connection attempts succeed.
This doesn't happen very often: we're seeing roughly 1 error per 10,000 requests in our system over a month. I can't say exactly how many connection attempts those requests correspond to. We've been seeing these errors intermittently for quite some time, at least for the past three months.
Example code (or command)
Configuration looks like this:
CSQL_PROXY_PORT=5432
CSQL_PROXY_AUTO_IAM_AUTHN=true
CSQL_PROXY_CREDENTIALS_FILE=PATH_TO_CREDENTIALS
CSQL_PROXY_QUIET=true
CSQL_PROXY_HEALTH_CHECK=true
CSQL_PROXY_HTTP_PORT=9801
CSQL_PROXY_HTTP_ADDRESS=0.0.0.0
CSQL_PROXY_EXIT_ZERO_ON_SIGTERM=true
CSQL_PROXY_STRUCTURED_LOGS=true
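For reference, the v2 proxy reads any flag from an environment variable with the `CSQL_PROXY_` prefix, so the configuration above should correspond to roughly this CLI invocation (a sketch of the equivalent flags, not our exact deployment command):

```shell
# Equivalent CLI invocation of the sidecar (flags mirror the CSQL_PROXY_* env vars above)
cloud-sql-proxy \
  --port 5432 \
  --auto-iam-authn \
  --credentials-file "PATH_TO_CREDENTIALS" \
  --quiet \
  --health-check \
  --http-port 9801 \
  --http-address 0.0.0.0 \
  --exit-zero-on-sigterm \
  --structured-logs \
  INSTANCE_CONNECTION_NAME
```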
Stacktrace
failed to connect to instance: failed to get instance: Refresh error: failed to get instance metadata (connection name = "INSTANCE_CONNECTION_NAME"): googleapi: Error 401: Request had invalid authentication credentials. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.
Details:
[
{
"@type": "type.googleapis.com/google.rpc.ErrorInfo",
"domain": "googleapis.com",
"metadata": {
"method": "google.cloud.sql.v1beta4.SqlConnectService.GetConnectSettings",
"service": "sqladmin.googleapis.com"
},
"reason": "ACCESS_TOKEN_EXPIRED"
}
]
More details:
Reason: authError, Message: Invalid Credentials
Steps to reproduce?
I haven't been able to reproduce this locally.
Environment
- Docker
- gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.13.0
- cloud-sql-proxy INSTANCE_CONNECTION_NAME
Additional Details
I found this issue https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/2212 which sounds similar.
Hi @aronsoderling,
Thanks for the report. This is a very strange intermittent issue, but it fits with a pattern of issues we have noticed. When the proxy is configured to use IAM Auth, it needs to request an extra auth token. We have observed that in GKE, sometimes the Cloud SQL Auth Proxy (and the Cloud SQL Java Connector) will attempt to use an invalid auth token. We have been unable to reliably reproduce this either.
We are attempting to find the root cause. I am going to leave this issue open in case other members of the community experience something similar.
Thanks for acknowledging, @hessjcg. Let me know if there's any way I can help out.
@aronsoderling Couple small updates from our end.
I just wanted to point out that in our release last week, v2.14.3, we bumped the dependency on the Cloud SQL Go Connector to a new version which has switched over to using the newer auth Go package over the older oauth2 package.
I'm somewhat optimistic that this may help reduce some of these auth token related intermittent issues.
I will also internally reach out to the GKE metadata server team to see about opening up a bug with them. It may very well be that the metadata server is responsible for returning bad tokens to the Proxy which is why we have seen a couple GKE cases similar to this.
Will keep you posted on further updates 😄
Nice! I'll go ahead and update to that version on our end and hopefully that solves the issue 🤞
Hi @jackwotherspoon and @hessjcg,
We're experiencing frequent errors with the cloud-sql-proxy sidecar. These error logs are present on both 2.6.1 and the suggested 2.14.3 images:
Authorizing with Application Default Credentials
Error starting proxy: [PROJECT:REGION:INSTANCE] Unable to mount socket: failed to get instance:
Refresh error: failed to get instance metadata (connection name = "PROJECT:REGION:INSTANCE"):
Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/PROJECT/instances/INSTANCE/connectSettings?alt=json&prettyPrint=false":
credentials: cannot fetch token: Get "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https://www.googleapis.com/auth/sqlservice.admin":
dial tcp 169.254.169.254:80: connect: connection refused
The proxy has encountered a terminal error: unable to start: [PROJECT:REGION:INSTANCE] Unable to mount socket: failed to get instance:
Refresh error: failed to get instance metadata (connection name = "PROJECT:REGION:INSTANCE"):
Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/PROJECT/instances/INSTANCE/connectSettings?alt=json&prettyPrint=false":
credentials: cannot fetch token: Get "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https://www.googleapis.com/auth/sqlservice.admin":
dial tcp 169.254.169.254:80: connect: connection refused
One item of note is that on standard k8s deployments, the issue very rarely occurs. However, it is extremely prevalent on ephemeral Job pods from our k8s cronjob spec.
Please advise if you have any suggestions or insight. Thank you!
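A hedged workaround sketch for the ephemeral Job-pod case, assuming those failures come from the sidecar racing the GKE metadata server at startup (the token URL and scope are taken from the log above; the wrapper function name and retry budget are illustrative, not part of the proxy):

```shell
#!/usr/bin/env bash
# On short-lived Job pods, the GKE metadata server (169.254.169.254) may not be
# reachable yet when the sidecar starts, producing "dial tcp 169.254.169.254:80:
# connect: connection refused". Probing it until it answers before launching the
# proxy is one way to avoid the terminal startup error.

retry_until_ready() {
  # retry_until_ready MAX_ATTEMPTS CMD [ARGS...]
  # Runs CMD until it exits 0, up to MAX_ATTEMPTS times, sleeping 1s between tries.
  local max_attempts=$1
  shift
  local i
  for ((i = 1; i <= max_attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if (( i < max_attempts )); then sleep 1; fi
  done
  return 1
}

# Example use in the container entrypoint: wait for the metadata server's token
# endpoint (the same URL the proxy fetches, per the log above), then start the proxy.
# retry_until_ready 30 curl -sf -H "Metadata-Flavor: Google" \
#   "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https://www.googleapis.com/auth/sqlservice.admin" \
#   >/dev/null && exec cloud-sql-proxy "$INSTANCE_CONNECTION_NAME"
```

The same probe could also live in an initContainer so the proxy container itself stays unchanged.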
Here's another report of a similar connection issue: #2450
Hi again!
Completely forgot about updating this issue, but good news: we haven't seen this issue again since updating the cloud-sql-proxy. I think the 2.14.3 update did the trick, and we've been running v2.15.1 for the past couple months without seeing this issue.
Hi everyone 👋
I ran into the same issue on version 2.15.3.
It might be a coincidence, but none of our other application pods (non-kube-system pods) could start on the node where the proxy was failing to authenticate.
Logs showed this error:
Failed to create pod sandbox: rpc error: code = Unknown desc =
failed to create containerd task: failed to create shim task:
OCI runtime create failed: runc create failed: unable to start
container process: unable to init seccomp: error loading
seccomp filter into kernel: error loading seccomp filter:
errno 524: unknown
I'm not sure whether both errors (failing to get instance metadata and failing to load the seccomp filter) are symptoms of a single problem or completely separate things.
We are running on GKE Autopilot 1.32.4-gke.1415000.
@mtumalewicz, the "loading seccomp filter" error is a separate Kubernetes problem, not related to the proxy.
@aronsoderling, Great to hear that this is now working. I'm going to close this issue.