Patrick Vinograd
Thanks, I'll try that out and see how it performs. My team had one other finding which is that we are not calling `Close` on the aiplatform.PredictionClient. Do you think...
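For anyone following along, the leak pattern in question looks roughly like this. In the real code the client is `aiplatform.NewPredictionClient` from `cloud.google.com/go/aiplatform/apiv1`, which holds a gRPC connection until `Close` is called; `fakeClient` below is a stand-in so the sketch is self-contained:

```go
package main

import "fmt"

// fakeClient stands in for aiplatform.PredictionClient, which wraps a
// gRPC connection that is only released when Close is called.
type fakeClient struct {
	closed bool
}

func (c *fakeClient) Close() error {
	c.closed = true
	return nil
}

// predict mimics a request path that constructs a client per call.
// Without the deferred Close, each call leaks the underlying connection.
func predict(c *fakeClient) error {
	defer c.Close() // release the underlying connection when done
	// ... issue the prediction request here ...
	return nil
}

func main() {
	c := &fakeClient{}
	predict(c)
	fmt.Println("closed:", c.closed)
}
```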
I can't either, and yet we clearly saw 100+ requests with the identical `issued at` in the error. I've been up and down our own code and the google...
And I see fileSubjectProvider is ultimately just doing an `io.ReadAll` so it doesn't seem like there's anything stateful happening there.
I'm adding logging of the projected token payload when we run into this error, so we'll hopefully be able to isolate it to that part of the system.
I was able to simulate a stale k8s service account token. I saved off a k8s token and waited for it to expire. Then I pointed the workload identity client...
1. Yes, AWS EKS.
2. Correct, it's a k8s service account volume token that is used as the basis of the token exchange, per the GCP workload identity federation configuration...
As a datapoint, one way we've consistently been able to trigger this is if the CI workflow being run OOMs.
I believe this is the same as or similar to #3819 - e.g. see [this comment](https://github.com/actions/runner/issues/3819#issuecomment-2884499460) with the "Runner not found" error in the self-hosted runner logs.
Echoing this (I think it's the same as the last couple of comments) - we are seeing a new failure mode where we now have a large number of runners showing in...
This is even worse than the previous failure mode because the pods stay running/consuming resources on our cluster.