Patrick Vinograd
Thanks, I'll try that out and see how it performs. My team had one other finding which is that we are not calling `Close` on the aiplatform.PredictionClient. Do you think...
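For anyone following along, the leak pattern in question looks roughly like this. In the real code the client is `aiplatform.NewPredictionClient` from `cloud.google.com/go/aiplatform/apiv1`, which holds a gRPC connection until `Close` is called; `fakeClient` below is a stand-in so the sketch is self-contained:

```go
package main

import "fmt"

// fakeClient stands in for aiplatform.PredictionClient, which wraps a
// gRPC connection that is only released when Close is called.
type fakeClient struct {
	closed bool
}

func (c *fakeClient) Close() error {
	c.closed = true
	return nil
}

// predict mimics a request path that constructs a client per call.
// Without the deferred Close, each call leaks the underlying connection.
func predict(c *fakeClient) error {
	defer c.Close() // release the underlying connection when done
	// ... issue the prediction request here ...
	return nil
}

func main() {
	c := &fakeClient{}
	predict(c)
	fmt.Println("closed:", c.closed)
}
```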
I can't either, and yet we clearly saw 100+ requests with the identical `issued at` in the error. I've been up and down our own code and the google...
And I see fileSubjectProvider is ultimately just doing an `io.ReadAll` so it doesn't seem like there's anything stateful happening there.
I'm adding logging of the projected token payload when we run into this error, so we'll hopefully be able to isolate it to that part of the system.
I was able to simulate a stale k8s service account token. I saved off a k8s token and waited for it to expire. Then I pointed the workload identity client...
1. Yes, AWS EKS.
2. Correct, it's a k8s service account volume token that is used as the basis of the token exchange, per the GCP workload identity federation configuration...
As a datapoint, one way we've consistently been able to trigger this is if the CI workflow being run OOMs.
I believe this is the same as or similar to #3819 - e.g. see [this comment](https://github.com/actions/runner/issues/3819#issuecomment-2884499460) with the "Runner not found" error in the self-hosted runner logs.
Echoing this (I think it's the same as the last couple of comments) - we are seeing a new failure mode where we now have a large number of runners showing in...
This is even worse than the previous failure mode because the pods stay running/consuming resources on our cluster.