autoscaler failed to update lock
/area autoscale
Getting the following error when installing the latest version of Kubeflow (v1.5.0) on GCP GKE, which uses knative/serving v0.22.0.
```
Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "autoscaler-bucket-00-of-01": the object has been modified; please apply your changes to the latest version and try again
```
The error usually starts firing every couple of seconds after the autoscaler pod has been up for several minutes without issue.
The same error is mentioned here: https://github.com/knative/serving/issues/11101
Tried manually updating knative/serving to v0.23.1, but the same error happens.
Any ideas on how this can be resolved?
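For anyone triaging, here's roughly how I've been watching it — a sketch assuming a default `knative-serving` namespace (adjust if your install differs):

```shell
# Watch the Lease the autoscaler is competing for; rapid resourceVersion
# churn here is the leader-election traffic behind the error above.
kubectl get lease autoscaler-bucket-00-of-01 -n knative-serving -o yaml -w

# Tail the autoscaler logs to see how frequently the error fires.
kubectl logs -n knative-serving deploy/autoscaler -f | grep "Failed to update lock"
```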
> The error usually starts firing every couple of seconds after the autoscaler pod has been up for several minutes without issue.
Does it eventually resolve itself? And does autoscaling work even with those log messages?
Another thing to note is that Knative v0.22 went end of life back in September. Not sure if your Knative was installed separately or as part of Kubeflow, but if it's the latter you might want to see if Kubeflow can update their version of Knative...
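One quick way to check the second question — a sketch, assuming a Knative Service named `hello` in the default namespace (substitute one of your own services):

```shell
# Watch the service's pods while sending traffic; if pods scale up under
# load and back down afterwards, autoscaling still works despite the noise.
kubectl get pods -l serving.knative.dev/service=hello -w

# In another terminal, generate a burst of traffic against the service URL:
URL=$(kubectl get ksvc hello -o jsonpath='{.status.url}')
for i in $(seq 1 200); do curl -s -o /dev/null "$URL" & done; wait
```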
Thanks for replying.
> Does it eventually resolve itself?

No.

> And does autoscaling work even with those log messages?

Not sure; I just noticed all of the errors after the deploy.
> Another thing to note is that Knative v0.22 went end of life back in September. Not sure if your Knative was installed separately or as part of Kubeflow, but if it's the latter you might want to see if Kubeflow can update their version of Knative...
It was installed via Kubeflow, but we tried upgrading to v0.23.1 with no change in this error. We may try upgrading again.
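For the record, this is how we've been confirming which release is actually running — assuming the `serving.knative.dev/release` label that the Knative 0.x manifests put on the namespace is present:

```shell
# Print the installed knative/serving release from the namespace label.
kubectl get namespace knative-serving \
  -o jsonpath='{.metadata.labels.serving\.knative\.dev/release}'
```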
Found this issue: https://github.com/knative/serving/issues/11544

The activator doesn't restart constantly, but it does start throwing errors connecting to the autoscaler after the autoscaler starts emitting the "Failed to update lock" errors.

That issue links to https://github.com/kubernetes/kubernetes/issues/64924, which indicates that it's a Kubernetes DNS issue. I don't understand exactly how to deal with it, though.
If the error you're having is similar to the one from the Knative issue, it looks like passing a custom resolv.conf to kubelet was the solution.
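Translated into the modern kubelet config file, that fix would look something like the sketch below — the resolv.conf path is just an example, and on GKE the kubelet is Google-managed, so this may not be something you can set directly there:

```yaml
# KubeletConfiguration snippet: point kubelet at a custom resolv.conf
# so pod DNS doesn't inherit a problematic node resolv.conf.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
resolvConf: /etc/kubelet/resolv.conf   # example path to the custom file
```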
If I recall correctly, there's a lot that goes on when installing Kubeflow... might also be worth checking with Kubeflow to see if anyone else has encountered a similar issue.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
This issue or pull request is stale because it has been open for 90 days with no activity.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close

/lifecycle stale
I think this was solved via https://github.com/knative/serving/issues/13447.