autoscaler failed to update lock
/area autoscale
Getting the following error when installing the latest version of Kubeflow (v1.5.0) on GCP GKE, which uses knative/serving v0.22.0.
```
Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "autoscaler-bucket-00-of-01": the object has been modified; please apply your changes to the latest version and try again
```
The error usually starts firing every couple of seconds after the autoscaler pod has been up for several minutes without issue.
The same error is mentioned here: https://github.com/knative/serving/issues/11101
Tried manually updating knative/serving to v0.23.1, but the same error happens.
Any ideas on how this can be resolved?
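For anyone triaging, here's roughly how I've been watching it — a sketch assuming a default `knative-serving` namespace (adjust if your install differs):

```shell
# Watch the Lease the autoscaler is competing for; rapid resourceVersion
# churn here is the leader-election traffic behind the error above.
kubectl get lease autoscaler-bucket-00-of-01 -n knative-serving -o yaml -w

# Tail the autoscaler logs to see how frequently the error fires.
kubectl logs -n knative-serving deploy/autoscaler -f | grep "Failed to update lock"
```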
> The error usually starts firing every couple of seconds after the autoscaler pod has been up for several minutes without issue.
Does it eventually resolve itself? And does autoscaling work even with those log messages?
Another thing to note is that Knative v0.22 went end of life back in September. Not sure if your Knative was installed separately or as part of Kubeflow, but if it's the latter you might want to see if Kubeflow can update their version of Knative...
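One quick way to check the second question — a sketch, assuming a Knative Service named `hello` in the default namespace (substitute one of your own services):

```shell
# Watch the service's pods while sending traffic; if pods scale up under
# load and back down afterwards, autoscaling still works despite the noise.
kubectl get pods -l serving.knative.dev/service=hello -w

# In another terminal, generate a burst of traffic against the service URL:
URL=$(kubectl get ksvc hello -o jsonpath='{.status.url}')
for i in $(seq 1 200); do curl -s -o /dev/null "$URL" & done; wait
```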
Thanks for replying.
> Does it eventually resolve itself?

No.

> And does autoscaling work even with those log messages?

Not sure; I just noticed all of the errors after the deploy.
> Another thing to note is that Knative v0.22 went end of life back in September. Not sure if your Knative was installed separately or as part of Kubeflow, but if it's the latter you might want to see if Kubeflow can update their version of Knative...
It was installed via Kubeflow, but we tried upgrading to v0.23.1 with no change in this error. We may try upgrading again.
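For the record, this is how we've been confirming which release is actually running — assuming the `serving.knative.dev/release` label that the Knative 0.x manifests put on the namespace is present:

```shell
# Print the installed knative/serving release from the namespace label.
kubectl get namespace knative-serving \
  -o jsonpath='{.metadata.labels.serving\.knative\.dev/release}'
```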
Found this issue: https://github.com/knative/serving/issues/11544

The activator doesn't restart constantly, but it does start throwing errors connecting to the autoscaler after the autoscaler starts emitting the "Failed to update lock" errors.

That issue links to https://github.com/kubernetes/kubernetes/issues/64924, which indicates that it's a Kubernetes DNS issue. I don't understand exactly how to deal with it, though.
If the error you're having is similar to the one from the Knative issue, it looks like passing a custom resolv.conf to kubelet was the solution.
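Translated into the modern kubelet config file, that fix would look something like the sketch below — the resolv.conf path is just an example, and on GKE the kubelet is Google-managed, so this may not be something you can set directly there:

```yaml
# KubeletConfiguration snippet: point kubelet at a custom resolv.conf
# so pod DNS doesn't inherit a problematic node resolv.conf.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
resolvConf: /etc/kubelet/resolv.conf   # example path to the custom file
```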
If I recall correctly, there's a lot that goes on when installing Kubeflow... might also be worth checking with Kubeflow to see if anyone else has encountered a similar issue.
This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.
This issue or pull request is stale because it has been open for 90 days with no activity.

This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close

/lifecycle stale
I think this was solved via https://github.com/knative/serving/issues/13447.