Lost leader election may be leading to restarts
Source controller seems to be restarting abruptly, potentially due to lost leader election.
❯ k get pod -n flux-system -l app=source-controller --context stag-eun
NAME READY STATUS RESTARTS AGE
source-controller-57bff99765-lz9zq 1/1 Running 73 20d
❯ k get pod -n flux-system -l app=source-controller --context prod-eun
NAME READY STATUS RESTARTS AGE
source-controller-57bff99765-pjt4q 1/1 Running 7 20d
Logs before last restart:
❯ kubectl logs -n flux-system -l app=source-controller --previous --context stag-eun
{"level":"info","ts":"2022-06-30T16:20:54.545Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.7.1-anova-k8s-deployment/be9df238f8324b0f62c2338a991a50164fe66500'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-test","namespace":"marketer"}
{"level":"info","ts":"2022-06-30T16:20:54.825Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/cb507d8bcd12660b03286f04f4cec3e4e5048fed'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"ml","namespace":"ml"}
{"level":"info","ts":"2022-06-30T16:20:55.236Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/d695fe4c1576eafe37e1bcdc5c85ce7781bb2283'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"transcend","namespace":"transcend"}
{"level":"info","ts":"2022-06-30T16:21:02.600Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/dee5837abc258ed1e920429cef70dd1484cd3dae'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"data","namespace":"data"}
{"level":"info","ts":"2022-06-30T16:21:05.469Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/6e3d4626c10d9fc02cb6876546740eb66ec6db2d'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
{"level":"info","ts":"2022-06-30T16:21:12.310Z","logger":"controller.gitrepository","msg":"garbage collected 2 artifacts","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core"}
{"level":"info","ts":"2022-06-30T16:21:14.302Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/3474c21267453b67cbac67f2eae727321bf23f46'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"core","namespace":"core"}
E0630 16:21:15.031656 1 leaderelection.go:367] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/flux-system/leases/source-controller-leader-election": context deadline exceeded
I0630 16:21:15.031717 1 leaderelection.go:283] failed to renew lease flux-system/source-controller-leader-election: timed out waiting for the condition
{"level":"error","ts":"2022-06-30T16:21:23.502Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}
In this specific case, the AKS clusters are syncing with the same Azure DevOps Git repositories (gitImplementation: libgit2). The deployment has clusters in multiple regions, most of which work as expected, apart from the clusters based in North Europe, which get restarted every so often.
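For reference, a quick way to confirm which Git implementation each GitRepository in the affected cluster is configured with (a sketch, assuming the spec.gitImplementation field of the GitRepository API in use at the time):
❯ kubectl get gitrepositories -A --context stag-eun -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,IMPL:.spec.gitImplementation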
First reported by @mfamador at https://github.com/fluxcd/source-controller/issues/402#issuecomment-1171758060.
How do we know this actually is a bug on our side, and not instability of the cluster?
On a side note, when we upgrade controller-runtime the leader election will start using "leases" instead of ConfigMaps: https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.12.0
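A quick way to see which lock objects currently exist for the controller (a sketch; assuming the older controller-runtime default of a combined ConfigMap and Lease lock, both objects may show up):
❯ kubectl get configmaps,leases -n flux-system --context stag-eun | grep leader-election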
@hiddeco we don't, therefore I removed the bug label.
@mfamador I noticed that the staging cluster has a 10-fold occurrence rate on this issue. Have you noticed any other metric that could be correlated to indicate why that is so?
I am assuming that, from an API server perspective, the only difference between the two environments would be the availability that Azure provides. Or are both environments configured with the same API server availability?
Can you please also confirm whether this is happening only in North Europe, or across all regions?
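If it helps to compare the two environments, a rough way to gauge API server responsiveness (a sketch; the exact numbers will obviously vary with load and network path):
❯ kubectl get --raw='/readyz?verbose' --context stag-eun
❯ time kubectl get leases -n flux-system --context stag-eun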
@pjbgf we're only using the rc-a00d0edc tag on the North Europe regions due to #402.
Regarding API server availability, yes, prod-eun has the uptime SLA activated while stag-eun hasn't.
In the other regions we're using v0.25.5 and don't see as many restarts, but those pods have only been running for a week.
❯ k get pod -n flux-system -l app=source-controller --context stag-eun
NAME READY STATUS RESTARTS AGE
source-controller-57bff99765-lz9zq 1/1 Running 83 20d
❯ k get pod -n flux-system -l app=source-controller --context stag-ue
NAME READY STATUS RESTARTS AGE
source-controller-dcd8fc8cb-vsqvg 1/1 Running 3 8d
❯ k get pod -n flux-system -l app=source-controller --context prod-ue
NAME READY STATUS RESTARTS AGE
source-controller-dcd8fc8cb-5xt94 1/1 Running 0 8d
❯ k get pod -n flux-system -l app=source-controller --context prod-eun
NAME READY STATUS RESTARTS AGE
source-controller-57bff99765-pjt4q 1/1 Running 7 20d
❯ k describe deploy source-controller -n flux-system --context stag-eun | grep -i image
Image: ghcr.io/fluxcd/source-controller:rc-a00d0edc
❯ k describe deploy source-controller -n flux-system --context stag-ue | grep -i image
Image: ghcr.io/fluxcd/source-controller:v0.25.5
❯ k describe deploy source-controller -n flux-system --context prod-eun | grep -i image
Image: ghcr.io/fluxcd/source-controller:rc-a00d0edc
❯ k describe deploy source-controller -n flux-system --context prod-ue | grep -i image
Image: ghcr.io/fluxcd/source-controller:v0.25.5
I'll leave here the latest logs from the most problematic region, stag-eun:
❯ kubectl logs -n flux-system -l app=source-controller --previous --context stag-eun
{"level":"info","ts":"2022-07-01T10:45:07.434Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.3.6-anova-k8s-deployment/61e09533622381155e867cff9c0beeccfff0049b'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-test","namespace":"vtm"}
{"level":"info","ts":"2022-07-01T10:45:12.235Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v0.4.7-anova-argo-workflow/5caa3023d61b6a11b3de52ad00232853296e92ae'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-workflow-helm-chart-stable","namespace":"connectivity"}
{"level":"info","ts":"2022-07-01T10:45:21.347Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v0.4.7-anova-argo-workflow/5caa3023d61b6a11b3de52ad00232853296e92ae'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-workflow-helm-chart-stable","namespace":"data"}
{"level":"info","ts":"2022-07-01T10:45:23.474Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'master/58e004f856d450bb873fef4786623edca2b562f2'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"devices","namespace":"devices"}
{"level":"info","ts":"2022-07-01T10:45:24.055Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.3.6-anova-k8s-deployment/61e09533622381155e867cff9c0beeccfff0049b'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-stable","namespace":"ml"}
{"level":"info","ts":"2022-07-01T10:45:24.778Z","logger":"controller.gitrepository","msg":"no changes since last reconcilation: observed revision 'refs/tags/v2.8.0-anova-k8s-deployment/9e3640a46a17b08e6c9bc506d697b81fd32b5c32'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"anova-helm-chart-test","namespace":"transcend"}
I0701 10:45:25.600013 1 trace.go:205] Trace[1120697951]: "Reflector ListAndWatch" name:k8s.io/[email protected]/tools/cache/reflector.go:167 (01-Jul-2022 10:44:23.736) (total time: 61863ms):
Trace[1120697951]: ---"Objects listed" error:<nil> 61861ms (10:45:25.597)
Trace[1120697951]: [1m1.863913038s] [1m1.863913038s] END
{"level":"error","ts":"2022-07-01T10:45:25.768Z","logger":"setup","msg":"problem running manager","error":"leader election lost"}
@mfamador thanks for sharing the above, I will try to reproduce the issue and get back to you.
By the way, the changes in rc-a00d0edc are now included in the official releases from v0.25.6 onwards, so feel free to upgrade to the latest and have the same version across your fleet.
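If the North Europe API servers remain slow even on the latest version, the leader election timings can also be relaxed. A sketch, assuming the controller build exposes the runtime leader-election flags (please check source-controller --help for the exact names and defaults); the current args can be inspected with:
❯ kubectl -n flux-system get deploy source-controller --context stag-eun -o jsonpath='{.spec.template.spec.containers[0].args}'
If supported, flags along the lines of --leader-election-lease-duration=60s and --leader-election-renew-deadline=50s would give the controller more headroom before giving up the lock.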