Pods stuck in terminating on AKS cluster with workload identity
Checks
- [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
- [X] I'm not using a custom entrypoint in my runner image
Controller Version
0.26.0
Helm Chart Version
0.21.1
CertManager Version
Using AGIC + key vault cert (no issue)
Deployment Method
ArgoCD
cert-manager installation
Issue is not related to certificates. My webhook server works perfectly fine.
Checks
- [X] This isn't a question or user support case (for Q&A and community support, go to Discussions; it might also be a good idea to contact contributors and maintainers directly if this is business-critical and you need priority support)
- [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
- [X] My actions-runner-controller version (v0.x.y) does support the feature
- [X] I've already upgraded ARC (including the CRDs; see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest version, and it didn't fix the issue
- [X] I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)
Resource Definitions
```yaml
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-deployment-amd64-14gb
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
    spec:
      organization: LYB-Digital
      labels:
        - linux
        - amd64
        - 14gb
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/github-runner-amd64
          operator: Exists
        - effect: NoSchedule
          key: kubernetes.azure.com/scalesetpriority
          value: spot
          operator: Equal
      ephemeral: true
      dockerdWithinRunnerContainer: true
      resources:
        limits:
          cpu: 3800m
          memory: 14Gi
        requests:
          cpu: 3800m
          memory: 14Gi
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner-scaler-amd64-14gb
spec:
  minReplicas: 0
  maxReplicas: 10
  scaleTargetRef:
    kind: RunnerDeployment
    name: runner-deployment-amd64-14gb
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: '30m0s'
```
To Reproduce
1. Use an AKS cluster with workload identity support enabled
2. Allow any job to queue and run (successful or not makes no difference)
3. When the job completes, the runner and its related resources (e.g., ServiceAccount, RoleBinding) are removed, but the pod gets stuck in the Terminating state
Describe the bug
After jobs complete, pods are stuck in the Terminating state and cannot even be patched.
The error, reported both in the ARC logs and when I try to manually patch the pods to remove their finalizers, is:
```
Error from server: admission webhook "mutation.azure-workload-identity.io" denied the request: serviceaccounts "runner-deployment-amd64-14gb-wprcp-mw6nt" not found
```
It appears that removal of the service account renders the pod in an error state where it can't be interacted with at all.
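For what it's worth, this is how the failure looks when clearing the finalizers by hand (a sketch; the pod name comes from the error above and will differ on your cluster):

```shell
# Attempt to clear the finalizers on the stuck pod. The request is
# rejected by the Azure workload identity mutating webhook, which fails
# because the pod's service account (same name in ARC) no longer exists:
kubectl patch pod runner-deployment-amd64-14gb-wprcp-mw6nt \
  --type=merge -p '{"metadata":{"finalizers":null}}'
# Error from server: admission webhook "mutation.azure-workload-identity.io"
# denied the request: serviceaccounts "runner-deployment-amd64-14gb-wprcp-mw6nt" not found
```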
Describe the expected behavior
Pods should be removed cleanly.
Whole Controller Logs
https://gist.github.com/james-trousdale-lyb/afcc5b15979c2151ea5aaef14d49f369#file-arc-logs-txt
Whole Runner Pod Logs
https://gist.github.com/james-trousdale-lyb/afcc5b15979c2151ea5aaef14d49f369#file-runner-logs-txt
Additional Context
Please note that I installed via ArgoCD using the inflated Helm chart, not the manifests directly, FWIW.
Also, I'm unsure if this bug is really with ARC or with the workload identity controller. I know that's a preview feature.
I confirmed that disabling workload identity on the AKS cluster resolves the issue.
I am also experiencing this.
We've also enabled AD workload identity on our cluster and don't have any issues with runner pods not terminating. From my understanding, workload identity only considers pods that are labeled with `azure.workload.identity/use: "true"`.
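In pod terms, the opt-in described above is this label (a minimal fragment; whether anything in your setup adds it to the runner pods is the thing to check):

```yaml
# Pods opted in to Azure AD workload identity carry this label;
# per the comment above, the webhook shouldn't touch unlabeled pods.
metadata:
  labels:
    azure.workload.identity/use: "true"
```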
Based on https://github.com/kubernetes/kubernetes/issues/121828#issuecomment-1804733961, this appears to be a bug in the Azure workload identity controller, tracked as https://github.com/Azure/azure-workload-identity/issues/647.
It can lead to the Job resource being deleted before the Pod has its `batch.kubernetes.io/job-tracking` finalizer removed, which shouldn't happen per Kubernetes 1.26 job tracking.
The webhook should gracefully tolerate the absence of a service account when a pod is being modified only to remove a finalizer, instead of failing with `serviceaccounts "thepodserviceaccount" not found`.
More generally, the webhook should not mutate a pod at all when a finalizer is being removed.
To work around the issue, you can temporarily re-create the service account the mutation.azure-workload-identity.io webhook expects to find, then patch the pod to delete the finalizer.
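As a sketch of that workaround (the pod/service-account name is taken from the error message earlier in this thread; substitute your own stuck pod's name):

```shell
# 1. Re-create the service account the webhook is looking for.
#    In ARC, the service account name matches the stuck pod's name.
kubectl create serviceaccount runner-deployment-amd64-14gb-wprcp-mw6nt

# 2. With the service account present, the mutating webhook no longer
#    rejects the patch, so the finalizers can be cleared and the pod
#    finishes terminating.
kubectl patch pod runner-deployment-amd64-14gb-wprcp-mw6nt \
  --type=merge -p '{"metadata":{"finalizers":null}}'

# 3. Clean up the temporary service account.
kubectl delete serviceaccount runner-deployment-amd64-14gb-wprcp-mw6nt
```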