
Pods stuck in terminating on AKS cluster with workload identity

Open james-trousdale-lyb opened this issue 3 years ago • 6 comments

Checks

  • [X] I've already read https://github.com/actions/actions-runner-controller/blob/master/TROUBLESHOOTING.md and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I'm not using a custom entrypoint in my runner image

Controller Version

0.26.0

Helm Chart Version

0.21.1

CertManager Version

Using AGIC + key vault cert (no issue)

Deployment Method

ArgoCD

cert-manager installation

Issue is not related to certificates. My webhook server works perfectly fine.

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contact any of the contributors and maintainers if your business is critical and you need priority support)
  • [X] I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • [X] My actions-runner-controller version (v0.x.y) does support the feature
  • [X] I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • [X] I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)

Resource Definitions

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-deployment-amd64-14gb
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
    spec:
      organization: LYB-Digital
      labels:
        - linux
        - amd64
        - 14gb
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/github-runner-amd64
          operator: Exists
        - effect: NoSchedule
          key: kubernetes.azure.com/scalesetpriority
          value: spot
          operator: Equal
      ephemeral: true
      dockerdWithinRunnerContainer: true
      resources:
        limits:
          cpu: 3800m
          memory: 14Gi
        requests:
          cpu: 3800m
          memory: 14Gi
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner-scaler-amd64-14gb
spec:
  minReplicas: 0
  maxReplicas: 10
  scaleTargetRef:
    kind: RunnerDeployment
    name: runner-deployment-amd64-14gb
  scaleUpTriggers:
    - githubEvent:
        workflowJob: {}
      duration: '30m0s'

To Reproduce

1. Use an AKS cluster with workload identity support enabled
2. Allow any job to queue and run (successful or not, it makes no difference)
3. When the job completes, the runner and related resources (e.g., the service account, role binding, etc.) are removed, but the pod gets stuck in a terminating state

Describe the bug

After jobs complete, pods are stuck in a terminating state and, moreover, cannot be patched.

The error, reported both in the ARC logs and when I try to manually patch the resources to remove finalizers, is:

Error from server: admission webhook "mutation.azure-workload-identity.io" denied the request: serviceaccounts "runner-deployment-amd64-14gb-wprcp-mw6nt" not found

It appears that removal of the service account leaves the pod in an error state where it can't be interacted with at all.
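For reference, this is roughly how the failure shows up when trying to clear the finalizers by hand (pod name below is taken from the error message; the exact finalizer set will vary):

```shell
# The job is done on GitHub's side but the pod is stuck in Terminating:
kubectl get pod runner-deployment-amd64-14gb-wprcp-mw6nt

# Attempting to strip the finalizers triggers the mutating webhook,
# which rejects the patch because the service account is already gone:
kubectl patch pod runner-deployment-amd64-14gb-wprcp-mw6nt \
  --type=merge -p '{"metadata":{"finalizers":null}}'
# Error from server: admission webhook "mutation.azure-workload-identity.io"
# denied the request: serviceaccounts "runner-deployment-amd64-14gb-wprcp-mw6nt" not found
```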

Describe the expected behavior

Pods should be removed cleanly.

Whole Controller Logs

https://gist.github.com/james-trousdale-lyb/afcc5b15979c2151ea5aaef14d49f369#file-arc-logs-txt

Whole Runner Pod Logs

https://gist.github.com/james-trousdale-lyb/afcc5b15979c2151ea5aaef14d49f369#file-runner-logs-txt

Additional Context

Note that I installed via ArgoCD from the inflated Helm chart, not the manifests directly, FWIW.

Also, I'm unsure whether this bug is really in ARC or in the workload identity controller; I know the latter is a preview feature.

james-trousdale-lyb avatar Jan 10 '23 12:01 james-trousdale-lyb

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

github-actions[bot] avatar Jan 10 '23 12:01 github-actions[bot]

I confirmed that disabling workload identity on the AKS cluster resolves the issue.

james-trousdale-lyb avatar Jan 10 '23 12:01 james-trousdale-lyb

I am also experiencing this.

dbg-raghulkrishna avatar Feb 14 '23 14:02 dbg-raghulkrishna

We've also enabled AD workload identity on our cluster and our runner pods terminate without issues. From my understanding, workload identity only considers pods labeled with azure.workload.identity/use: "true"
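For context, the scoping described above comes from a pod label the webhook selects on; a generic illustration (not taken from this issue's manifests) would be:

```yaml
# Only pods carrying this label should be intercepted by
# mutation.azure-workload-identity.io; pods without it should
# not hit the webhook when a finalizer is removed.
apiVersion: v1
kind: Pod
metadata:
  name: example-runner        # placeholder name
  labels:
    azure.workload.identity/use: "true"
```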

cmergenthaler avatar Jun 13 '23 11:06 cmergenthaler

Based on https://github.com/kubernetes/kubernetes/issues/121828#issuecomment-1804733961 it appears this is a bug in the azure workload identity controller. This seems to be tracked as https://github.com/Azure/azure-workload-identity/issues/647

It can lead to the Job resource being deleted before the Pod has its batch.kubernetes.io/job-tracking finalizer removed, which shouldn't happen per Kubernetes 1.26 job-tracking semantics.

The webhook should gracefully tolerate the absence of a service account when the pod is being modified to remove a finalizer, instead of failing with serviceaccounts "thepodserviceaccount" not found.

The webhook should not mutate the pod when a finalizer is being removed.

To work around the issue, you can temporarily re-create the service account the mutation.azure-workload-identity.io webhook expects to find, then patch the pod to delete the finalizer.
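The workaround above might look like this in practice (names are placeholders; the service account name must match the one in the webhook's error message):

```shell
# 1. Temporarily re-create the service account the webhook expects to find:
kubectl create serviceaccount runner-deployment-amd64-14gb-wprcp-mw6nt

# 2. The pod can now be patched to drop its finalizers, letting it terminate:
kubectl patch pod runner-deployment-amd64-14gb-wprcp-mw6nt \
  --type=merge -p '{"metadata":{"finalizers":null}}'

# 3. Clean up the temporary service account once the pod is gone:
kubectl delete serviceaccount runner-deployment-amd64-14gb-wprcp-mw6nt
```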

ringerc avatar Nov 22 '23 23:11 ringerc