Add Customizable Failure Threshold for Ephemeral Runner Retries

Open ali-kafel opened this issue 1 year ago • 0 comments

Checks

  • [X] I've already read https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/troubleshooting-actions-runner-controller-errors and I'm sure my issue is not covered in the troubleshooting guide.
  • [X] I am using charts that are officially provided

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • [X] This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • [X] I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Related to this line of code: https://github.com/actions/actions-runner-controller/blob/master/controllers/actions.github.com/ephemeralrunner_controller.go#L202

If an ephemeral runner fails to start more than 5 times, it is marked as failed. If multiple runners fail to start, the failed runners count against the maximum runner limit and block new runners from starting.

1. Create a runner set with any maximum number of runners.
2. Make the runners fail to start and let them be marked as failed until the number of failed runners approaches the maximum.
3. Try to spin up new runners. The failed runners take up space, which blocks new runners from starting or caps how many new runners can start.
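
The effect of these steps can be sketched with a small model. This is only an illustration, assuming failed runners keep counting against the maximum as described above; `runnerSetState` and `availableSlots` are hypothetical names, not types or functions from actions-runner-controller, and the controller's real accounting may differ.

```go
package main

// runnerSetState is a hypothetical, simplified model of a runner scale set.
// It is not a type from actions-runner-controller.
type runnerSetState struct {
	MaxRunners int // configured maximum number of runners for the set
	Running    int // runners that are healthy or still starting
	Failed     int // runners already marked as failed
}

// availableSlots returns how many new runners could still start, assuming
// failed runners keep occupying a slot until they are removed.
func availableSlots(s runnerSetState) int {
	free := s.MaxRunners - s.Running - s.Failed
	if free < 0 {
		return 0
	}
	return free
}
```

Under this model, a set with a maximum of 10 runners, 2 running, and 8 failed has no slots left, even though only 2 runners are doing work.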

Describe the bug

Related to this issue: https://github.com/actions/actions-runner-controller/discussions/3300

Related to this line of code: https://github.com/actions/actions-runner-controller/blob/master/controllers/actions.github.com/ephemeralrunner_controller.go#L202

If an ephemeral runner fails to start more than 5 times, it is marked as failed. If multiple runners fail to start, the failed runners count against the maximum runner limit and block new runners from starting. We need this threshold to be configurable, and we need some way to clean up the failed runners after some time as well.

Describe the expected behavior

We want to be able to set the failure threshold so that we have more time to catch these failing ephemeral runners. Something like this would be great:

case len(ephemeralRunner.Status.Failures) > failedRetryLimit:

We should be able to set it in the Helm chart for actions-runner-controller. It would also be great if the controller automatically cleaned up the failed runners, perhaps once a day.

Additional Context

N/A

Controller Logs

N/A

Runner Pod Logs

N/A

ali-kafel, Aug 07 '24 23:08