[Issue:3308] SIGTERM Graceful shutdown functionality
This is to make the runner compatible with Kubernetes' Karpenter, and with k8s pod movement in general. It fixes https://github.com/actions/runner/issues/3308 by handling graceful shutdown of the runner. It does the following (a rough sketch of the behavior is shown after the list):
- If the runner is just listening for jobs and idle, it will simply exit.
- If the runner is running a job, it will wait RUNNER_GRACEFUL_STOP_TIMEOUT seconds or for job completion, whichever happens first, before terminating.
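The intended behavior could be illustrated by a SIGTERM handler along these lines. This is only a simplified sketch, not the actual code in this PR; RUNNER_GRACEFUL_STOP_TIMEOUT is the variable described above, while the function name and other details are placeholders.

handle_sigterm_sketch() {
  # Simplified illustration only, not the PR's implementation
  timeout="${RUNNER_GRACEFUL_STOP_TIMEOUT:-3}"
  # Oldest Runner.Worker process, if any; empty means the runner is idle
  worker_pid="$(pgrep -o Runner.Worker || true)"
  if [ -z "$worker_pid" ]; then
    # Idle: no job in progress, exit right away
    exit 0
  fi
  # Busy: wait for the job to finish, up to the graceful stop timeout
  for ((i = 0; i < timeout; i++)); do
    kill -0 "$worker_pid" 2>/dev/null || exit 0
    sleep 1
  done
  # Timeout reached before the job finished: interrupt the worker
  kill -INT "$worker_pid"
}
trap handle_sigterm_sketch TERM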
Any ETA on when we can expect a review on this PR?
This would be really great to get in assuming it works, we're also experiencing this.
Would love to see this merged
This PR is an essential bug fix for using github runner with Karpenter.
Karpenter support is essential for significant cost savings across companies. We easily save $300k+ per year; scaled across thousands of tech companies, Karpenter support could save a great deal of money and the associated CO2 emissions.
Can we prioritize reviewing and merging this PR?
Upvote for the PR. We ended up implementing a custom image and baking in the script. However, we noticed that it is not behaving properly in dind runners as the signal is only captured on the runner container and the docker socket dies. Moving dind to a sidecar container has solved it for us - https://github.com/actions/actions-runner-controller/pull/3842
@velkovb could I ask what errors you saw when the runner did not capture the signal correctly? I have observed behavior in which ephemeral PVCs get stuck in the Released state after docker fails to shut down cleanly, eventually breaking the storage provisioner.
Have been leaning towards using the Kubernetes buildkit driver as the solution, but a sidecar would certainly be easier.
We were seeing errors that the connection to the docker socket was lost during an image build. We receive a SIGTERM signal and the runner container handles it properly, but the dind container doesn't and terminates, so the docker host disappears and the build breaks.
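For anyone hitting the same thing before moving to the sidecar approach: one way to make the dind side survive the signal would be an entrypoint that traps SIGTERM and keeps dockerd alive until the runner signals it is done. This is only a sketch under assumptions; the /run/dind/runner-done marker file and DIND_GRACEFUL_STOP_TIMEOUT are hypothetical names, not part of this PR or ARC.

#!/usr/bin/env bash
# Hypothetical dind entrypoint sketch: keep dockerd running after SIGTERM
# until the runner side drops a marker file on a shared emptyDir volume.
set -euo pipefail

dockerd &
DOCKERD_PID=$!

handle_term() {
  echo "dind: received SIGTERM, waiting for the runner to finish..."
  timeout="${DIND_GRACEFUL_STOP_TIMEOUT:-1800}"   # assumed variable name
  for ((i = 0; i < timeout; i++)); do
    [ -f /run/dind/runner-done ] && break          # assumed marker file
    sleep 1
  done
  kill -TERM "$DOCKERD_PID" 2>/dev/null || true
}

trap handle_term TERM
wait "$DOCKERD_PID" || true   # returns early if the trapped SIGTERM arrives
wait "$DOCKERD_PID" || true   # wait again for dockerd to actually exit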
However, we noticed that it is not behaving properly in dind runners as the signal is only captured on the runner container and the docker socket dies
That's a different issue, and it is fixed in PR https://github.com/actions/actions-runner-controller/pull/3601
For a while this proposed change seemed to do the trick for our runners; however, something seems to have changed somewhere. Because the Runner.Worker process is only active while a job is in progress, the script would end up hanging and leaving the Runner.Listener process running:
Received SIGTERM, Graceful shutdown in 1800 Secs ...
error: list of process IDs must follow -p
Usage:
ps [options]
Try 'ps --help <simple|list|output|threads|misc|all>'
or 'ps --help <s|l|o|t|m|a>'
for additional help text.
For more details see ps(1).
Exiting runner...
Despite suggesting it was exiting, it does not in fact do so on all occasions, and instead hangs, leaving the listener process:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
runner 1 0.0 0.0 4500 3532 ? Ss 12:10 0:00 /bin/bash /home/runner/run.sh
runner 10 0.0 0.0 4368 3228 ? S 12:10 0:00 /bin/bash /home/runner/run-helper.sh
runner 25 0.2 0.0 274048668 109492 ? Sl 12:10 0:01 /home/runner/bin/Runner.Listener run
runner 86 0.0 0.0 4632 3788 pts/0 Ss 12:21 0:00 /bin/bash
runner 101 0.0 0.0 7068 1568 pts/0 R+ 12:21 0:00 ps -aux
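The ps error above is presumably just what happens when the PID variable is empty because no Runner.Worker is running, e.g.:

worker_process_id=$(pgrep Runner.Worker)   # empty when no job is in progress
ps -p $worker_process_id                   # expands to just `ps -p`, hence the usage error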
I amended the script further to handle exiting cleanly when only this listener is present:
handle_sigterm() {
  # Default graceful stop timeout is 3 seconds
  RUNNER_GRACEFUL_STOP_TIMEOUT=${RUNNER_GRACEFUL_STOP_TIMEOUT:-3}
  echo "Received SIGTERM, " \
    "Graceful shutdown in $RUNNER_GRACEFUL_STOP_TIMEOUT Secs ..."
  if [ -n "$RUNNER_TOKEN" ]; then
    echo "Runner token is still set, de-registering runner..."
    idle_runner="/runner/config.sh remove --token $RUNNER_TOKEN"
  else
    # Workaround for Issue#3330:
    # for the case where a JIT config is used instead of a registration token,
    # fall back to checking whether the worker is running (race-condition prone).
    worker_process_id=$(pgrep Runner.Worker)
    idle_runner="test -z \"$worker_process_id\""
  fi
  # Check if the runner is idle; if not, wait for the job to finish before stopping
  if ! eval "$idle_runner"; then
    echo "Running a job, waiting for $RUNNER_GRACEFUL_STOP_TIMEOUT s to finish.."
    i=0
    while [[ $i -lt $RUNNER_GRACEFUL_STOP_TIMEOUT ]]; do
      echo "Still waiting for job to finish.."
      # Check again if the runner is idle to handle a potential race condition
      if [ -z "$worker_process_id" ]; then
        echo "Worker process id not found, trying to find it again.."
        worker_process_id=$(pgrep Runner.Worker)
        # If the worker process id is still not found, exit
        if [ -z "$worker_process_id" ]; then
          echo "Worker process id still not found, exiting.."
          return
        fi
      fi
      # Check if the runner stopped itself
      if ! ps -p "$worker_process_id" > /dev/null; then
        echo "Runner stopped itself while graceful period waiting."
        return
      fi
      sleep 1
      ((i++))
    done
    echo "Graceful period over, terminating..."
  fi
  # Graceful wait period is over: kill the worker's process group,
  # or, if the worker process was not found, kill the listener process instead
  if [ -z "$worker_process_id" ]; then
    echo "Worker process id not found, checking for listener process.."
    listener_process_id=$(pgrep Runner.Listener)
    if [ -n "$listener_process_id" ]; then
      echo "Killing listener process id: $listener_process_id"
      kill -INT "$listener_process_id"
    fi
  else
    echo "Killing worker process id: $worker_process_id"
    # Leading dash targets the worker's process group; `--` keeps kill from
    # parsing the negative PID as an option
    kill -INT -- "-$worker_process_id"
  fi
}
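For completeness, the handler only takes effect if it is wired into the entrypoint that starts the runner. Roughly, the wrapper could look like the sketch below; the graceful-stop.sh path and the run.sh invocation are assumptions specific to our image and may differ in yours.

#!/usr/bin/env bash
# Hypothetical wrapper sketch showing how handle_sigterm is registered.

# Source the function defined above (assumed file name)
. /home/runner/graceful-stop.sh

trap handle_sigterm TERM

# Start the runner in the background so this shell can receive SIGTERM
/home/runner/run.sh "$@" &
runner_pid=$!

# `wait` returns early when the trapped SIGTERM arrives, so keep waiting
# until the runner process has actually exited
while kill -0 "$runner_pid" 2>/dev/null; do
  wait "$runner_pid" || true
done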