[Issue:3308] SIGTERM Graceful shutdown functionality

Open jaswanthikolla opened this issue 1 year ago • 10 comments

This is to make the runner compatible with Kubernetes' Karpenter and, more generally, with k8s pod movement. It fixes https://github.com/actions/runner/issues/3308 by handling graceful shutdown of the runner. It does the following:

  1. If the runner is idle and just listening for jobs, it exits immediately.
  2. If the runner is running a job, it waits up to RUNNER_GRACEFUL_STOP_TIMEOUT seconds or until the job completes, whichever happens first, before terminating.
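
The two cases above can be sketched in shell roughly as follows. This is a minimal illustration, not the PR's actual code: `wait_for_job` and `shutdown_decision` are hypothetical helper names, an empty job PID stands in for an idle runner, and the 3-second fallback for RUNNER_GRACEFUL_STOP_TIMEOUT is an assumption.

```shell
# Hypothetical helper: wait for a job PID to exit, up to a timeout in seconds.
# Returns 0 if the job finished in time, 1 if the timeout expired first.
wait_for_job() {
    local pid="$1" timeout="$2" waited=0
    # kill -0 sends no signal; it only checks that the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical decision function mirroring the two cases in the list.
# An empty job PID is treated as "runner is idle".
shutdown_decision() {
    local job_pid="$1" timeout="${RUNNER_GRACEFUL_STOP_TIMEOUT:-3}"
    if [ -z "$job_pid" ]; then
        echo "idle"
    elif wait_for_job "$job_pid" "$timeout"; then
        echo "job-finished"
    else
        echo "timed-out"
    fi
}
```

Calling `shutdown_decision ""` reports an idle runner, while passing the PID of a running job blocks until the job exits or the timeout elapses.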

jaswanthikolla avatar Jun 14 '24 02:06 jaswanthikolla

Any ETA on when we can expect a review of this PR?

jaswanthikolla avatar Jun 25 '24 22:06 jaswanthikolla

This would be really great to get in, assuming it works. We're also experiencing this.

ccincotti3 avatar Sep 11 '24 19:09 ccincotti3

Would love to see this merged

moosh3 avatar Oct 27 '24 21:10 moosh3

This PR is an essential bug fix for using the GitHub runner with Karpenter.

joosangkim avatar Nov 04 '24 02:11 joosangkim

Karpenter support is essential for significant cost savings across companies. We easily save $300k+ per year. Scaled across thousands of tech companies, Karpenter support could save a lot of money and the associated CO2 emissions.

Can we prioritize reviewing and merging this PR?

jaswanthikolla avatar Nov 20 '24 01:11 jaswanthikolla

Upvote for the PR. We ended up implementing a custom image and baking in the script. However, we noticed that it is not behaving properly in dind runners as the signal is only captured on the runner container and the docker socket dies. Moving dind to a sidecar container has solved it for us - https://github.com/actions/actions-runner-controller/pull/3842

velkovb avatar Dec 12 '24 10:12 velkovb

@velkovb could I inquire as to the errors you saw when the runner did not capture the signal correctly? I have observed ephemeral PVCs getting stuck in the Released state after docker fails to shut down cleanly, eventually breaking the storage provisioner.

I have been leaning towards using the Kubernetes buildkit driver as the solution, but a sidecar would certainly be easier.

alec-drw avatar Dec 12 '24 13:12 alec-drw

@velkovb could I inquire as to the errors you saw when the runner did not capture the signal correctly? I have observed ephemeral PVCs getting stuck in the Released state after docker fails to shut down cleanly, eventually breaking the storage provisioner.

I have been leaning towards using the Kubernetes buildkit driver as the solution, but a sidecar would certainly be easier.

We were seeing errors saying that the connection to the docker socket was lost during an image build. We get a SIGTERM and the runner container handles it properly, but the dind one doesn't and terminates, so the docker host disappears and the build breaks.

velkovb avatar Dec 12 '24 15:12 velkovb

However, we noticed that it is not behaving properly in dind runners as the signal is only captured on the runner container and the docker socket dies

That's a different issue, fixed in PR https://github.com/actions/actions-runner-controller/pull/3601

jaswanthikolla avatar Jan 28 '25 01:01 jaswanthikolla

For a while this proposed change seemed to do the trick for our runners. However, something seems to have changed: because the Runner.Worker process only runs while a job is in progress, the script would end up hanging, leaving the Runner.Listener process running:

Received SIGTERM,  Graceful shutdown in 1800 Secs ...
error: list of process IDs must follow -p

Usage:
 ps [options]

 Try 'ps --help <simple|list|output|threads|misc|all>'
  or 'ps --help <s|l|o|t|m|a>'
 for additional help text.

For more details see ps(1).
Exiting runner...

Despite the message saying it was exiting, the script does not always exit; instead it hangs, leaving the listener process running:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
runner         1  0.0  0.0   4500  3532 ?        Ss   12:10   0:00 /bin/bash /home/runner/run.sh
runner        10  0.0  0.0   4368  3228 ?        S    12:10   0:00 /bin/bash /home/runner/run-helper.sh
runner        25  0.2  0.0 274048668 109492 ?    Sl   12:10   0:01 /home/runner/bin/Runner.Listener run
runner        86  0.0  0.0   4632  3788 pts/0    Ss   12:21   0:00 /bin/bash
runner       101  0.0  0.0   7068  1568 pts/0    R+   12:21   0:00 ps -aux
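
The `ps` usage error in the log above is consistent with `ps -p` being called with an empty PID, which is what happens when `pgrep Runner.Worker` finds no process. A minimal sketch of a guard against that, where `pid_alive` is a hypothetical helper name and not part of the runner scripts:

```shell
# Hypothetical helper: succeed only if a non-empty PID refers to a live process.
pid_alive() {
    local pid="$1"
    # An empty PID means pgrep found nothing; return early so that
    # ps is never invoked with a bare -p and no process ID.
    [ -n "$pid" ] || return 1
    ps -p "$pid" > /dev/null 2>&1
}

# pgrep exits non-zero when nothing matches, so tolerate that here.
worker_pid=$(pgrep Runner.Worker 2>/dev/null || true)
if pid_alive "$worker_pid"; then
    echo "Worker is running"
else
    echo "No worker process found"
fi
```

With the guard in place, an idle runner takes the "no worker" branch without printing the `ps` usage text.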

I amended the script further to handle exiting cleanly when only this listener is present:

handle_sigterm() {
    # Default graceful stop timeout is 3 seconds
    RUNNER_GRACEFUL_STOP_TIMEOUT=${RUNNER_GRACEFUL_STOP_TIMEOUT:-3}
    echo "Received SIGTERM, " \
        "Graceful shutdown in $RUNNER_GRACEFUL_STOP_TIMEOUT Secs ..."

    if [ -n "$RUNNER_TOKEN" ]; then
        echo "Runner token is still set, de-registering runner..."
        idle_runner="/runner/config.sh remove --token $RUNNER_TOKEN"
    else
        # workaround for Issue#3330
        # For the case JITCONFIG is used instead of reg token.
        # Fallback to check if worker is running,race condition prone.
        worker_process_id=$(pgrep Runner.Worker)
        idle_runner="test -z \"$worker_process_id\""
    fi

    # Check if runner is idle if not then wait for job to finish before stopping
    if ! eval "$idle_runner"; then
        echo "Running a job, waiting for $RUNNER_GRACEFUL_STOP_TIMEOUT s to finish.."
        i=0
        while [[ $i -lt $RUNNER_GRACEFUL_STOP_TIMEOUT ]]; do
            echo "Still waiting for job to finish.."

            # Check again if runner is idle to handle potential race condition
            if [ -z "$worker_process_id" ]; then
                echo "Worker process id not found, trying to find it again.."
                worker_process_id=$(pgrep Runner.Worker)
                # If worker process id is still not found, exit
                if [ -z "$worker_process_id" ]; then
                    echo "Worker process id still not found, exiting.."
                    return
                fi
            fi

            # Check if runner stopped itself
            if ! ps -p "$worker_process_id" > /dev/null; then
                echo "Runner stopped itself while graceful period waiting."
                return
            fi
            sleep 1
            ((i++))
        done
        echo "Graceful period over, terminating..."
    fi

    # Graceful wait period over, kill the worker process
    # Or if worker process was not found, then check for listener process and kill it
    if [ -z "$worker_process_id" ]; then
        echo "Worker process id not found, checking for listener process.."
        listener_process_id=$(pgrep Runner.Listener)
        # Quoted so that an empty result is not treated as "listener found"
        if [ -n "$listener_process_id" ]; then
            echo "Killing listener process id: $listener_process_id"
            kill -INT $listener_process_id
        fi
    else
        echo "Killing worker process id: $worker_process_id"
        kill -INT -$worker_process_id
    fi
}
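
A note on quoting in checks of this kind: when a variable is empty, an unquoted `[ -n $var ]` collapses to `[ -n ]`, which the shell evaluates as true, so a "process found" branch can run even though `pgrep` returned nothing. A small self-contained demonstration, with `pid` as a stand-in variable:

```shell
pid=""   # simulates pgrep finding no process

# Unquoted: the empty variable disappears, leaving `[ -n ]`, a one-argument
# test that only checks that the string "-n" is non-empty, so it succeeds.
if [ -n $pid ]; then unquoted="true"; else unquoted="false"; fi

# Quoted: the test receives an empty string as its operand and fails.
if [ -n "$pid" ]; then quoted="true"; else quoted="false"; fi

echo "unquoted=$unquoted quoted=$quoted"
# prints: unquoted=true quoted=false
```

Quoting every PID variable inside `[ ... ]` avoids this, and also avoids errors if `pgrep` ever returns more than one PID.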

marknet15 avatar Mar 10 '25 13:03 marknet15