Re-use the last desired patch on empty batch
Fixes #3450
@nikola-jokic, a quick question. Do I understand correctly that the empty message here occurs because, when the GetMessage function is called and there are no messages in the queue, the server returns http.StatusAccepted here once the ~50-second long poll times out? Or does the server actually respond with http.StatusOk here but with an empty body?
Do I also understand correctly that with each empty message, the PatchID in the EphemeralRunnerSet will be reset to 0?
I'm asking all these questions because it seems we've encountered a problem that arises when several factors coincide:
- We have a GitHub Workflow that is waiting for a Runner.
- We create an `AutoscalingRunnerSet` in the cluster, and a Pod with a listener is launched.
- The listener receives `TotalAssignedJob` and patches the `EphemeralRunnerSet`.
- On the same iteration, a request is made to retrieve a message from the queue, and a ScaleDown patch is created for the `EphemeralRunnerSet` due to a nil message.
After this, the GitHub Workflow remains in a queued state until another job for this type of runner appears, because the runner didn't have time to process the job due to the ScaleDown patch. Once a new job appears, the process will repeat, and there will be two jobs in the queue waiting to be executed.
According to the code in this PR, it should correct the situation: when a nil message is received, the patch will contain the number of replicas from the previous patch.
Hey @verdel,
You are 100% right. It was a mistake on my end where on an empty batch, I forced the minimum required number of runners to avoid wasting resources. However, I did not account for the case where the runner can be pending for the duration of the empty batch. This change would "force" the last known desired state that the listener knows about.
Is it normal behavior that the PatchID in the EphemeralRunnerSet will be reset to 0 for each empty batch?
It should be okay, since patch ID 0 would only mean: "make sure this state is matched". However, your question just gave me a better idea. Let me test it out and I'll get back to you.
It should be okay, since patch ID 0 would only mean: "make sure this state is matched"
Thank you very much for the detailed explanation.
@nikola-jokic, I still have a question; perhaps you could help with some additional information.
Does the empty message (empty batch) here occur because, when the GetMessage function is called and there are no messages in the queue, the server returns http.StatusAccepted here once the ~50-second long poll times out? Or does the server actually respond with http.StatusOk here but with an empty body?
Thank you, @verdel, for asking all the right questions! This question made me realize that we don't need to force the state on every empty batch. Whenever we scale up, we are matching the desired state. Scale down is already handled by the very nature of the ephemeral runner, so we simply need to clean them up. Only when the scale set is completely idle, i.e. there are no jobs coming in and there are no busy runners, do we "force" the min runners state. This allows the controller to self-correct.
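Roughly, the idle check could look something like this (a minimal sketch; `Worker`, `handleEmptyBatch`, and `publishPatch` are illustrative names, not the actual listener code):

```go
package main

import (
	"context"
	"fmt"
)

// Worker is an illustrative stand-in for the listener state; the fields
// mirror the discussion (lastPatch, minRunners), not the real types.
type Worker struct {
	lastPatch  int
	minRunners int
}

// publishPatch stands in for whatever call patches the EphemeralRunnerSet.
func (w *Worker) publishPatch(ctx context.Context, patchID, replicas int) error {
	fmt.Printf("patch: id=%d replicas=%d\n", patchID, replicas)
	return nil
}

// handleEmptyBatch sketches the idle check: only when the last desired
// state is already down at minRunners (no jobs coming, no busy runners)
// do we "force" the minRunners state with patch ID 0, so the controller
// can self-correct, e.g. recreate a runner that was deleted by hand.
func (w *Worker) handleEmptyBatch(ctx context.Context) error {
	if w.lastPatch <= w.minRunners {
		return w.publishPatch(ctx, 0, w.minRunners)
	}
	// Runners may still be pending or busy: keep the last desired state.
	return nil
}

func main() {
	idle := &Worker{lastPatch: 0, minRunners: 0}
	_ = idle.handleEmptyBatch(context.Background()) // idle: force minRunners

	busy := &Worker{lastPatch: 1, minRunners: 0}
	_ = busy.handleEmptyBatch(context.Background()) // pending/busy runner: no scale down
}
```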
Only when the scale set is completely idle, i.e. there are no jobs coming in and there are no busy runners, do we "force" the min runners state
How will you determine the current number of busy runners and that it equals zero? It seems you want to obtain the number of active runners every time you receive an empty batch (every 50 seconds), and if it equals zero, then patch the EphemeralRunnerSet to set Replicas=0, am I understanding this correctly?
Since I am not very familiar with the specifics of the controller's operation, I cannot fully understand whether the following situation might recur due to an incorrect determination of the number of busy runners:
- We have a runner pod in `pending` status, created due to a task appearing in the queue.
- We do not count it as active/busy and will create a patch with `Replicas=0` at the next empty batch.
I'll try to explain my reasoning.
When the job is received, the number of runners is numberOfJobs + minRunners. Let's say minRunners == 0 and numberOfJobs == 1. The listener would issue a patch for target replicas 1 and set the lastPatch to 1.
Now, we start creating a pod, and it is in a pending state. Next batch comes in, and it is empty. So count == 0 && jobsCompleted == 0.
We would hit the if case, and targetRunnerCount = max(w.lastPatch, targetRunnerCount) would be w.lastPatch. The lastPatch can go down only when the jobsCompleted count is > 0. Since lastPatch is 1 and minRunners is 0, we would only increment the patchID. This ensures that when we have busy runners (or runners that are about to be busy), we don't scale down. At this point, we probably don't even need to issue a patch, but after 50s we can trigger the check to ensure that if someone deleted the ephemeral runner resource by accident, we can recover from that.
Now let's say that the job is done. We get the message saying we need 0 runners, and 1 job is done. We are not hitting the if case, and we calculate the targetRunnerCount, which would be 0. This value is set as the last patch, and patchID is incremented. On the next empty batch, we are hitting the same if case. At this point, the number of runners must be the number of idle runners (i.e. the minRunners count). The reason all of them must be idle is:
- Last patch was to minRunners, so actions back-end told us that we need 0 runners, and there were runners that are done.
- targetRunnerCount must always be >= minRunners
- The only way lastPatch can be less than the previous lastPatch is when we have jobsCompleted, so we can't scale down before the runner picks up the job
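A minimal sketch of this reasoning, using the names from the discussion above (`lastPatch`, `count`, `jobsCompleted`) rather than the actual listener code:

```go
package main

import "fmt"

// targetForBatch sketches the calculation described above. On an empty
// batch (count == 0 && jobsCompleted == 0) the target never drops below
// lastPatch, so a runner that is still pending is not scaled away.
// (Uses the built-in max from Go 1.21+.)
func targetForBatch(lastPatch, minRunners, count, jobsCompleted int) int {
	targetRunnerCount := max(count, minRunners)
	if count == 0 && jobsCompleted == 0 {
		targetRunnerCount = max(lastPatch, targetRunnerCount)
	}
	return targetRunnerCount
}

func main() {
	const minRunners = 0
	lastPatch := 0

	// A job is assigned: count == 1, so the listener patches to 1 runner.
	lastPatch = targetForBatch(lastPatch, minRunners, 1, 0)
	fmt.Println("after job assigned:", lastPatch) // 1

	// Empty batch while the pod is still pending: stays at 1, no scale down.
	lastPatch = targetForBatch(lastPatch, minRunners, 0, 0)
	fmt.Println("after empty batch:", lastPatch) // 1

	// The job completes: count == 0, jobsCompleted == 1, back to minRunners.
	lastPatch = targetForBatch(lastPatch, minRunners, 0, 1)
	fmt.Println("after job completed:", lastPatch) // 0
}
```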
Let's summarize to make sure we understand all the cases in the same way.
Originally, the whole idea of this patch was to restore the number of runners in the case where a runner in a pending state was terminated instead of a runner that had completed its task and was in the process of finishing up. It is also necessary to handle the case where something forcibly terminated a runner that had not yet completed its task.
In general, the only way to handle all cases is to patch the EphemeralRunnerSet regardless of whether there is an event from GitHub. You use an empty batch for this.
These empty batches result from long polling requests to the GitHub API and occur every 50 seconds. The GitHub API on its end has a timeout of 50 seconds and apparently returns HTTP 202 if there are no tasks in the queue within 50 seconds from the start of the request.
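If I understand the flow correctly, the long-poll handling is roughly the following (an illustrative sketch; `getMessage` and `Message` are placeholder names, not the actual actions service client):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// Message is an illustrative payload; the real message carries the
// scale-set statistics discussed above.
type Message struct {
	Statistics json.RawMessage `json:"statistics"`
}

// getMessage sketches the long poll: the server holds the request for up
// to ~50 seconds and answers 202 Accepted when nothing arrived, which the
// listener treats as an empty batch (nil message, nil error).
func getMessage(ctx context.Context, client *http.Client, url string) (*Message, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusAccepted:
		// Long poll timed out with no messages: empty batch.
		return nil, nil
	case http.StatusOK:
		var msg Message
		if err := json.NewDecoder(resp.Body).Decode(&msg); err != nil {
			return nil, err
		}
		return &msg, nil
	default:
		return nil, fmt.Errorf("unexpected status: %d", resp.StatusCode)
	}
}

func main() {
	// Usage sketch only; the URL is a placeholder.
	msg, err := getMessage(context.Background(), http.DefaultClient, "https://example.invalid/message-queue")
	fmt.Println(msg, err)
}
```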
Now, a question about the operation of the controller that monitors the EphemeralRunnerSet. Let's imagine the situation:
- The `MinRunners` value in the EphemeralRunnerSet is `0`.
- A job is started in GitHub Actions.
- An event comes in: `TotalAssignedJobs=1`, `jobsCompleted=0` (am I understanding this correctly?).
- The `Replicas` value increases to `1`, an ephemeral runner pod starts, and it begins performing the task.
- Another job starts in GitHub Actions.
- An event comes in: `TotalAssignedJobs=2`, `jobsCompleted=0` (am I understanding this correctly?).
- The `Replicas` value increases to `2`, another ephemeral runner pod starts, and it is in a `Pending` state.
- The first ephemeral runner pod reports completing the work but hasn't finished yet. We receive an event: `TotalAssignedJobs=1`, `jobsCompleted=1` (am I understanding this correctly?).
- The `Replicas` value decreases to `1`, and the controller monitoring the EphemeralRunnerSet might terminate the pod in the `Pending` state.
But if at this moment the EphemeralRunnerSet indicates Replicas=1, why wouldn't the controller create 1 pod again after the first pod finishes?
The same happens in the case where the EphemeralRunnerSet indicates Replicas=1, and we forcibly terminate an ephemeral runner pod. Why wouldn’t the controller restore the number of replicas to the number indicated in the EphemeralRunnerSet?
Overall, the implementation outlined in this PR solves the problem. If we want to change the EphemeralRunnerSet to forcibly start the Reconcile process, which will align the number of available ephemeral runner pods with the Replicas value, the only available method is to change the PatchID (since Replicas in this case does not change).
Resetting the PatchID to 0 is also a reasonable solution (this will also solve the issue with this counter overflowing). I was initially a bit confused by the name of the PatchID attribute and the fact that it increments for non-empty batches and resets to 0 for empty ones. It took me a while to realize that it's not quite a "revision ID", which theoretically should only ever increase. In reality, it's just a trigger that initiates a forced Reconcile, and its value only needs to be different from the previous one.
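To illustrate that reading (field names here follow this discussion, not necessarily the real CRD): a changed PatchID alone already makes the spec differ, which is enough to re-trigger Reconcile even when Replicas stays the same.

```go
package main

import "fmt"

// EphemeralRunnerSetSpec is illustrative; the field names (Replicas,
// PatchID) follow the discussion, not necessarily the actual CRD.
type EphemeralRunnerSetSpec struct {
	Replicas int
	PatchID  int
}

// needsReconcile: any spec change on the watched object re-triggers the
// reconciliation loop, so flipping PatchID is enough even if Replicas is
// unchanged.
func needsReconcile(prev, next EphemeralRunnerSetSpec) bool {
	return prev != next
}

func main() {
	prev := EphemeralRunnerSetSpec{Replicas: 1, PatchID: 3}
	next := EphemeralRunnerSetSpec{Replicas: 1, PatchID: 0} // empty batch resets PatchID

	fmt.Println(needsReconcile(prev, next)) // true: Reconcile runs and re-checks the runner pods
}
```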
To repeat, the only question I have left is why the controller responsible for the EphemeralRunnerSet wouldn't align the number of runner pods with the Replicas value on its own, without a Reconcile forced by a change to the EphemeralRunnerSet.
This summary is almost completely correct. We do not scale based on jobs assigned; we scale based on the statistics. So on each message that is not an empty batch, the server lets us know the desired number of runners (not counting the minRunners) to do the jobs. So in your example, when the new job comes in, you would have count == 2.
To answer your question, the controller should not always match the number of replicas because the time it takes for the listener to patch the new desired replicas can be and usually is greater than the time the controller can react to ephemeral runner changes.
Each change on an ephemeral runner triggers the reconciliation loop of the ephemeral runner set. If we tried to match the state each time a pod is done, we would end up creating a new pod, then the patch to scale down would arrive, and then we would have to clean it up.
Also, an important thing to note: the patch ID can avoid these problems when there is a job spike. When we start creating ephemeral runners, we tag them with the patch ID. Then each transition from pending to running would not trigger the unnecessary count check and would postpone ephemeral runner cleanups. This ensures that ARC scales up as quickly as possible, since each cleanup would also trigger the reconciliation loop again, causing a slower reaction to scaling up.
Every state change is communicated by the listener, so only then does the ephemeral runner controller need to correct the state. Also, when we start scaling down, these checks are going to be performed more frequently, since the maximum of all patch IDs from ephemeral runners can only be less than or equal to the patch ID of the ephemeral runner set (it will be strictly less once the ephemeral runners with the maximum patch ID are done and cleaned up). However, in this case we can tolerate these checks, and this is when we can potentially overscale. That is why we need patch ID 0, to self-correct when the cluster is idle.
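Roughly, that gating could be sketched like this (illustrative names, not the actual ephemeral runner set controller code):

```go
package main

import "fmt"

// shouldRecount sketches the patch-ID gating described above. Each
// ephemeral runner is tagged with the patch ID it was created for.
func shouldRecount(setPatchID int, runnerPatchIDs []int) bool {
	// Patch ID 0 is the "force the min runners state" signal used when
	// the scale set is idle, so always re-check in that case.
	if setPatchID == 0 {
		return true
	}
	maxRunnerPatch := 0
	for _, id := range runnerPatchIDs {
		if id > maxRunnerPatch {
			maxRunnerPatch = id
		}
	}
	// Runners created for the latest patch still exist: we are in a
	// scale-up spike, so skip the count check and postpone cleanups to
	// avoid re-triggering the reconciliation loop on every pod transition.
	if maxRunnerPatch == setPatchID {
		return false
	}
	// The newest runners are done and cleaned up (scaling down): tolerate
	// the extra checks so any overscale can be corrected.
	return true
}

func main() {
	fmt.Println(shouldRecount(3, []int{2, 3, 3})) // false: scale-up in progress
	fmt.Println(shouldRecount(3, []int{1, 2}))    // true: scaling down
	fmt.Println(shouldRecount(0, nil))            // true: idle, self-correct
}
```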
I carefully reviewed the code of the ephemeral runner set controller and saw how the PatchID is used. Thank you very much for the detailed response.
I was puzzled by this part of the code. In what situation could count become less than or equal to zero if the case condition is total > ephemeralRunnerSet.Spec.Replicas?