
An error occurred: Runner not found

Open theihor opened this issue 8 months ago • 33 comments

Describe the bug

Hello. I am maintaining Linux Kernel BPF CI.

We use self-hosted runners on various hardware. Recently they started failing with the following vague error: An error occurred: Runner not found. It affects all architectures that we use (x86_64, aarch64 and s390x). This then causes the runner systemd service to restart, and then the GH App quickly runs out of tokens due to frequent restarts.

Sometimes some runners are able to pick up a job when the tokens reset, but most fail with this error.

As far as I can tell, the "Runner not found" message indicates a RunnerNotFoundException caught at https://github.com/actions/runner/blob/main/src/Runner.Listener/Program.cs#L148-L153

The only place it is thrown from is https://github.com/actions/runner/blob/main/src/Sdk/WebApi/WebApi/BrokerHttpClient.cs#L118-L119, which means the GitHub broker endpoint is returning this error (roughly the long-poll sketched below).
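
For illustration, the failing call is the listener's long-poll against the broker (GetRunnerMessageAsync). A rough curl equivalent of that poll is below; this is a sketch only, not a supported API, and the broker host, OAuth token, and session id are placeholders for values the real listener reads from its .credentials and session state:

# Sketch of the broker long-poll; a 404 here is what the listener
# reports as "An error occurred: Runner not found".
curl -sS -G "https://broker.actions.githubusercontent.com/message" \
  -H "Authorization: Bearer ${RUNNER_OAUTH_TOKEN}" \
  --data-urlencode "sessionId=${SESSION_ID}" \
  --data-urlencode "status=Idle" \
  --data-urlencode "runnerVersion=2.323.0" \
  --data-urlencode "os=Linux" \
  --data-urlencode "architecture=X64" \
  --data-urlencode "disableUpdate=true" \
  -w '\nHTTP %{http_code}\n'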

However, there is no clue what it means exactly. The runners are ephemeral with no reuse, and "replace" themselves on each run.

There were no recent changes to the self-hosted infra, or runner updates, or in the workflows.

From systemd logs on the hosts, it appears that these errors first started to appear some time in April, but in the last few days they became much more frequent to the point of blocking the testing pipeline.

To Reproduce N/A

Expected behavior

Runners should pick up and run jobs normally.

Runner Version and Platform

v2.323.0 on Linux x86_64, aarch64 and s390x

What's not working?

Runners fail to pick up jobs after successful authentication.

Job Log Output

May 13 00:07:18 bpf-ci-runner-s390x-01 systemd[1]: Started [email protected] - Ephemeral GitHub Actions Runner Container for kernel-patches - 01.
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: Runner reusage is disabled
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: Obtaining the token of the runner
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: Ephemeral option is enabled
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: Disable auto update option is enabled
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: Configuring
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: --------------------------------------------------------------------------------
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: |        ____ _ _   _   _       _          _        _   _                      |
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: |       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: |      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: |      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: |       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: |                                                                              |
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: |                       Self-hosted runner registration                        |
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: |                                                                              |
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: --------------------------------------------------------------------------------
May 13 00:07:19 bpf-ci-runner-s390x-01 docker[5099]: # Authentication
May 13 00:07:21 bpf-ci-runner-s390x-01 docker[5099]: √ Connected to GitHub
May 13 00:07:21 bpf-ci-runner-s390x-01 docker[5099]: # Runner Registration
May 13 00:07:22 bpf-ci-runner-s390x-01 docker[5099]: A runner exists with the same name
May 13 00:07:22 bpf-ci-runner-s390x-01 docker[5099]: √ Successfully replaced the runner
May 13 00:07:22 bpf-ci-runner-s390x-01 docker[5099]: √ Runner connection is good
May 13 00:07:22 bpf-ci-runner-s390x-01 docker[5099]: # Runner settings
May 13 00:07:22 bpf-ci-runner-s390x-01 docker[5099]: √ Settings Saved.
May 13 00:07:24 bpf-ci-runner-s390x-01 docker[5099]: √ Connected to GitHub
May 13 00:07:24 bpf-ci-runner-s390x-01 docker[5099]: Current runner version: '2.323.0'
May 13 00:07:24 bpf-ci-runner-s390x-01 docker[5099]: 2025-05-13 00:07:24Z: Listening for Jobs
May 13 00:07:25 bpf-ci-runner-s390x-01 docker[5099]: An error occurred: Runner not found
May 13 00:07:25 bpf-ci-runner-s390x-01 systemd[1]: [email protected]: Main process exited, code=exited, status=1/FAILURE
May 13 00:07:25 bpf-ci-runner-s390x-01 systemd[1]: [email protected]: Failed with result 'exit-code'.

Runner and Worker's Diagnostic Logs

Didn't have a chance to capture these. Will add a comment when I do.

theihor avatar May 13 '25 00:05 theihor

Can you share more detail about your runner? e.g. what's the name of your runner and how does it get set up and replaced?

TingluoHuang avatar May 13 '25 01:05 TingluoHuang

Hello, we are facing the same issue at our organization with two of our three runners. All three runners are:

  • self hosted
  • organization scoped
  • in the same group
  • x64, running as a Docker container on a Linux virtual machine
  • using the myoung34/github-runner image

The two failing runners are ephemeral, registered via a GitHub personal access token (valid until 2026), and started by a systemd service running docker run:

ExecStartPre=-/usr/bin/docker stop %N
ExecStartPre=-/usr/bin/docker rm %N
ExecStart=/usr/bin/docker run --rm \
                              --env-file /etc/ephemeral-github-actions-runner.env \
                              -e RUNNER_NAME=%H \
                              -v /var/run/docker.sock:/var/run/docker.sock \
                              -v /tmp/runner:/tmp/runner \
                              --name %N \
                              --security-opt label=disable \
                              myoung34/github-runner:latest

The one functional runner is not ephemeral and is started directly with docker compose up:

services:

  github-runner:
    image: myoung34/github-runner:latest
    container_name: github-runner
    restart: unless-stopped
    security_opt:
      - label:disable
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /tmp/runner:/tmp/runner
    env_file:
      - .env

The issue occurs regardless of the runner version, both with latest and with the older 2.322.0.

All runners had been working properly until today. Nothing has changed on our infrastructure.

Job log output

May 13 08:06:10 docker[4073]: Runner reusage is disabled
May 13 08:06:10 docker[4073]: Obtaining the token of the runner
May 13 08:06:11 docker[4073]: Ephemeral option is enabled
May 13 08:06:11 docker[4073]: Configuring
May 13 08:06:11 docker[4073]: --------------------------------------------------------------------------------
May 13 08:06:11 docker[4073]: |        ____ _ _   _   _       _          _        _   _                      |
May 13 08:06:11 docker[4073]: |       / ___(_) |_| | | |_   _| |__      / \   ___| |_(_) ___  _ __  ___      |
May 13 08:06:11 docker[4073]: |      | |  _| | __| |_| | | | | '_ \    / _ \ / __| __| |/ _ \| '_ \/ __|     |
May 13 08:06:11 docker[4073]: |      | |_| | | |_|  _  | |_| | |_) |  / ___ \ (__| |_| | (_) | | | \__ \     |
May 13 08:06:11 docker[4073]: |       \____|_|\__|_| |_|\__,_|_.__/  /_/   \_\___|\__|_|\___/|_| |_|___/     |
May 13 08:06:11 docker[4073]: |                                                                              |
May 13 08:06:11 docker[4073]: |                       Self-hosted runner registration                        |
May 13 08:06:11 docker[4073]: |                                                                              |
May 13 08:06:11 docker[4073]: --------------------------------------------------------------------------------
May 13 08:06:11 docker[4073]: # Authentication
May 13 08:06:12 docker[4073]: √ Connected to GitHub
May 13 08:06:12 docker[4073]: # Runner Registration
May 13 08:06:12 docker[4073]: A runner exists with the same name
May 13 08:06:12 docker[4073]: √ Successfully replaced the runner
May 13 08:06:14 docker[4073]: √ Runner connection is good
May 13 08:06:14 docker[4073]: # Runner settings
May 13 08:06:14 docker[4073]: √ Settings Saved.
May 13 08:06:14 docker[4073]: √ Connected to GitHub
May 13 08:06:15 docker[4073]: Current runner version: '2.322.0'
May 13 08:06:15 docker[4073]: 2025-05-13 08:06:15Z: Listening for Jobs
May 13 08:06:16 docker[4073]: An error occurred: Runner not found
May 13 08:06:17 systemd[1]: ephemeral-github-actions-runner.service: Main process exited, code=exited, status=1/FAILURE
May 13 08:06:17 systemd[1]: ephemeral-github-actions-runner.service: Failed with result 'exit-code'.
May 13 08:06:17 systemd[1]: ephemeral-github-actions-runner.service: Scheduled restart job, restart counter is at 2.

kargerova avatar May 13 '25 09:05 kargerova

I removed the runners from the organization settings and, since they are ephemeral, they registered themselves again. Now the runners are working properly.

So this is not an issue for us anymore, but I don't know what really happened. Hope this helps.

kargerova avatar May 13 '25 11:05 kargerova

@TingluoHuang

Can you share more detail about your runner?

Very similar setup to what @kargerova described above.

  • self hosted
  • organization scoped
  • in the same group
  • x64, running as a Docker container on a Linux virtual machine

All of these apply to us as well. The only difference is that we customize the myoung34/github-runner image, see Dockerfile.

Runners are provisioned with Ansible scripts. In short, a systemd service is set up to maintain the runner containers (a sketch of the token-generation step it calls follows the unit):

[Unit]
Description=Ephemeral GitHub Actions Runner Container for kernel-patches - %i
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
Restart=always
EnvironmentFile=/etc/actions-runner/runner_unit.env
# Optionally loaded file. Use this to override per runner environment
EnvironmentFile=-/etc/actions-runner/runner_unit-%i.env
ExecStartPre=-/usr/bin/docker stop %p-%i
ExecStartPre=-/usr/bin/docker rm %p-%i
ExecStartPre=-/usr/bin/docker pull ghcr.io/kernel-patches/runner:${DOCKER_TAG}
ExecStartPre=-/etc/actions-runner/gh_token_generator.sh 250536 "/etc/actions-runner/kernel-patches-runner_250536_priv.pem" "/etc/actions-runner/actions-runner-kernel-patches-worker-%i-ghtoken.env"
ExecStart=/usr/bin/docker run --device=/dev/kvm \
            --tmpfs /tmp/work:exec \
            --rm \
            --env-file "/etc/actions-runner/actions-runner-kernel-patches-worker-%i.env" \
            --env-file "/etc/actions-runner/actions-runner-kernel-patches-worker-%i-ghtoken.env" \
            --name %p-%i \
            ghcr.io/kernel-patches/runner:${DOCKER_TAG}

[Install]
WantedBy=multi-user.target
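
Since the unit references gh_token_generator.sh (not included here), here is roughly what such a generator might look like. This is a simplified sketch, not the actual script: it assumes a single app installation and that the runner image consumes an ACCESS_TOKEN variable.

#!/usr/bin/env bash
# Sketch: GitHub App id + private key in, env file with an installation token out.
set -euo pipefail
APP_ID="$1" PEM="$2" OUT="$3"

b64url() { openssl base64 -A | tr '+/' '-_' | tr -d '='; }

# Build a short-lived RS256 JWT for the GitHub App.
now=$(date +%s)
header=$(printf '{"alg":"RS256","typ":"JWT"}' | b64url)
payload=$(printf '{"iat":%s,"exp":%s,"iss":"%s"}' "$((now - 60))" "$((now + 540))" "$APP_ID" | b64url)
sig=$(printf '%s.%s' "$header" "$payload" | openssl dgst -sha256 -sign "$PEM" -binary | b64url)
jwt="${header}.${payload}.${sig}"

# Assume a single installation; exchange the JWT for an installation access token.
inst_id=$(curl -sS -H "Authorization: Bearer $jwt" -H "Accept: application/vnd.github+json" \
  https://api.github.com/app/installations | jq -r '.[0].id')
token=$(curl -sS -X POST -H "Authorization: Bearer $jwt" -H "Accept: application/vnd.github+json" \
  "https://api.github.com/app/installations/${inst_id}/access_tokens" | jq -r '.token')

# The runner image's entrypoint is assumed to exchange this for a registration token.
printf 'ACCESS_TOKEN=%s\n' "$token" > "$OUT"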

e.g. what's the name of your runner and how does it get set up and replaced?

Runner names have a format like this: <prefix>-<host>-worker-<i>. Examples: bm-i-0a77d6f5721623905-worker-01, bpf-ci-runner-s390x-03-worker-01.

When I say "replaced", I mean that normally I'd see Runner successfully added in the log, which I presume means that, from GitHub's point of view, a new "runner" is technically created every time the container starts. Log snippet from a successful job run:

May 13 11:05:00 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: # Authentication
May 13 11:05:01 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: √ Connected to GitHub
May 13 11:05:01 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: # Runner Registration
May 13 11:05:02 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: √ Runner successfully added
May 13 11:05:02 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: √ Runner connection is good
May 13 11:05:02 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: # Runner settings
May 13 11:05:02 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: √ Settings Saved.
May 13 11:05:03 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: √ Connected to GitHub
May 13 11:05:04 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: Current runner version: '2.323.0'
May 13 11:05:04 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: 2025-05-13 11:05:04Z: Listening for Jobs
May 13 11:05:07 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: 2025-05-13 11:05:07Z: Running job: x86_64 gcc-14 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc-14
May 13 11:11:31 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: 2025-05-13 11:11:31Z: Job x86_64 gcc-14 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc-14 completed with result: Succeeded
May 13 11:11:31 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: √ Removed .credentials
May 13 11:11:31 ip-10-0-0-97.us-west-1.compute.internal docker[16128]: √ Removed .runner

theihor avatar May 13 '25 15:05 theihor

I removed the runners from the organization settings and, since they are ephemeral, they registered themselves again. Now the runners are working properly.

So this is not an issue for us anymore, but I don't know what really happened. Hope this helps.

@kargerova I will try this, thanks for the tip.
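
For anyone who wants to script the removal instead of clicking through the org settings UI, this can also be done via the REST API, e.g. with the gh CLI (a sketch; $ORG, $RUNNER_ID and a token with sufficient org admin permissions are assumed):

# List org-level self-hosted runners (id, name, status).
gh api "/orgs/${ORG}/actions/runners" --paginate \
  --jq '.runners[] | "\(.id)\t\(.name)\t\(.status)"'

# Delete a specific runner by id.
gh api -X DELETE "/orgs/${ORG}/actions/runners/${RUNNER_ID}"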

theihor avatar May 13 '25 15:05 theihor

For anyone interested, a working mitigation is to use a unique runner name each time the runner container starts.

If you're using the myoung34/github-runner image, you can achieve this with:

RUNNER_NAME_PREFIX=your-runner-name
RANDOM_RUNNER_SUFFIX=true
# instead of
# RUNNER_NAME=your-runner-name

These are handled by entrypoint.sh.
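
If you register runners with config.sh directly instead of through that image, the same idea is to append a random suffix to the name yourself. A minimal sketch (the URL, base name and token variable below are placeholders):

# Register with a fresh, unique name on every start.
SUFFIX=$(head -c4 /dev/urandom | od -An -tx1 | tr -d ' \n')
./config.sh \
  --unattended \
  --ephemeral \
  --disableupdate \
  --name "bpf-ci-runner-s390x-01-worker-01-${SUFFIX}" \
  --url "https://github.com/kernel-patches" \
  --token "${RUNNER_REGISTRATION_TOKEN}"
./run.sh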

Given that this helps, I suspect the GitHub broker might return this error when a runner name is reused (maybe with an "unexpected" access token?).

theihor avatar May 13 '25 20:05 theihor

We're experiencing this issue as well with Tartelet. Manually deleting the runner in GitHub settings fixes the issue for a while. It will eventually happen again.

djtarazona avatar May 20 '25 21:05 djtarazona

I believe this is the same or similar as #3819 - e.g. see this comment with the "Runner not found" error in the self-hosted runner logs.

patrickvinograd avatar May 29 '25 16:05 patrickvinograd

Hi, it seems to me that it doesn't matter whether you're using a fork of the runner base image or the official one from this repository, because:

There were no recent changes to the self-hosted infra, or runner updates, or in the workflows.

The first failed EphemeralRunners were created a couple of hours after we updated the runner base image!

Although I'm not using a fork of the runner, I'm seeing very similar issues and logs myself: Image

atsu85 avatar Jun 03 '25 07:06 atsu85

I got the same issue using Tartelet with runner version 2.324.0.

GChanathip avatar Jun 03 '25 12:06 GChanathip

Same here, version 2.324.0.

austinorth avatar Jun 03 '25 18:06 austinorth

We were experiencing this issue where tasks were queued for 10+ minutes before executing, starting yesterday afternoon PDT. We restarted the listeners trying to mitigate, but saw no change. This morning we upgraded our local runners to 2.325.0 hoping to fix the task-stalling problem. For about an hour, tasks started runners quickly and ran them to completion, but just now a task hung again, with its runner logging:

medium-579sl-runner-627xk dind cat: can't open '/proc/net/arp_tables_names': No such file or directory
medium-579sl-runner-627xk dind iptables v1.8.11 (nf_tables)
medium-579sl-runner-627xk runner Getting Dockerhub token and quay.io password from Chamber and saving it to local, encrypted storage called 'pass'. Docker will use 'pass credential helper' to authenticate to registries.
medium-579sl-runner-627xk runner mkdir: created directory '/home/runner/.password-store/docker-credential-helpers'
medium-579sl-runner-627xk runner mkdir: created directory '/home/runner/.password-store/docker-credential-helpers/OBFUSCATE'
medium-579sl-runner-627xk runner
medium-579sl-runner-627xk runner
medium-579sl-runner-627xk runner mkdir: created directory '/home/runner/.password-store/docker-credential-helpers/OBFUSCATE'
medium-579sl-runner-627xk runner
medium-579sl-runner-627xk runner
medium-579sl-runner-627xk runner Run original entrypoint from upstream image
medium-579sl-runner-627xk runner Waiting for docker to be ready.
medium-579sl-runner-627xk runner Docker is ready.
medium-579sl-runner-627xk runner
medium-579sl-runner-627xk runner √ Connected to GitHub
medium-579sl-runner-627xk runner
medium-579sl-runner-627xk runner Current runner version: '2.325.0'
medium-579sl-runner-627xk runner 2025-06-05 17:44:22Z: Listening for Jobs
medium-579sl-runner-627xk runner An error occurred: Runner not found
medium-579sl-runner-627xk runner Runner listener exit with terminated error, stop the service, no retry needed.
medium-579sl-runner-627xk runner Exiting runner...

So it's still behaving this way for us.

mrballcb avatar Jun 05 '25 17:06 mrballcb

We seem to be running into something similar. The runners are failing with a "Runner not found" message. This is causing EphemeralRunners to be marked as failed, which seems to block new EphemeralRunners from being created. As a result, the number of runners actually running is much lower than the number of desired runners.

This started happening last week (around Wednesday, May 28) and has gotten really bad today. Most of the runners are now failing with this error. Not sure if the issues today are due to the current ongoing incident with Actions, but we are concerned that it happened last week as well and may be unrelated.

We are currently on version 2.324.0.

Example logs:

[RUNNER 2025-06-05 18:21:38Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[RUNNER 2025-06-05 18:21:38Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[RUNNER 2025-06-05 18:21:38Z INFO ProcessInvokerWrapper] Finished process 49 with exit code 103, and elapsed time 00:01:48.7124819.
[RUNNER 2025-06-05 18:21:38Z INFO JobDispatcher] finish job request for job 23e48462-9d0d-5654-8b4e-d6d5ec577f7d with result: Canceled
[RUNNER 2025-06-05 18:21:38Z INFO Terminal] WRITE LINE: 2025-06-05 18:21:38Z: Job Build / Test / Test (1) completed with result: Canceled
2025-06-05 18:21:38Z: Job Build / Test / Test (1) completed with result: Canceled
[RUNNER 2025-06-05 18:21:38Z INFO JobDispatcher] Stop renew job request for job 23e48462-9d0d-5654-8b4e-d6d5ec577f7d.
[RUNNER 2025-06-05 18:21:38Z INFO JobDispatcher] job renew has been cancelled, stop renew job 23e48462-9d0d-5654-8b4e-d6d5ec577f7d.
[RUNNER 2025-06-05 18:21:38Z INFO JobNotification] Entering JobCompleted Notification
[RUNNER 2025-06-05 18:21:38Z INFO JobNotification] Entering EndMonitor
[RUNNER 2025-06-05 18:21:38Z INFO JobDispatcher] Fire signal for one time used runner.
[RUNNER 2025-06-05 18:21:38Z INFO Runner] Deleting Runner Session...
[RUNNER 2025-06-05 18:21:38Z ERR  Terminal] WRITE ERROR: An error occurred: Runner not found
An error occurred: Runner not found
[RUNNER 2025-06-05 18:21:38Z ERR  Listener] GitHub.Services.WebApi.RunnerNotFoundException: Runner not found
[RUNNER 2025-06-05 18:21:38Z ERR  Listener]    at GitHub.Actions.RunService.WebApi.BrokerHttpClient.GetRunnerMessageAsync(Nullable`1 sessionId, String runnerVersion, Nullable`1 status, String os, String architecture, Nullable`1 disableUpdate, CancellationToken cancellationToken)
[RUNNER 2025-06-05 18:21:38Z ERR  Listener]    at GitHub.Runner.Common.BrokerServer.<>c__DisplayClass7_0.<<GetRunnerMessageAsync>b__0>d.MoveNext()
[RUNNER 2025-06-05 18:21:38Z ERR  Listener] --- End of stack trace from previous location ---
[RUNNER 2025-06-05 18:21:38Z ERR  Listener]    at GitHub.Runner.Common.RunnerService.RetryRequest[T](Func`1 func, CancellationToken cancellationToken, Int32 maxRetryAttemptsCount, Func`2 shouldRetry)
[RUNNER 2025-06-05 18:21:38Z ERR  Listener]    at GitHub.Runner.Listener.MessageListener.GetNextMessageAsync(CancellationToken token)
[RUNNER 2025-06-05 18:21:38Z ERR  Listener]    at GitHub.Runner.Listener.Runner.RunAsync(RunnerSettings settings, Boolean runOnce)
[RUNNER 2025-06-05 18:21:38Z ERR  Listener]    at GitHub.Runner.Listener.Runner.RunAsync(RunnerSettings settings, Boolean runOnce)
[RUNNER 2025-06-05 18:21:38Z ERR  Listener]    at GitHub.Runner.Listener.Runner.RunAsync(RunnerSettings settings, Boolean runOnce)
[RUNNER 2025-06-05 18:21:38Z ERR  Listener]    at GitHub.Runner.Listener.Runner.ExecuteRunnerAsync(RunnerSettings settings, Boolean runOnce)
[RUNNER 2025-06-05 18:21:38Z ERR  Listener]    at GitHub.Runner.Listener.Runner.ExecuteCommand(CommandSettings command)
[RUNNER 2025-06-05 18:21:38Z ERR  Listener]    at GitHub.Runner.Listener.Program.MainAsync(IHostContext context, String[] args)
Runner listener exit with terminated error, stop the service, no retry needed.
Exiting runner...
stream closed EOF for arc-runners/x64-16-core-djjz4-runner-vczsc (runner)

tylermarien avatar Jun 05 '25 19:06 tylermarien

Symptoms started happening without significant changes besides the base image version, as I mentioned above, after we updated:

There were no recent changes to the self-hosted infra, or runner updates, or in the workflows.

After I posted that, I found out that the update time was just a coincidence and that the actual problem behind EphemeralRunners in the Failed state is most likely on GitHub's side.

See the graph related to Failed EphemeralRunners

Image

and correlation of Failed EphemeralRunners with

Runner listener exit with terminated error, stop the service, no retry needed

and

An error occurred: Runner not found

log lines (green showing stronger correlation, red showing weaker correlation):

Image

The 2 latest posts under this bug (from @mrballcb and @tylermarien) correlate well with the time when we started seeing more failed EphemeralRunners than ever before, and this time even githubstatus.com shows an incident (which they mostly haven't reported for issues like this) :)

atsu85 avatar Jun 05 '25 19:06 atsu85

@Link-, @nikola-jokic, @TingluoHuang

This has become a serious problem that warrants immediate attention from GitHub. It's causing significant instability for many users' self-hosted runners. We've had to manually delete 50-100 failed runners per day over the past ~2 weeks due to this issue, and it's only getting worse. The screenshot below shows 150 failed runners from the last time we checked earlier today.

These messages from @Storm-BE and @patrickvinograd may help provide some insight into the problem, which seems to be related to GitHub's backend APIs and is therefore outside of our control:

  • https://github.com/actions/runner/issues/3819#issuecomment-2876870810
  • https://github.com/actions/runner/issues/3819#issuecomment-2884499460
  • https://github.com/actions/runner/issues/3819#issuecomment-2884817787

Please help us get this issue resolved ASAP.

Image

ajschmidt8 avatar Jun 05 '25 20:06 ajschmidt8

Upvoting this; it is impacting a lot of users. We are also seeing a huge increase in EphemeralRunners in the Failed state. We are also failing to understand which runners go into this Failed state. From my observation, they just get stuck randomly.

We have ended up creating cron jobs to delete the Failed resources, but that's not a great solution. This started happening randomly and I haven't made any changes to the config on our end.
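
For reference, such a cleanup can be a one-liner along these lines (a sketch; the arc-runners namespace and the .status.phase field are assumptions about the ARC EphemeralRunner CRD):

# Delete EphemeralRunners stuck in the Failed state.
kubectl get ephemeralrunners -n arc-runners -o json \
  | jq -r '.items[] | select(.status.phase == "Failed") | .metadata.name' \
  | xargs -r -n1 kubectl delete ephemeralrunner -n arc-runners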

pulkitanz avatar Jun 06 '25 00:06 pulkitanz

Same here

zwindler avatar Jun 06 '25 08:06 zwindler

Apologies for the delay on this, we're actively investigating this with high urgency.

Link- avatar Jun 06 '25 10:06 Link-

Apologies for the delay on this, we're actively investigating this with high urgency.

Good to hear. Since there are many linked issues, where should we expect updates on this?

tmablunar avatar Jun 06 '25 12:06 tmablunar

We've identified the issue in our API: we have a race condition where, at times, jobs assigned to ephemeral runners get cleaned up before the runner's last poll, and we respond with a 404. This 404 is intended to reject polls from already-used ephemeral runners and avoid them being considered for new jobs, but in these cases it is happening prematurely.

We'll have a fix ready shortly; I'll update here when it's fully rolled out. Sorry about this and thank you all for your patience ❤

rentziass avatar Jun 06 '25 15:06 rentziass

Our fix is now fully rolled out, and new runners should now stop failing (especially those that ran a job that was canceled) due to Runner not found errors. Please let us know if you see any repeats and thanks again for your patience ❤

rentziass avatar Jun 06 '25 20:06 rentziass

@rentziass @Link-

The issue of Failed EphemeralRunners persisting is gone. They do get deleted, but we are now seeing a different issue for some of the runners.

There are a lot of runner pods that are not getting deleted after finishing. The jobs on these pods have finished, but the EphemeralRunner is still stuck in the Running state. These are the logs I can see on the pod:

[RUNNER 2025-06-10 03:45:46Z ERR GitHubActionsService] POST request to https://run-actions-1-azure-eastus.actions.githubusercontent.com/51/acquirejob failed. HTTP Status: NotFound
[RUNNER 2025-06-10 03:45:46Z INFO Runner] Skipping message Job. Job message not found 'e7ba9489-0333-5830-b74b-9ee48f4fe123'. job not found

Once this happens, the runner goes into an endless loop of

[RUNNER 2025-06-10 03:45:47Z INFO ConfigurationStore] HasCredentials()
[RUNNER 2025-06-10 03:45:47Z INFO ConfigurationStore] stored True
[RUNNER 2025-06-10 03:45:47Z INFO CredentialManager] GetCredentialProvider
[RUNNER 2025-06-10 03:45:47Z INFO CredentialManager] Creating type OAuth
[RUNNER 2025-06-10 03:45:47Z INFO CredentialManager] Creating credential type: OAuth

This is only happening to about 50% of the jobs. It is leaving a lot of pods that should have been cleaned up stuck instead. Are there still any ongoing issues with the GitHubActionsService?

pulkitanz avatar Jun 10 '25 14:06 pulkitanz

@pulkitanz thanks for the report. I don't see a particular uptick in 404s for that endpoint. Do you have workflow run URLs you are able to share for jobs that ran on runners that got stuck? Those would really help a ton; if you can't share them publicly, can you send them to me at rentziass at github dot com please?

rentziass avatar Jun 10 '25 15:06 rentziass

Apologies, @rentziass. After digging more into these stuck EphemeralRunners, I can see that no jobs were picked up by the pod since startup. I will correct myself here.

On startup, the pod connects to GitHub and starts listening for jobs, then we see the above errors and it goes into a never-ending loop. The interesting bit is that the EphemeralRunner ends up in the Running state rather than Failed.

For a normal, working job run, the message is something like:

Job request 0 for plan 8e9b3a43-6b53-4fbe-9bf0-2a147d41ca2ae job a895ce2c-be9b-5c11-8f3f-ca7h3d12b5vd9 received.

For a stuck EphemeralRunner, I am seeing:

Skipping message Job. Job message not found 'e7bb9489-0233-5831-b74b-9ea48f4fe123'. job not found

pulkitanz avatar Jun 10 '25 16:06 pulkitanz

We are running into a lot of similar issues with this now. We are seeing 2 different errors. This is causing nearly all of our runners to fail and jobs to get backed up.

The first successfully completes the Job but then fails and causes the Pod to not shut down properly. The error looks like this:

[RUNNER 2025-06-10 21:32:38Z INFO JobDispatcher] Worker finished for job 55a0cdf0-ff04-58b9-9ff6-d09fa3efe876. Code: 100
[RUNNER 2025-06-10 21:32:38Z INFO JobDispatcher] finish job request for job 55a0cdf0-ff04-58b9-9ff6-d09fa3efe876 with result: Succeeded
[RUNNER 2025-06-10 21:32:38Z INFO Terminal] WRITE LINE: 2025-06-10 21:32:38Z: Job checks completed with result: Succeeded
2025-06-10 21:32:38Z: Job checks completed with result: Succeeded
[RUNNER 2025-06-10 21:32:38Z INFO JobDispatcher] Stop renew job request for job 55a0cdf0-ff04-58b9-9ff6-d09fa3efe876.
[RUNNER 2025-06-10 21:32:38Z INFO JobDispatcher] job renew has been cancelled, stop renew job 55a0cdf0-ff04-58b9-9ff6-d09fa3efe876.
[RUNNER 2025-06-10 21:32:38Z INFO JobNotification] Entering JobCompleted Notification
[RUNNER 2025-06-10 21:32:38Z INFO JobNotification] Entering EndMonitor
[RUNNER 2025-06-10 21:32:38Z INFO MessageListener] Received job status event. JobState: Online
[RUNNER 2025-06-10 21:32:38Z INFO JobDispatcher] Fire signal for one time used runner.
[RUNNER 2025-06-10 21:32:38Z INFO Runner] Job has finished at backend, the runner will exit since it is running under onetime use mode.
[RUNNER 2025-06-10 21:32:38Z INFO Runner] Stop message queue looping.
[RUNNER 2025-06-10 21:32:38Z WARN GitHubActionsService] GET request to https://broker.actions.githubusercontent.com/message?sessionId=be21f4e5-c460-46c2-928a-b4f983b94110&status=Busy&runnerVersion=2.324.0&os=Linux&architecture=X64&disableUpdate=true has been cancelled.

Then there is a different error that appears to happen before the job starts and causes the Job to fail and the Pod to not shut down properly. That error looks like:

[RUNNER 2025-06-10 21:12:08Z ERR  GitHubActionsService] POST request to https://pipelinesghubeus22.actions.githubusercontent.com/npk8Tyy7iEmzYdj6qlVQf0NQBis55Zt6C2D54jWMfRe24YFcrK/_apis/oauth2/token failed. HTTP Status: BadRequest
[RUNNER 2025-06-10 21:12:08Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener] Catch exception during create session.
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener] GitHub.Services.OAuth.VssOAuthTokenRequestException: Registration 3892bd78-2553-4a10-a32a-1e0c95c38743 was not found.
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener]    at GitHub.Services.OAuth.VssOAuthTokenProvider.OnGetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener]    at GitHub.Services.Common.IssuedTokenProvider.GetTokenOperation.GetTokenAsync(VssTraceActivity traceActivity)
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener]    at GitHub.Services.Common.IssuedTokenProvider.GetTokenAsync(IssuedToken failedToken, CancellationToken cancellationToken)
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener]    at GitHub.Services.Common.VssHttpMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener]    at GitHub.Services.Common.VssHttpRetryMessageHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener]    at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener]    at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken)
[RUNNER 2025-06-10 21:12:08Z ERR  MessageListener]    at GitHub.Runner.Listener.MessageListener.CreateSessionAsync(CancellationToken token)
[RUNNER 2025-06-10 21:12:08Z ERR  Terminal] WRITE ERROR: Failed to create a session. The runner registration has been deleted from the server, please re-configure. Runner registrations are automatically deleted for runners that have not connected to the service recently.
Failed to create a session. The runner registration has been deleted from the server, please re-configure. Runner registrations are automatically deleted for runners that have not connected to the service recently.

tylermarien avatar Jun 10 '25 21:06 tylermarien

Echoing this (I think it's the same as the last couple of comments): we are seeing a new failure mode where we now have a large number of runners showing in "Running" status for hours, but with no JOBWORKFLOWREF assigned, and therefore seemingly doing no work.

Looking at the logs, it also appears that we see this log message from the scale-set-controller:

Job completed message received.

before the runner is even started up:

√ Connected to GitHub
Current runner version: '2.323.0'

patrickvinograd avatar Jun 11 '25 23:06 patrickvinograd

This is even worse than the previous failure mode because the pods stay running/consuming resources on our cluster.

patrickvinograd avatar Jun 11 '25 23:06 patrickvinograd

The fix that I used is to create a cron job that cleans up all the EphemeralRunner resources which are stuck in a Running state, have no jobRepositoryName assigned, and are older than a threshold (the snippet below uses one hour):

kubectl get ephemeralrunners -n <namespace-where-runners-live> -o json | jq -r '
  .items[]
  | select(
      (.status.jobRepositoryName == null or .status.jobRepositoryName == "") and
      ((now - (.metadata.creationTimestamp | fromdateiso8601)) > (1*60*60))
    )
  | "kubectl delete ephemeralrunner \(.metadata.name) -n \(.metadata.namespace)"
' | bash

Deleting Stuck Ephemeral Runners has helped us, we are waiting for a fix from GitHub end.

pulkitanz avatar Jun 12 '25 02:06 pulkitanz

I'd like to comment that:

  1. The issue related to Failed EphemeralRunners hasn't appeared for us since @rentziass reported that it should be fixed.
  2. We haven't noticed the other issues mentioned after that in this GH issue:
  • no logs with the messages that @pulkitanz mentioned
  • although our logs do also contain the same messages that @tylermarien mentioned,

Failed to create a session. The runner registration has been deleted from the server, please re-configure. Runner registrations are automatically deleted for runners that have not connected to the service recently.

they don't seem to cause big issues - maybe a longer GHA job pickup time (but I didn't bother checking it)?
  3. The script that @pulkitanz mentioned would also kill runners that are waiting idle due to minRunners on an AutoscalingRunnerSet. The script didn't find anything useful for us, even for runners that don't have minRunners configured.

atsu85 avatar Jun 13 '25 13:06 atsu85

Hey everyone, sorry for the lack of updates on this one: resolving the 404 that would kill runners revealed another bug. If a job was assigned to an ephemeral runner but got cancelled before the runner picked it up, the runner on its own wouldn't know it had been "used", but assignment wouldn't consider it again for other jobs. We finished rolling out a backend fix for this yesterday (14:31 UTC): we now delete these runners when their job completes, regardless of whether they started it or not.

For ARC runners this means ARC will automatically clean those runners up, while standalone runners will eventually be unable to refresh their session and will shut down. If you have any remaining runners in this state, they will be automatically cleaned up and fall under that behaviour in the next ~24 hours.

Longer term we also want to introduce a way to gracefully tell runners they should shut down (e.g. handling the 404 without throwing).

Once again sorry this impacted you and thank you for your patience ❤

rentziass avatar Jun 13 '25 14:06 rentziass