runner icon indicating copy to clipboard operation
runner copied to clipboard

Self hosted runners failing with WRITE ERROR: A session for this runner already exists

Open soumik-avoma opened this issue 1 year ago • 15 comments

We are using self-hosted ephimeral runners for our github jobs. Recently we are seeing jobs failing randomly with the error This job failed on the UI. On further digging we found out that it's trying to run the job on a runner which is already in Active state instead of Idle. We also saw the following logs.

ERR Terminal] WRITE ERROR: A session for this runner already exists It gets stuck in this process for long and then fails. Ex:

Waiting for a runner to pick up this job...
Job is about to start running on the runner: ip-10-0-87-43 (organization)

We even tried upgrading the action agent to v2.319.1 , but the issue still persists. All of our runners have the same labels hence the job could easily pickup any runner in Idle state, but instead its trying to pickup a runner which is in Active state.

soumik-avoma avatar Aug 26 '24 12:08 soumik-avoma

Hi @soumik-avoma, do you have a link to a workflow run where this issue is occurring?

luketomlinson avatar Aug 26 '24 13:08 luketomlinson

Hi @luketomlinson, Since the workflow are private, hence won't be able to share that. But here are the logs:

2024-08-26 16:05:53.816 | [2024-08-26 10:35:53Z INFO MessageListener] Sleeping for 30 seconds before retrying. |  
-- | -- | --
  |   | 2024-08-26 16:05:53.816 | [2024-08-26 10:35:53Z INFO MessageListener] The session conflict exception haven't reached retry limit. |  
  |   | 2024-08-26 16:05:53.816 | [2024-08-26 10:35:53Z ERR  Terminal] WRITE ERROR: A session for this runner already exists. |  
  |   | 2024-08-26 16:05:53.816 | [2024-08-26 10:35:53Z INFO MessageListener] The session for this runner already exists. |  
  |   | 2024-08-26 16:05:53.816 | at GitHub.Runner.Listener.MessageListener.CreateSessionAsync(CancellationToken token) |  
  |   | 2024-08-26 16:05:53.816 | at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpMethod method, IEnumerable`1 additionalHeaders, Guid locationId, Object routeValues, ApiResourceVersion version, HttpContent content, IEnumerable`1 queryParameters, Object userState, CancellationToken cancellationToken) |  
  |   | 2024-08-26 16:05:53.816 | at GitHub.Services.WebApi.VssHttpClientBase.SendAsync[T](HttpRequestMessage message, Object userState, CancellationToken cancellationToken) |  
  |   | 2024-08-26 16:05:53.816 | at GitHub.Services.WebApi.VssHttpClientBase.SendAsync(HttpRequestMessage message, HttpCompletionOption completionOption, Object userState, CancellationToken cancellationToken) |  
  |   | 2024-08-26 16:05:53.816 | at GitHub.Services.WebApi.VssHttpClientBase.HandleResponseAsync(HttpResponseMessage response, CancellationToken cancellationToken) |  
  |   | 2024-08-26 16:05:53.816 | [2024-08-26 10:35:53Z ERR  MessageListener] GitHub.DistributedTask.WebApi.TaskAgentSessionConflictException: The actions runner ip-10-0-74-151-1724668474 already has an active session. |  
  |   | 2024-08-26 16:05:53.816 | [2024-08-26 10:35:53Z ERR  MessageListener] Catch exception during create session. |  
  |   | 2024-08-26 16:05:53.816 | [2024-08-26 10:35:53Z ERR  GitHubActionsService] POST request to https://pipelinesghubeus2.actions.githubusercontent.com/85uueRDPTlsbvfv9f8kaXsAIyvEp4gmsRFQMwFV501YbBtzszu/_apis/distributedtask/pools/1/sessions failed. HTTP Status: Conflict |  
  |   | 2024-08-26 16:05:53.566 | [2024-08-26 10:35:53Z INFO Terminal] WRITE LINE: |  
  |   | 2024-08-26 16:05:53.566 | [2024-08-26 10:35:53Z INFO Terminal] WRITE LINE: |  
  |   | 2024-08-26 16:05:53.566 | [2024-08-26 10:35:53Z INFO MessageListener] VssConnection created |  
  |   | 2024-08-26 16:05:52.813 | [2024-08-26 10:35:52Z INFO GitHubActionsService] Finished operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:52.813 | [2024-08-26 10:35:52Z INFO GitHubActionsService] Finished operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:52.813 | [2024-08-26 10:35:52Z INFO GitHubActionsService] Finished operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:52.312 | [2024-08-26 10:35:52Z INFO GitHubActionsService] Starting operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:52.312 | [2024-08-26 10:35:52Z INFO RunnerServer] Establish connection with 60 seconds timeout. |  
  |   | 2024-08-26 16:05:52.312 | [2024-08-26 10:35:52Z INFO RunnerServer] EstablishVssConnection |  
  |   | 2024-08-26 16:05:52.312 | [2024-08-26 10:35:52Z INFO GitHubActionsService] Starting operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:52.312 | [2024-08-26 10:35:52Z INFO RunnerServer] Establish connection with 60 seconds timeout. |  
  |   | 2024-08-26 16:05:52.312 | [2024-08-26 10:35:52Z INFO RunnerServer] EstablishVssConnection |  
  |   | 2024-08-26 16:05:52.312 | [2024-08-26 10:35:52Z INFO GitHubActionsService] Starting operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:52.312 | [2024-08-26 10:35:52Z INFO RunnerServer] Establish connection with 100 seconds timeout. |  
  |   | 2024-08-26 16:05:52.312 | [2024-08-26 10:35:52Z INFO RunnerServer] EstablishVssConnection |  
  |   | 2024-08-26 16:05:52.312 | [2024-08-26 10:35:52Z INFO MessageListener] Connecting to the Runner Server... |  
  |   | 2024-08-26 16:05:52.312 | [2024-08-26 10:35:52Z INFO MessageListener] Attempt to create session. |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO Listener] Runner execution has finished with return code 0 |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper] Finished process 2052 with exit code 0, and elapsed time 00:00:00.0011824. |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished. |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished. |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper] Process started with process id 2052, waiting for process exit. |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper] Updated oom_score_adj to 500 for PID: 2052. |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper]   High priority process: 'False' |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper]   Keep redirected STDIN open: 'False' |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper]   Persist current code page: 'False' |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper]   Redirected STDIN: 'False' |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper]   Force kill process on cancellation: 'False' |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper]   Encoding web name:  ; code page: '' |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper]   Require exit code zero: 'True' |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper]   Working directory: '/home/ubuntu/actions-runner' |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper]   Arguments: '755 "/home/ubuntu/actions-runner/svc.sh"' |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper]   File name: '/usr/bin/chmod' |  
  |   | 2024-08-26 16:05:25.232 | [2024-08-26 10:34:56Z INFO ProcessInvokerWrapper] Starting process: |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO UnixUtil] Running /usr/bin/chmod 755 "/home/ubuntu/actions-runner/svc.sh" |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO UnixUtil] Location: '/usr/bin/chmod' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO UnixUtil] Which2: 'chmod' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO HostContext] Well known directory 'Root': '/home/ubuntu/actions-runner' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/actions-runner/bin' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/actions-runner/bin' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO HostContext] Well known directory 'Root': '/home/ubuntu/actions-runner' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO HostContext] Well known directory 'Bin': '/home/ubuntu/actions-runner/bin' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO SystemDControlManager] Service name 'actions.runner.ORG_NAME.ip-10-0-74-151-1724668474.service' display name 'GitHub Actions Runner (ORG_NAME.ip-10-0-74-151-1724668474)' will be used for service configuration. |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO Terminal] WRITE LINE: |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO Terminal] WRITE LINE: |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO ConfigurationStore] Settings Saved. |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO ConfigurationStore] Saving runner settings. |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO PromptManager] ReadValue |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO CommandSettings] Flag 'unattended': 'True' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:56Z INFO CommandSettings] Arg 'work': '' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:55Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:55Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/ubuntu/actions-runner/.credentials_rsaparams |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:55Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/ubuntu/actions-runner/.credentials_rsaparams |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO GitHubActionsService] Finished operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO GitHubActionsService] Finished operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO GitHubActionsService] Finished operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO GitHubActionsService] Starting operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO RunnerServer] Establish connection with 60 seconds timeout. |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO RunnerServer] EstablishVssConnection |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO GitHubActionsService] Starting operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO RunnerServer] Establish connection with 60 seconds timeout. |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO RunnerServer] EstablishVssConnection |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO GitHubActionsService] Starting operation Location.GetConnectionData |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO RunnerServer] Establish connection with 100 seconds timeout. |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:54Z INFO RunnerServer] EstablishVssConnection |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO RSAFileKeyManager] Loading RSA key parameters from file /home/ubuntu/actions-runner/.credentials_rsaparams |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO CredentialManager] Creating credential type: OAuth |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO CredentialManager] Creating type OAuth |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO CredentialManager] GetCredentialProvider |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO ConfigurationStore] stored True |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO ConfigurationStore] HasCredentials() |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO ConfigurationStore] Credentials Saved. |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO ConfigurationStore] Saving OAuth credential @ /home/ubuntu/actions-runner/.credentials |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO CommandSettings] Flag 'ephemeral': 'True' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO CommandSettings] Flag 'ephemeral': 'True' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO CommandSettings] Flag 'disableupdate': 'False' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO CommandSettings] Flag 'no-default-labels': 'False' |  
  |   | 2024-08-26 16:05:25.231 | [2024-08-26 10:34:53Z INFO CommandSettings] Flag 'disableupdate': 'False'

soumik-avoma avatar Aug 26 '24 14:08 soumik-avoma

👋Hi @soumik-avoma From our logs, I see that runner attempting to create a session twice. Once at 2024-08-26T10:35:13UTC and again 11 seconds later at 2024-08-26T10:35:24UTC. The 2nd one results in the conflict you were seeing. Is it possible in your setup/automation there is something causing ./run.sh to be called multiple times?

luketomlinson avatar Aug 26 '24 21:08 luketomlinson

This is what we use for registering the runner. We do not call run.sh anywhere explicitly.

    payload=$(curl -sX POST -H "Authorization: token $GITHUB_PAT" https://api.github.com/orgs/ORG_NAME/actions/runners/registration-token)
    export RUNNER_TOKEN=$(echo $payload | jq .token --raw-output)
    mkdir -p actions-runner && cd actions-runner
    curl -o actions-runner-linux-$ARCH-2.319.1.tar.gz -L https://github.com/actions/runner/releases/download/v2.319.1/actions-runner-linux-$ARCH-2.319.1.tar.gz
    echo "03d993c65e0c4daa5e3bf5a5a35ba356f363bdb5ceb6b5808fd52fdb813dd8e8  actions-runner-linux-$ARCH-2.319.1.tar.gz" | shasum -a 256 -c
    tar xzf ./actions-runner-linux-$ARCH-2.319.1.tar.gz
    cd /home/ubuntu/actions-runner
    ./config.sh \
        --name $(hostname) \
        --token $RUNNER_TOKEN \
        --url https://github.com/ORG_NAME \
        --unattended \
        --labels $PURPOSE \
        --replace \
        --ephemeral
    sudo ./svc.sh install

soumik-avoma avatar Aug 27 '24 04:08 soumik-avoma

Hi @soumik-avoma please try at least doubling the memory of your runner. It happens, that after an OOM event, the runner restarts. And then it tries to create a new session whereas the previous session was not gracefully ended.

Donnie avatar Oct 02 '24 10:10 Donnie

Hi @soumik-avoma please try at least doubling the memory of your runner. It happens, that after an OOM event, the runner restarts. And then it tries to create a new session whereas the previous session was not gracefully ended.

How can we verify this with logs or other hints?

liriID avatar Dec 12 '24 09:12 liriID

We have the same situation in our self hosted runners windows

fgendorf avatar Jan 24 '25 14:01 fgendorf

Running into the same issue with macOS runners. Action runners keep randomly disappearing from github and if I check the local logs, I see they failed like this a few days before.

kaybutter avatar Jan 31 '25 13:01 kaybutter

We had a similar issue. We run the ARC in EKS and deleting the pod wasn't helping because we kept seeing that A session for this runner already exists.

Manually removing the runner (not the runner scale set) helped. After the pod went down, a new one was spawned, a new runner was registered and it started accepting jobs.

wozniakpl avatar Feb 25 '25 10:02 wozniakpl

Same issue, subscribed

AndreySvinarenko avatar Apr 11 '25 11:04 AndreySvinarenko

I'm having the exact same issue like wozniakpl

kimxogus avatar May 15 '25 03:05 kimxogus

Adding more memory to our runners resolved the issue for me, even though I noticed that OOM (Out Of Memory) errors aren't handled as gracefully as I would have expected. Just wanted to share in case it helps others facing similar problems.

alexisloiselle avatar Jul 08 '25 17:07 alexisloiselle

I also had some success increasing memory (after I had tried to remove all ARC related components from my kubernetes cluster) and deleting all idle/offline runners on the github side.

If resources are the issue, would it be possible to make this very obvious in a log/event/error?

piyat avatar Aug 20 '25 13:08 piyat

Increasing memory is only kicking the can down the road. If there's a power outage, network partition, etc. then things are going to be non-gracefully disconnected from GH. The fact that ./config.sh has --replace but then ./run.sh still fails is a failure of the Github API design and only Github can fix that.

Which it hasn't, for several years now :|

Qix- avatar Nov 20 '25 12:11 Qix-

same issue on macOS

julyfun avatar Nov 24 '25 13:11 julyfun