Strange behaviour with CI/CD
Hello,
I've recently upgraded from 13.7 to 14.10.3, taking a lot of intermediate steps along the way. All my projects are fine and everything seems stable except for the CI/CD area. I can't say I'm massively experienced with runners, but I have registered two of them successfully. However, when I click on one in the Admin/Runners area, the new page doesn't load, and the same happens when I click the edit button. They show up as last in contact "just now".
When I assign a runner to a project, or use a shared runner, and try to run a job, I get "The scheduler failed to assign job to the runner, please try again or contact system administrator". The runner's logs don't show anything: no sign of the job, but no errors either.
I'm also getting conflicting messages about the validity of my .gitlab-ci.yml: the GitLab editor says it's valid, but on execution it is reported as invalid.
Any guidance would be appreciated. I don't have a huge amount on this server, so nuking it and restarting from scratch is an option. It feels like something is wrong at the backend, maybe with the jobs and runners tables?
Edit: I made a new GitLab instance and a new runner with the same settings as below, and it works. Any advice on getting my original instance back up and running would be very welcome though.
version: '3'
services:
  gitlab-runner:
    image: gitlab/gitlab-runner:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./config:/etc/gitlab-runner
    restart: unless-stopped
concurrent = 1
check_interval = 0
[session_server]
  session_timeout = 1800
[[runners]]
  name = "test"
  url = "redacted address"
  token = "bsxW8-TySKikXJnCDirt"
  executor = "docker"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "gradle:jdk11-jammy"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
image: gradle:jdk11-jammy
variables:
  GRADLE_OPTS: "-Dorg.gradle.daemon=false"
before_script:
  - export GRADLE_USER_HOME=`pwd`/.gradle
build:
  stage: build
  script: gradle --build-cache assemble
  cache:
    key: "$CI_COMMIT_REF_NAME"
    policy: push
    paths:
      - build
      - .gradle
  tags:
    - test
I've seen something similar in the past: all "git" operations and, on the surface, most Web UI features worked, but like you I saw the GitLab runners erroring all over the place (especially "The scheduler failed to assign job to the runner, please try again or contact system administrator"). I also found that navigating to Admin / Settings / CI-CD in the Web UI generated a 500 error.
In the end, I reverted to an older "known to be good" backup and walked through the upgrade(s) again, testing at each step to ensure the runners were working. This got me to my desired endpoint (i.e. GitLab 15.0.1, PostgreSQL 13.7).
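For anyone repeating this kind of stepwise upgrade, here is a minimal Python sketch of how the intermediate stops could be planned before starting. The `ASSUMED_STOPS` list is an assumption recalled from GitLab's published upgrade path for these releases, not something taken from this thread, so please check it against the official upgrade documentation. `plan_upgrade` is a hypothetical helper name.

```python
from typing import List, Tuple

Version = Tuple[int, int, int]

# ASSUMPTION: required stops recalled from GitLab's upgrade-path docs.
# Verify against the official documentation before relying on this list.
ASSUMED_STOPS = [
    "13.8.8", "13.12.15", "14.0.12", "14.3.6", "14.9.5", "14.10.5", "15.0.5",
]


def parse(version: str) -> Version:
    """Turn '14.10.3' (or a short form such as '13.7') into a comparable tuple."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))  # pad a missing patch level with 0
    return (parts[0], parts[1], parts[2])


def plan_upgrade(current: str, target: str,
                 stops: List[str] = ASSUMED_STOPS) -> List[str]:
    """Return the versions to install in order, ending with the target."""
    cur, tgt = parse(current), parse(target)
    if tgt <= cur:
        raise ValueError("target must be newer than the current version")
    # Keep only the stops strictly between the current and target versions.
    path = [s for s in sorted(stops, key=parse) if cur < parse(s) < tgt]
    path.append(target)
    return path


if __name__ == "__main__":
    # After each step: let migrations finish, then run a small test pipeline.
    for step in plan_upgrade("13.7", "14.10.3"):
        print("upgrade to", step, "then test a runner job")
```

The idea is simply to make each stop explicit, so that a runner test can be done after every one of them rather than only at the end.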
Although I'm not qualified to say for sure, my gut feeling was that somewhere between the "known to be good" position and the post-upgrade state where the runners were failing, the database had fallen out of sync with the app. It was almost as if the app upgrade hadn't applied the database changes, or hadn't flushed the app-side cache, or something along those lines.
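One way to test that hunch is to look for database migrations that were never applied. On an Omnibus or Docker install, `gitlab-rake db:migrate:status` prints one row per migration with a status of `up` or `down`. The sketch below is only an illustration of filtering that output for `down` rows; `pending_migrations` is a hypothetical helper, and it assumes the usual Rails column layout of status, migration ID, then name.

```python
import re
from typing import List, Tuple

# One row of `db:migrate:status` output: status, numeric migration ID, name.
ROW = re.compile(r"^\s*(up|down)\s+(\d+)\s+(.*\S)\s*$")


def pending_migrations(status_output: str) -> List[Tuple[str, str]]:
    """Return (migration_id, name) for every row whose status is 'down'."""
    pending = []
    for line in status_output.splitlines():
        match = ROW.match(line)
        if match and match.group(1) == "down":
            pending.append((match.group(2), match.group(3)))
    return pending


if __name__ == "__main__":
    import sys
    # Usage (assumed): gitlab-rake db:migrate:status | python3 this_script.py
    for migration_id, name in pending_migrations(sys.stdin.read()):
        print(migration_id, name)
```

If this lists anything, the schema is behind the application code, which would fit the symptoms described here. An empty result does not rule out other causes, such as unfinished background migrations.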
I never worked out how to fix the root cause but, as I said, a (slow) methodical multi-step upgrade with runner tests at each step got me there in the end.