node-test-pull-request is stuck

Open cjihrig opened this issue 3 years ago • 9 comments

It looks like https://ci.nodejs.org/job/node-test-pull-request/43286/ is the first job that is stuck. The reason appears to be node-test-binary-arm-12+. Most of the subsequent PR jobs appear to be stuck because of this.

cjihrig avatar Apr 03 '22 18:04 cjihrig

It's not stuck, it's just really backed up because we're down to just two online Pi 2's.

cc @rvagg

[screenshot: Jenkins node status for the Pi workers]

richardlau avatar Apr 03 '22 18:04 richardlau

As an aside, I've just added a new https://ci.nodejs.org/job/node-test-binary-armv7l/ job to https://ci.nodejs.org/job/node-test-commit-arm-fanned/. It's similar to https://ci.nodejs.org/job/node-test-binary-arm-12+/ but runs the tests in armv7l containers on the Equinix-hosted Ampere Altra arm64 machines instead of the Pi's.

The idea for now is to run both https://ci.nodejs.org/job/node-test-binary-arm-12+/ and https://ci.nodejs.org/job/node-test-binary-armv7l/ in parallel (i.e. test in the containers and on the Pi's), but it also gives us some form of armv7l test coverage if we have to temporarily disable https://ci.nodejs.org/job/node-test-binary-arm-12+/ because the Pi's are unavailable.
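
For illustration, a minimal sketch of the underlying mechanism, assuming a recent Docker with `--platform` support and using a placeholder image rather than the actual CI test image: an armv7l container started on an arm64 host reports an armv7l userland.

```python
# Minimal sketch (not the actual CI job configuration): run a command inside
# an armv7l container on an arm64 host via Docker's --platform flag.
import subprocess

result = subprocess.run(
    [
        "docker", "run", "--rm",
        "--platform", "linux/arm/v7",   # request the 32-bit ARM variant
        "arm32v7/debian:bullseye",      # placeholder image, not the real CI image
        "uname", "-m",                  # an armv7l userland reports "armv7l"
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```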

richardlau avatar Apr 03 '22 19:04 richardlau

For anyone else in @nodejs/jenkins-admins, if the remaining two Pi 2's go offline, disable https://ci.nodejs.org/job/node-test-binary-arm-12+/ by pressing the blue "disable project" button on the job's main page and comment in this issue to note that you have done so:

[screenshot: the blue "Disable Project" button on the Jenkins job page]

That won't unstick any in-progress jobs (but it will prevent future jobs from being scheduled) -- you'd either have to abort those or wait for the Pi 2's to come back online and the job to be re-enabled.
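
If the UI is hard to reach, the same thing can be done through Jenkins' standard job enable/disable REST endpoints; a minimal sketch, assuming a user name and API token with permission to configure the job (the environment variable names below are placeholders):

```python
# Minimal sketch: disable or re-enable a Jenkins job via its REST API,
# as an alternative to the "Disable Project" button.
import os
import requests

JENKINS_URL = "https://ci.nodejs.org"
# Placeholder env vars holding a Jenkins user name and API token.
AUTH = (os.environ["JENKINS_USER"], os.environ["JENKINS_TOKEN"])

def set_job_enabled(job_name: str, enabled: bool) -> None:
    """POST to the job's /enable or /disable endpoint."""
    action = "enable" if enabled else "disable"
    resp = requests.post(f"{JENKINS_URL}/job/{job_name}/{action}", auth=AUTH)
    resp.raise_for_status()

if __name__ == "__main__":
    set_job_enabled("node-test-binary-arm-12+", enabled=False)
```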

richardlau avatar Apr 03 '22 19:04 richardlau

Unless it got fixed and then somehow recurred, I think this has been going on since Tuesday.

Trott avatar Apr 03 '22 20:04 Trott

> Unless it got fixed and then somehow recurred, I think this has been going on since Tuesday.

And I now see @richardlau dropped a link to this issue from that conversation.

Trott avatar Apr 03 '22 20:04 Trott

Okie dokie! Maintenance is pretty overdue on the cluster. I just did a hard reboot of the whole lot and we have a bunch of 3's and 2's back online.

My TODO involves doing system updates (I'll start that now) and then cleaning out the docker containers and replacing them with newer ones that @richardlau has coded up for us recently.
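
For reference, a rough sketch of that kind of cleanup, assuming plain Docker CLI commands and a placeholder image name rather than the actual images @richardlau prepared:

```python
# Rough sketch: remove stopped containers and dangling images, then pull a
# newer image. The image name is a placeholder, not the real CI image.
import subprocess

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("docker", "container", "prune", "-f")         # drop stopped containers
run("docker", "image", "prune", "-f")             # drop dangling images
run("docker", "pull", "arm32v7/debian:bullseye")  # placeholder newer image
```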

rvagg avatar Apr 04 '22 02:04 rvagg

I've been back working on this. A bunch of nodes dropped offline during the update (and not long afterward), so we have updates that haven't completed. I'm re-running the process; it's long and tedious, and I'll need to keep restarting nodes as I go, so this may be disruptive for the next day or two while I keep chipping away at it.

rvagg avatar Apr 06 '22 10:04 rvagg

OK, I've done the updates I wanted for this initial phase and I'll let them run as they are for now. I probably won't get around to adding the new Docker container for another week at least (I'm out completely next week). We're currently down 4 each of the 2's and 3's, but the others will hopefully be stable enough; we'll see though!

rvagg avatar Apr 07 '22 01:04 rvagg

Is this something we can close? Or not yet?

Trott avatar Jun 12 '22 14:06 Trott