Slurm doesn't start on RHEL 8.10 node
[issue]
Related to https://github.com/clustervision/trinityX/issues/416
OS: RHEL 8.10
TrinityX version: '15'
[cause] unknown
[remarkable] should there be config files listed here??
[reproduce]
playbooks executed:
ansible-playbook -i hosts controller.yml -vv -k
ansible-playbook default-compute.yml -v
Boot node --> node booted --> slurmd failed to start.
Not sure why the Slurm config is missing on the node...
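A quick way to confirm the failure and the missing config on the node (a sketch; /etc/slurm/slurm.conf is the usual location for the OpenHPC packages and is an assumption here, not something verified on this setup):

systemctl status slurmd
journalctl -u slurmd --no-pager | tail -n 40
ls -l /etc/slurm/slurm.conf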
Manually added the Slurm config from the controller to the node.
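Roughly like this (the 'controller' hostname and the /etc/slurm path are assumptions, adjust to your own layout):

scp controller:/etc/slurm/slurm.conf /etc/slurm/slurm.conf
systemctl restart slurmd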
Slurm now gets further, until munge issues kick in:
munged: Error: Failed to check keyfile "/trinity/shared/etc/mung...
/trinity/... isn't being shared at all. Possibly due to having this one on an external NFS share instead of on the controller itself.
What exactly do you mean by "having this one on an external NFS share"? Did you also mount this share on the controller at the correct path at install time? And is it properly mounted on all of your compute nodes? Most of the installation depends on having the /trinity/shared path mounted on the nodes during boot.
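A quick way to verify this on the controller and on a compute node would be something along these lines (standard NFS/util-linux tooling, nothing TrinityX-specific; replace 'nfs-server' with whatever host actually exports the share):

findmnt /trinity/shared
mount | grep trinity
showmount -e nfs-server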
Hi Bart,
In this case, I am testing with the entire /trinity folder on an external NFS share, as we are implementing an HA setup. It seems (obviously) that the controller won't re-share that.
I am therefore testing with the new Luna roles to mount this (I haven't got AWX set up yet for an ansible-callback daemon).
I am wondering whether certain mounts would be better off handled as a pre-, part- or postscript.
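As a very rough sketch of what such a script could do on a node (the server address, export path and mount options are just placeholders reusing values from the config example below, and how this would hook into the Luna pre-/part-/postscripts is an assumption on my side):

mkdir -p /trinity/shared
mount -t nfs -o nfsvers=4,rw,retrans=4,_netdev 192.168.0.1:/trinity/shared /trinity/shared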
# When 'type' is set to 'manual', this indicates that this is being
# taken care of outside TrinityX
This means that the administrator is responsible for mounting all the relevant filesystems on the nodes.
# - name: '{{ trix_home }}'
# mount: '{{ trix_home }}'
# type: 'nfs'
# remote: '192.168.0.1:/homes'
# options: 'nfsvers=4,rw,retrans=4,_netdev'
The config example above, on the other hand, gives you a setup in which TrinityX mounts the filesystems automatically.
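For the 'manual' case, the equivalent would simply be an fstab entry (or a mount unit) that the administrator maintains on the nodes themselves; something like the following, reusing the placeholder server and export from the example above, with /trinity/home standing in for trix_home:

# /etc/fstab on the compute nodes, managed outside TrinityX
192.168.0.1:/homes  /trinity/home  nfs  nfsvers=4,rw,retrans=4,_netdev  0  0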
Possibly found out why it wasn't connecting.
Seems the slurm.conf uses deprecated configuration options.
https://slurm.schedmd.com/slurm.conf.html
Replaced them with SlurmctldHost and SlurmctldAddr.
And poof! it worked :wonders:
The GPU node was able to connect to the controller, and sinfo was able to fetch its data.
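For reference, a minimal sketch of that change in slurm.conf (the hostname and address are placeholders, and ControlMachine/ControlAddr as the deprecated counterparts is taken from the slurm.conf documentation linked above, not from this cluster's actual config):

# old, deprecated directives
#ControlMachine=controller
#ControlAddr=192.168.0.1
# replaced with
SlurmctldHost=controller
SlurmctldAddr=192.168.0.1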
Thx! We'll absolutely be looking into this.
-Antoine
We tried to reproduce this using the OpenHPC version of Slurm, even for the upcoming 15.2. The current settings seem to work. When a newer version/release becomes available we will verify it again. I'll close the ticket for now; it can be reopened if in doubt.
To help us reproduce and test this issue internally, could you please confirm:
1- Which OS was used on the controller?
2- Did you mean a Rocky/RHEL-based distro or specifically RHEL 8.10 for the node?
3- Which Slurm version were you running?
Thanks!