
Slurm doesn't start on RHEL 8.10 node

xdkreij opened this issue 10 months ago • 6 comments

[issue]

Related to https://github.com/clustervision/trinityX/issues/416

[screenshot]

OS: RHEL 8.10
Trix version: '15'

[cause] unknown

[remarkable] should there be config files listed here??

[screenshots]

[reproduce]

playbooks executed:

ansible-playbook -i hosts controller.yml -vv -k
ansible-playbook default-compute.yml -v

boot node --> booted node --> slurmd failed
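For reference, a minimal way to inspect the failure on the node itself (standard systemd tooling; nothing TrinityX-specific assumed):

# On the failed compute node: show the slurmd unit state and the log
# lines from the current boot that explain why it did not start.
systemctl status slurmd
journalctl -u slurmd -b --no-pager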

xdkreij avatar Apr 08 '25 13:04 xdkreij

[screenshot]

not sure why slurm config is missing...

xdkreij avatar Apr 08 '25 14:04 xdkreij

Manually adding the slurm config from the controller

[screenshot]
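As a sketch of what "manually adding the slurm config" could look like, assuming the stock /etc/slurm paths and a hypothetical 'controller' hostname (adjust to wherever your TrinityX installation keeps the files):

# Copy the Slurm configuration from the controller to the node and restart
# the daemon; paths and hostname are assumptions, not confirmed TrinityX defaults.
scp controller:/etc/slurm/slurm.conf /etc/slurm/slurm.conf
systemctl restart slurmd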

Now slurm can continue until munge issues kick in...

munged: Error: Failed to check keyfile "/trinity/shared/etc/mung...

/trinity/... isn't being shared at all. Possibly due to having this one on an external NFS share instead of on the controller itself.
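A quick sanity check on the node, assuming the keyfile directory implied by the (truncated) error above:

# Confirm the munge key directory is actually visible under the shared tree
# on the node; if /trinity/shared isn't mounted, this will be empty or missing.
ls -l /trinity/shared/etc/munge/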

xdkreij avatar Apr 09 '25 07:04 xdkreij

What exactly do you mean by "having this one on an external NFS share"? Did you mount this share on the controller as well, at the correct path, at install time? And is it properly mounted on all of your compute nodes? Most of the installation depends on having the /trinity/shared path mounted on the nodes during boot.
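A minimal way to verify that, assuming the default /trinity/shared path and a hypothetical NFS server name:

# On the controller and on every compute node: the shared tree must be
# mounted at the same path before the Slurm/munge services start.
findmnt /trinity/shared
# If the share lives on an external NFS server, check what it exports.
showmount -e nfs-server.example.com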

bartlamboo avatar Apr 09 '25 07:04 bartlamboo

Hi Bart,

In this case, I am testing with the entire /trinity folder on an external NFS share, as we are implementing an HA setup. It seems (obviously) that the controller won't re-share that.

[screenshot]

I am therefore testing with the new Luna roles to mount this (I haven't got AWX set up yet for an ansible-callback daemon).

I am wondering if certain mounts would be better off as a pre-, part- or postscript.
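Purely as a sketch of that idea (not an existing TrinityX role or script), the body of such a node postscript could be little more than the NFS mount itself, with server, export and options as site-specific placeholders:

# Hypothetical postscript body: mount the shared tree during provisioning
# so slurmd and munged find their config and key at boot.
mkdir -p /trinity/shared
mount -t nfs -o nfsvers=4,rw,retrans=4,_netdev nfs-server.example.com:/trinity/shared /trinity/shared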

xdkreij avatar Apr 09 '25 07:04 xdkreij

# When 'type' is set to 'manual', this indicates that this is being
# taken care off outside TrinityX

This means that the administrator is responsible for mounting all the relevant filesystems on the nodes.

#   - name: '{{ trix_home }}'
#     mount: '{{ trix_home }}'
#     type: 'nfs'
#     remote: '192.168.0.1:/homes'
#     options: 'nfsvers=4,rw,retrans=4,_netdev'

The config example above, on the other hand, gives you a setup in which TrinityX mounts the filesystems automatically.
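In the 'manual' case that responsibility boils down to something like the following on each node (sketch only; the export is taken from the example above, and trix_home is assumed to resolve to /trinity/home):

# Add a persistent mount for the shared filesystem and mount it now;
# TrinityX itself does nothing when type is set to 'manual'.
echo '192.168.0.1:/homes /trinity/home nfs nfsvers=4,rw,retrans=4,_netdev 0 0' >> /etc/fstab
mount /trinity/home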

bartlamboo avatar Apr 09 '25 08:04 bartlamboo

Possibly found out why it wasn't connecting.

It seems slurm.conf uses deprecated configuration options.

https://slurm.schedmd.com/slurm.conf.html

[screenshot]

Replaced them with

SlurmctldHost and SlurmctldAddr
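For anyone else hitting this, the change in slurm.conf is roughly the following; the hostname and address are placeholders, and the exact syntax is in the slurm.conf man page linked above:

# Deprecated parameters (previously in slurm.conf):
#   ControlMachine=<controller-hostname>
#   ControlAddr=<controller-address>
# Replacement; an optional address can be given in parentheses:
SlurmctldHost=<controller-hostname>(<controller-address>)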

And poof! it worked :wonders:

The GPU node was able to connect to the controller, and sinfo was able to fetch its data.

xdkreij avatar Apr 11 '25 12:04 xdkreij

Thx! We'll absolutely be looking into this.

-Antoine

aphmschonewille avatar May 14 '25 23:05 aphmschonewille

We tried to reproduce this using the ohpc version of Slurm, even for the upcoming 15.2. The current settings seem to work. When a newer version/release becomes available we will verify it again. I'll close the ticket for now; it can be reopened if in doubt.

aphmschonewille avatar Oct 11 '25 01:10 aphmschonewille

To help us reproduce and test this issue internally, could you please confirm:

1- Which OS was used on the controller?

2- Did you mean a Rocky/RHEL-based distro or specifically RHEL 8.10 for the node?

3- Which Slurm version were you running?

Thanks!

omarelkady226 avatar Oct 15 '25 15:10 omarelkady226