
Slurm doesn't start on RHEL 8.10 node

xdkreij opened this issue 10 months ago • 6 comments

[issue]

Related to https://github.com/clustervision/trinityX/issues/416

[screenshot]

OS: RHEL 8.10
Trix version: '15'

[cause] unknown

[remarkable] should there be config files listed here??

[screenshots]

[reproduce]

playbooks executed:

ansible-playbook -i hosts controller.yml -vv -k
ansible-playbook default-compute.yml -v

boot node --> booted node --> slurmd failed
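For reference, a minimal way to inspect the failure on the node itself (standard systemd tooling; nothing TrinityX-specific assumed):

# On the failed compute node: show the slurmd unit state and the log
# lines from the current boot that explain why it did not start.
systemctl status slurmd
journalctl -u slurmd -b --no-pager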

xdkreij avatar Apr 08 '25 13:04 xdkreij

[screenshot]

not sure why slurm config is missing...

xdkreij avatar Apr 08 '25 14:04 xdkreij

Manually adding the slurm config from the controller

[screenshot]
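As a sketch of what "manually adding the slurm config" could look like, assuming the stock /etc/slurm paths and a hypothetical 'controller' hostname (adjust to wherever your TrinityX installation keeps the files):

# Copy the Slurm configuration from the controller to the node and restart
# the daemon; paths and hostname are assumptions, not confirmed TrinityX defaults.
scp controller:/etc/slurm/slurm.conf /etc/slurm/slurm.conf
systemctl restart slurmd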

Now slurm can continue until munge issues kick in...

munged: Error: Failed to check keyfile "/trinity/shared/etc/mung...

/trinity/... isn't being shared at all. Possibly due to having this one on an external NFS share instead of on the controller itself.
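A quick sanity check on the node, assuming the keyfile directory implied by the (truncated) error above:

# Confirm the munge key directory is actually visible under the shared tree
# on the node; if /trinity/shared isn't mounted, this will be empty or missing.
ls -l /trinity/shared/etc/munge/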

xdkreij avatar Apr 09 '25 07:04 xdkreij

What exactly do you mean by "having this one on an external NFS share"? Did you mount this share on the controller as well, at the correct path, at install time? And is it properly mounted on all of your compute nodes? Most of the installation depends on having the /trinity/shared path mounted on the nodes during boot.
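A minimal way to verify that, assuming the default /trinity/shared path and a hypothetical NFS server name:

# On the controller and on every compute node: the shared tree must be
# mounted at the same path before the Slurm/munge services start.
findmnt /trinity/shared
# If the share lives on an external NFS server, check what it exports.
showmount -e nfs-server.example.com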

bartlamboo avatar Apr 09 '25 07:04 bartlamboo

Hi Bart,

In this case, I am testing with the entire /trinity folder on an external NFS share, as we are implementing an HA setup. It seems (obviously) that the controller won't re-share that.

[screenshot]

I am therefore testing with the new Luna roles to mount this (I haven't got AWX set up yet for an ansible-callback daemon).

I am wondering if certain mounts would be better off as a pre-, part- or postscript.
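Purely as a sketch of that idea (not an existing TrinityX role or script), the body of such a node postscript could be little more than the NFS mount itself, with server, export and options as site-specific placeholders:

# Hypothetical postscript body: mount the shared tree during provisioning
# so slurmd and munged find their config and key at boot.
mkdir -p /trinity/shared
mount -t nfs -o nfsvers=4,rw,retrans=4,_netdev nfs-server.example.com:/trinity/shared /trinity/shared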

xdkreij avatar Apr 09 '25 07:04 xdkreij

# When 'type' is set to 'manual', this indicates that this is being
# taken care off outside TrinityX

This means that the administrator is responsible for mounting all the relevant filesystems on the nodes.

#   - name: '{{ trix_home }}'
#     mount: '{{ trix_home }}'
#     type: 'nfs'
#     remote: '192.168.0.1:/homes'
#     options: 'nfsvers=4,rw,retrans=4,_netdev'

The config example above, on the other hand, gives you a setup in which TrinityX mounts the filesystems automatically.
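In the 'manual' case that responsibility boils down to something like the following on each node (sketch only; the export is taken from the example above, and trix_home is assumed to resolve to /trinity/home):

# Add a persistent mount for the shared filesystem and mount it now;
# TrinityX itself does nothing when type is set to 'manual'.
echo '192.168.0.1:/homes /trinity/home nfs nfsvers=4,rw,retrans=4,_netdev 0 0' >> /etc/fstab
mount /trinity/home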

bartlamboo avatar Apr 09 '25 08:04 bartlamboo

Possibly found out why it wasn't connecting.

It seems slurm.conf uses deprecated configuration options.

https://slurm.schedmd.com/slurm.conf.html

[screenshot]

Replaced them with

SlurmctldHost and SlurmctldAddr
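For anyone else hitting this, the change in slurm.conf is roughly the following; the hostname and address are placeholders, and the exact syntax is in the slurm.conf man page linked above:

# Deprecated parameters (previously in slurm.conf):
#   ControlMachine=<controller-hostname>
#   ControlAddr=<controller-address>
# Replacement; an optional address can be given in parentheses:
SlurmctldHost=<controller-hostname>(<controller-address>)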

And poof! it worked :wonders:

The GPU node was able to connect to the controller, and sinfo was able to fetch its data.

xdkreij avatar Apr 11 '25 12:04 xdkreij

Thx! We'll absolutely be looking into this.

-Antoine

aphmschonewille avatar May 14 '25 23:05 aphmschonewille

We tried to reproduce this using the ohpc version of Slurm, even for the upcoming 15.2. The current settings seem to work. When a newer version/release becomes available we will verify it again. I'll close the ticket for now; it can be reopened if in doubt.

aphmschonewille avatar Oct 11 '25 01:10 aphmschonewille

To help us reproduce and test this issue internally, could you please confirm:

1- Which OS was used on the controller?

2- Did you mean a Rocky/RHEL-based distro or specifically RHEL 8.10 for the node?

3- Which Slurm version were you running?

Thanks!

omarelkady226 avatar Oct 15 '25 15:10 omarelkady226