Syncing users causes problems if UIDs/GIDs differ between the SMS and compute nodes.
Page 26 of the document CentOS8.2 Base OS xCAT/SLURM Edition for Linux (rev 1e970dd16) instructs the reader to sync users between the SMS and compute nodes as follows:
# Create a sync file for pushing user credentials to the nodes
[sms]# echo "MERGE:" > syncusers
[sms]# echo "/etc/passwd -> /etc/passwd" >> syncusers
[sms]# echo "/etc/group -> /etc/group" >> syncusers
[sms]# echo "/etc/shadow -> /etc/shadow" >> syncusers
# Use xCAT to distribute credentials to nodes
[sms]# xdcp compute -F syncusers
This causes problems if, for some reason, the uids/gids below 1000 are not identical between the SMS and compute nodes. For example, the gid of the ssh_keys group may differ, in which case syncing makes the compute nodes inaccessible over ssh. Likewise, the uid/gid of munge may differ, which prevents munge from starting on the compute nodes, and with it Slurm, which depends on munge.
Ideally, we should only sync normal users and groups (uid/gid >=1000 for CentOS8). The method is described in this issue in xCAT repo: https://github.com/xcat2/xcat-core/issues/6108
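The 1000 threshold does not have to be hard-coded: on most distributions the regular-account range is recorded in /etc/login.defs as UID_MIN/UID_MAX (and GID_MIN/GID_MAX). Below is a minimal sketch of reading that range and filtering a passwd- or group-format file with it. The function names `login_def` and `filter_regular` are hypothetical, and the fallback values are assumptions, not something the recipe prescribes.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of xCAT or OpenHPC.

# Print the value of a login.defs key (e.g. UID_MIN), or a fallback if the
# key is missing, commented out, or the file cannot be read.
login_def() {
    local key=$1 fallback=$2 file=${3:-/etc/login.defs}
    local val
    val=$(awk -v k="$key" '$1 == k { print $2; exit }' "$file" 2>/dev/null)
    echo "${val:-$fallback}"
}

# Print the entries of a passwd/group-format file whose third field
# (uid or gid) lies in the inclusive range [min, max].
filter_regular() {
    local file=$1 min=$2 max=$3
    awk -F: -v lo="$min" -v hi="$max" '$3 >= lo && $3 <= hi' "$file"
}
```

For example, `filter_regular /etc/passwd "$(login_def UID_MIN 1000)" "$(login_def UID_MAX 60000)"` would print only the regular users. The upper bound also drops accounts such as nobody (65534), which a plain `>= 1000` test would include.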
We try to deal with this by also recommending that the passwd/group files be copied into the $CHROOT before installing packages such as slurm/munge, so that uids/gids stay consistent between the head node and compute nodes. This works in our CI testing for the forthcoming 2.1 release.
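As a rough illustration of that ordering, the copy step could be wrapped in a small helper that is run before any package is installed into the image. This is a sketch under assumptions: `seed_chroot_ids` is a hypothetical name, and the exact commands in the recipe may differ.

```shell
#!/bin/bash
# Sketch only: seed_chroot_ids is a hypothetical helper, not a recipe command.

# Copy the head node's passwd and group files into an image root so that
# packages installed into the image afterwards reuse the same uids/gids.
# $1 = image root (e.g. $CHROOT), $2 = source directory (default /etc).
seed_chroot_ids() {
    local chroot=$1 src=${2:-/etc}
    if [ ! -d "$chroot/etc" ]; then
        echo "no etc/ directory under $chroot" >&2
        return 1
    fi
    cp -p "$src/passwd" "$src/group" "$chroot/etc/"
}
```

Calling `seed_chroot_ids "$CHROOT"` first means that when the slurm/munge packages are later installed into the image, their accounts already exist there with the head node's ids instead of being allocated fresh ones.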
I do agree with the original request in https://github.com/xcat2/xcat-core/issues/6108 that it would be convenient if xCAT had a built-in mechanism to only sync normal user IDs and we could take advantage of this in a future ohpc recipe if it existed (as opposed to having to ask the admin to cull out the desired accounts into a temporary set of files each time).
FYI, I am now using the following scripts to work around this issue (for stateful compute nodes). Feel free to add them to the ohpc recipes if you wish.
[root@mgmt syncs]# cat /opt/xcat/syncuser
MERGE:
/tmp/passwd.sync -> /etc/passwd
/tmp/group.sync -> /etc/group
/tmp/shadow.sync -> /etc/shadow
/tmp/gshadow.sync -> /etc/gshadow
[root@mgmt syncs]# cat /admin/scripts/syncs/sync-users
#!/bin/bash
## Only sync regular accounts (uid/gid >= 1000)
sort -t: -k3 -n /etc/group | awk -F: '$3>=1000' > /tmp/group.sync
sort -t: -k3 -n /etc/passwd | awk -F: '$3>=1000' > /tmp/passwd.sync
## shadow is keyed by user name: keep entries whose uid is >= 1000
while IFS= read -r line; do [ "$(id -u "${line%%:*}" 2>/dev/null || echo 0)" -ge 1000 ] && echo "$line"; done < /etc/shadow > /tmp/shadow.sync
## gshadow is keyed by group name: look up the gid, not a uid
while IFS= read -r line; do gid=$(getent group "${line%%:*}" | cut -d: -f3); [ "${gid:-0}" -ge 1000 ] && echo "$line"; done < /etc/gshadow > /tmp/gshadow.sync
xdcp all -F /opt/xcat/syncuser
Following the recipe for the 3.x branch (Rocky 9.x, Slurm, Warewulf), copying the group/passwd files into the chroot no longer works. Roughly following the install recipe, I get this diff between the head node's /etc/group and the one in the image:
--- /etc/group 2023-05-19 16:15:49.686987949 -0400
+++ /opt/ohpc/admin/images/rocky9/etc/group 2023-05-19 16:19:25.607430072 -0400
@@ -30,28 +30,11 @@
systemd-journal:x:190:
systemd-coredump:x:997:
dbus:x:81:
-sssd:x:996:
-polkitd:x:995:
-printadmin:x:994:
-ssh_keys:x:993:
-sgx:x:992:
-libstoragemgmt:x:991:
-systemd-oom:x:990:
-tss:x:59:clevis
-cockpit-ws:x:989:
-cockpit-wsinstance:x:988:
-clevis:x:987:
-setroubleshoot:x:986:
-sshd:x:74:
-slocate:x:21:
-chrony:x:985:
-tcpdump:x:72:
+ssh_keys:x:996:
rpc:x:32:
rpcuser:x:29:
-screen:x:84:
-mysql:x:27:
-apache:x:48:
-warewulf:x:984:apache
-dhcpd:x:177:
-munge:x:983:
+sshd:x:74:
+tss:x:59:
+munge:x:995:
slurm:x:202:
+chrony:x:201:
Specifically, this breaks munge: its gid is 983 on the head node but 995 in the image.
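A full diff like the one above also lists accounts that exist on only one side, which is mostly noise here. A narrower check is to report only the names that exist in both files but with different ids. The following is a minimal sketch; `id_mismatches` is a hypothetical helper name, and it works on any pair of passwd- or group-format files.

```shell
#!/bin/bash
# Sketch only: id_mismatches is a hypothetical helper.

# Print "name head_id image_id" for every name that appears in both files
# with a different third field (uid or gid). Names present in only one of
# the two files are ignored.
# $1 = head node file (e.g. /etc/group), $2 = the same file inside the image.
id_mismatches() {
    local head_file=$1 image_file=$2
    awk -F: 'NR == FNR { id[$1] = $3; next }
             ($1 in id) && id[$1] != $3 { print $1, id[$1], $3 }' \
        "$head_file" "$image_file"
}
```

With the paths from the diff above, this would be run as `id_mismatches /etc/group /opt/ohpc/admin/images/rocky9/etc/group`. Any line it prints (munge, for instance) names an account whose files and daemons will disagree on ownership between the head node and the compute nodes.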