awsome-distributed-training

SMHP Slurm clusters with OZFS file system: unable to SSH into instances

Open amanshanbhag opened this issue 1 year ago • 2 comments

SSH into compute nodes from the cluster head node doesn't work. This is most likely because the SSH keys are stored in /fsx/<user> and need to be copied over to /home/<user>.

amanshanbhag, Apr 30 '25 18:04

Potential permissions issue: authorized_keys currently has mode 644, but it should be 600. Need to test why this is happening (and why it works without OZFS mounted).
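A minimal sketch for checking the mode described above. `check_mode` is a hypothetical helper, not something from the repository, and `stat -c` assumes GNU coreutils on a Linux node.

```shell
#!/usr/bin/env bash
# Sketch: report whether a file has the expected octal mode.
# check_mode is a hypothetical helper; `stat -c` assumes GNU coreutils (Linux).
check_mode() {
    local path="$1" expected="${2:-600}" actual
    # Read the octal permission bits; return 2 if the file cannot be stat'ed.
    actual="$(stat -c '%a' "$path")" || return 2
    if [ "$actual" = "$expected" ]; then
        echo "OK: $path is $actual"
    else
        echo "MISMATCH: $path is $actual, expected $expected"
        return 1
    fi
}
```

For example, `check_mode ~/.ssh/authorized_keys` could be run on a node with and without OZFS mounted to compare the two cases.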

amanshanbhag, Apr 30 '25 23:04

The following works:

sudo ln -s /fsx/ubuntu/.ssh /home/ubuntu/.ssh && sudo chmod 600 ~/.ssh/authorized_keys

Still need to test why the permissions on authorized_keys were changed (test without OZFS).
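The workaround above could be wrapped in a re-runnable form along these lines. This is only a sketch: `link_shared_ssh` is a hypothetical helper name, it is not the change that eventually landed in the repository, and it assumes GNU `ln` (for `-n`).

```shell
#!/usr/bin/env bash
# Sketch of the workaround: point <home>/.ssh at the shared .ssh directory
# and set authorized_keys to mode 600.
# link_shared_ssh is a hypothetical helper, not part of the repository.
link_shared_ssh() {
    local shared_ssh="$1" home_dir="$2"
    [ -d "$shared_ssh" ] || { echo "missing: $shared_ssh" >&2; return 1; }
    # Refuse to clobber a real directory; only create or refresh a symlink.
    if [ -e "$home_dir/.ssh" ] && [ ! -L "$home_dir/.ssh" ]; then
        echo "refusing to replace existing $home_dir/.ssh" >&2
        return 1
    fi
    # -n keeps ln from descending into an existing symlinked directory.
    ln -sfn "$shared_ssh" "$home_dir/.ssh"
    chmod 600 "$shared_ssh/authorized_keys"
}
```

On a cluster this would be called as `link_shared_ssh /fsx/ubuntu/.ssh /home/ubuntu`, likely from a root shell, since the original command needed sudo.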

amanshanbhag, May 01 '25 16:05

Is this fixed?

mhuguesaws, May 22 '25 19:05

Fixed in #700

amanshanbhag, May 27 '25 19:05