
Support for Talos

Open killcity opened this issue 2 years ago • 17 comments

With the dependency on OS binaries such as mount and mkdir, Trident cannot be used with more sophisticated/progressive operating systems like Talos. Will this dependency be lifted at some point?

killcity avatar Feb 21 '23 23:02 killcity

Would also be very interested to know if this would be possible at some point.

Untersander avatar Nov 09 '23 18:11 Untersander

I am also interested in using Astra Trident on a Kubernetes cluster using Talos. Do you have a solution?

Nauno33 avatar Nov 10 '23 22:11 Nauno33

I work for a large retailer (150+ ONTAP clusters), and we'd really like to see Talos support. It would keep NetApp in the running as we evaluate the best storage option for our Kubernetes hybrid cloud environment.

We plan to move everything in that direction.

k999o avatar Jul 31 '24 03:07 k999o

Yes, this is needed. Shelling out on the nodes is not a good option. Getting rid of this dependency would benefit all Linux distros, not just Talos, as they would need far fewer tools installed on them.

stijoh avatar Aug 08 '24 18:08 stijoh

+1 for Talos support.

louhisuo avatar Aug 28 '24 14:08 louhisuo

+1 for Talos support.

redbeard28 avatar Sep 11 '24 08:09 redbeard28

+1 for Talos support.

sempex avatar Sep 20 '24 07:09 sempex

For what it's worth, I managed to mount a Trident share on Talos by using debian:latest as the base image in the Dockerfile (cf. this commit). That makes the basic binaries needed to mount NFS shares (e.g. mkdir, mount, mount.nfs) accessible. This is not ideal, as those binaries probably have some dependency on the host kernel version, but as a workaround it does the job.

There are some limitations though:

  • NFSv3 with locks is not supported, because the rpc.statd daemon is not supported on Talos. That is all documented in https://github.com/siderolabs/talos/issues/6582
  • NFSv3 with the -o nolock mount option does not work either, and I can't explain why. The error message is mount.nfs: Protocol not supported
  • NFSv4 (i.e. mount option nfsvers=4) does work 🎉 and it seems I was able to use locks (tested with flock) across different nodes.

I'm not yet sure whether we are "ready" to change all our current workloads to NFSv4; I have to read this NetApp article on the topic first. But at least we know that it is technically possible to mount a Trident NFS share on Talos.

clementnuss avatar Oct 02 '24 05:10 clementnuss

TL;DR: in theory it's possible, but it's tricky and I'm not going to invest more time in this for the time being.

Here are my latest findings:

  • At some point I thought the Protocol not supported error was caused by the nfsv3 or nfsv4 kernel modules not being found, but making them discoverable didn't help. For reference, mounting /lib/modules (from Talos) on /lib/modules (trident-main container) makes those kernel modules discoverable by tools such as modinfo.
  • The Protocol not supported error disappeared when I copied the /etc/protocols file from the kubelet rootfs to the trident-main container (to be precise, the file was at /run/containerd/io.containerd.runtime.v2.task/system/kubelet/rootfs/etc/protocols; thanks to strace for finding that out).
  • The nfs-utils binaries (which include mount.nfs and rpc.statd) can be installed as described in this commit, and they do work.
  • NFSv3 with locks can work, provided you have started the rpcbind and rpc.statd daemons.
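
To make the findings above easier to verify on a given image, here is a minimal Go sketch of a preflight check that reports which of the binaries and files mentioned in this thread are missing inside the container. It is only an illustration: the function name missingPrereqs and the exact list of names are my assumptions, not something Trident ships.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// missingPrereqs returns the binaries that cannot be found on PATH and the
// files that cannot be stat'ed, in the order they were given.
// Hypothetical helper for illustration; not part of Trident.
func missingPrereqs(binaries, files []string) []string {
	var missing []string
	for _, b := range binaries {
		// exec.LookPath searches the directories listed in PATH.
		if _, err := exec.LookPath(b); err != nil {
			missing = append(missing, b)
		}
	}
	for _, f := range files {
		if _, err := os.Stat(f); err != nil {
			missing = append(missing, f)
		}
	}
	return missing
}

func main() {
	// Names taken from this thread; adjust for your own setup.
	missing := missingPrereqs(
		[]string{"mkdir", "mount", "umount", "mount.nfs"},
		[]string{"/etc/protocols"},
	)
	if len(missing) > 0 {
		fmt.Println("missing prerequisites:", missing)
		return
	}
	fmt.Println("all prerequisites found")
}
```

Running this inside the trident-main container would show at a glance whether /etc/protocols or one of the mount helpers is absent. It says nothing about whether rpcbind or rpc.statd are running.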

All of that being said, we are currently putting our Trident exploration on hold, and might get back to this issue later. Solving it would require one of the following:

  1. Building a system extension with the rpcbind and rpc.statd daemons, which is not trivial, partly because building those from scratch against the musl library apparently requires some adaptations.
  2. Starting those daemons in a dedicated pod (e.g. in a DaemonSet with hostNetwork). However, given how critical those daemons would be for locks, we do not want to venture in this direction.

Option 1 is much cleaner than option 2, but requires too much development at this stage.

clementnuss avatar Oct 08 '24 09:10 clementnuss

+1 for Talos support.

paultrantelus avatar Mar 25 '25 18:03 paultrantelus

+1 for Talos support

CaptainQwark avatar May 13 '25 11:05 CaptainQwark

What protocols do you have an interest in being supported in Talos? This could help with prioritization.

torirevilla avatar May 19 '25 17:05 torirevilla

We would use the following protocols in our talos environment:

  1. NFSv4 (including pNFS)
  2. NVMe/TCP

magicite avatar May 19 '25 18:05 magicite

+1 for Talos support.

Shelling out is double plus ungood. Why not make use of the golang os.MkdirAll library function for mkdir purposes?

praxiscode avatar May 29 '25 21:05 praxiscode

I was able to get things to work on Talos by creating my own customized container image using the following Dockerfile:

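ARG TRIDENT_TAG=25.02.1
FROM netapp/trident:${TRIDENT_TAG} AS trident
# Debian stage: source of mkdir, /etc/protocols and /etc/services.
FROM library/debian:latest AS debian
# The heredoc form of RUN requires BuildKit.
RUN <<EOF
set -e # Fail on any error
apt-get update
apt-get --no-install-recommends install -y netbase nfs-common
EOF
# Final image: the stock Trident image plus the files copied below.
FROM trident AS build
COPY --from=debian /usr/bin/mkdir /usr/bin/
# /bin and /sbin are not on the node pod's PATH, so place the mount helpers in /usr/bin.
COPY --from=trident /bin/mount /bin/umount /sbin/mount.nfs /sbin/mount.nfs4 /usr/bin/
COPY --from=debian /etc/protocols /etc/services /etc/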

I provide the following caveats:

  1. We are only using NFSv4.2, not any of the other capabilities of Trident.
  2. The PATH of the running Trident node pods does not include /bin/ and /sbin/, hence the need for the COPY --from=trident statement.
  3. It is necessary to update the Helm chart values with a tridentImage: entry to pull the customized container image.

praxiscode avatar Jun 03 '25 16:06 praxiscode

Has anyone already tried Trident on Talos using iSCSI?

  • The connection to NetApp works: it creates LUNs.
  • PVCs can be created, and each one correctly gets a PV.

But as soon as we try to mount the volume in a pod, we get: failed to stage volume: multipathd is not running

Events:
  Type     Reason                  Age              From                     Message
  ----     ------                  ----             ----                     -------
  Normal   Scheduled               17s              default-scheduler        Successfully assigned default/pvc-tester to sr-os02
  Normal   SuccessfulAttachVolume  16s              attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-5f28c41e-7343-47ad-ba2d-981d295be434"
  Warning  FailedMount             3s (x4 over 7s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-5f28c41e-7343-47ad-ba2d-981d295be434" : rpc error: code = Internal desc = rpc error: code = Internal desc = failed to stage volume: multipathd is not running

SimLi1333 avatar Oct 29 '25 13:10 SimLi1333