How to run Sysbox in containerd > 2.0

Open kangyanzhou opened this issue 6 months ago • 15 comments

Hi team,

I recently tried to use Sysbox in our k8s cluster with the containerd runtime. The CRI-O based setup already works, but I want to see whether I can use containerd instead so that our cluster setup diverges as little as possible. I upgraded containerd to > 2.0.0 so that it has userns support. I ran into the same issue as in https://github.com/nestybox/sysbox/issues/950.

How should I supply the same annotations for containerd? Thanks!

Ubuntu version: 22.04 Kernel version: 5.15

kangyanzhou avatar Aug 05 '25 03:08 kangyanzhou

Having Sysbox running with containerd would be a great enhancement. At the moment, the setup with the patched CRI-O is a bit cumbersome.

gadiener avatar Aug 07 '25 12:08 gadiener

I investigated this problem a bit from the Sysbox side. I compared how CRI-O and containerd send the container spec to sysbox-runc, specifically the Linux capabilities. Here are the results:

CRIO + SYSBOX

2025/08/15 13:10:01 SYSBOX DEBUG process cap spec: specs.LinuxCapabilities{Bounding:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}, Effective:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}, Inheritable:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", 
"CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}, Permitted:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}, Ambient:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}}

However, with containerd:

Containerd + SYSBOX

2025/08/15 13:09:29 SYSBOX DEBUG process cap spec: specs.LinuxCapabilities{Bounding:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}, Effective:[]string{}, Inheritable:[]string{}, Permitted:[]string{}, Ambient:[]string{}}

As you can see the

Effective:[]string{}, Inheritable:[]string{}, Permitted:[]string{}, Ambient:[]string{}}

are all empty for containerd, which I suspect prevents the container from mounting /proc and /sys properly. I'm not sure whether that's useful, but any help would be appreciated. Thanks!
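For anyone who wants to repeat this comparison without eyeballing the dumps, here is a minimal Python sketch that pulls the five capability sets out of a `SYSBOX DEBUG process cap spec` line and reports which ones are empty. The helper names are made up for illustration, and the sample line is shortened to two capabilities in the bounding set; it assumes the Go-formatted layout shown in the logs above.

```python
import re

# The five sets printed in the specs.LinuxCapabilities dump.
CAP_SETS = ("Bounding", "Effective", "Inheritable", "Permitted", "Ambient")


def parse_cap_sets(log_line):
    """Return {set name: [capabilities]} for a Go-formatted capabilities dump."""
    sets = {}
    for name in CAP_SETS:
        # Matches `Effective:[]string{"CAP_CHOWN", ...}` as well as `Effective:[]string{}`.
        match = re.search(name + r':\[\]string\{([^}]*)\}', log_line)
        sets[name] = re.findall(r'"(CAP_[A-Z_]+)"', match.group(1)) if match else []
    return sets


def empty_sets(log_line):
    """Return the names of the capability sets that contain no entries."""
    return [name for name, caps in parse_cap_sets(log_line).items() if not caps]


# Shortened sample in the same shape as the containerd log line above.
containerd_line = (
    'specs.LinuxCapabilities{Bounding:[]string{"CAP_CHOWN", "CAP_SYS_ADMIN"}, '
    'Effective:[]string{}, Inheritable:[]string{}, Permitted:[]string{}, '
    'Ambient:[]string{}}'
)
print(empty_sets(containerd_line))
```

Feeding it the full CRI-O line should give an empty list, while the containerd line reports every set except the bounding one.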

galal-hussein avatar Aug 22 '25 22:08 galal-hussein

Note: related to: https://github.com/nestybox/sysbox/issues/465#issuecomment-1956632337

ctalledo avatar Aug 29 '25 21:08 ctalledo

Hi @kangyanzhou, thanks for filing the issue.

I ran into the same issue as in https://github.com/nestybox/sysbox/issues/950.

That means you run into an error such as:

level=error msg="container_linux.go:439: starting container process caused: process_linux.go:608: container init caused: rootfs_linux.go:71: setting up rootfs mounts caused: rootfs_linux.go:1220: mounting \"sysfs\" to rootfs \"/var/lib/sysbox/rootfs/8162946f0eae9098a353d16b61d3341472347f6762c0ada18cc48f8f2c91c872/overlay2/merged\" at \"sys\" caused: mount through procfd: operation not permitted"

That likely indicates an issue in the user-namespace setup by containerd. It's expected given that Sysbox does not yet support containerd user-namespaces.

This is a task we should look into now; moving it to the "Investigation" phase.

ctalledo avatar Aug 29 '25 21:08 ctalledo

Update: seems like sysbox-runc PR 106, courtesy of @galal-hussein, may be sufficient to fix this.

ctalledo avatar Aug 29 '25 22:08 ctalledo

Do we have a guide on running Sysbox with containerd?

bindrad avatar Sep 09 '25 15:09 bindrad

Do we have a guide on running Sysbox with containerd?

Just update your /etc/containerd/config.toml as documented in PR 106 and restart containerd. That's all. I assume you have containerd v2 and an updated sysbox-runc binary in place. For now you have to build sysbox-runc from the master branch yourself. Hopefully new binaries will be released soon.

mueckinger avatar Sep 12 '25 17:09 mueckinger

Is there a way to run sysbox with containerd v1.6-1.7?

bindrad avatar Sep 16 '25 19:09 bindrad

As far as I understand, Sysbox uses all Linux namespaces. User namespace support is only experimental in containerd v1.7 and has been stable since v2.0. I believe that's why Sysbox had this crazy fallback with the k8s daemonset, which installs CRI-O through the back door on containerd-based clusters, since CRI-O has supported user namespaces all along. So the short answer is no. Jump on the containerd v2+ bandwagon; v2 was already released about a year ago. BTW, v1.6 is EOL.
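If you want to gate a node setup script on this, a rough Python sketch of the version check could look like the following. The function name is hypothetical, and it only looks at the major version of whatever version string you hand it (for example the one your containerd binary reports); it does not query containerd itself.

```python
import re


def supports_stable_userns(version):
    """Return True if the containerd version string is 2.0 or newer.

    Accepts strings with or without a leading "v", e.g. "v2.1.4" or "1.7.27".
    """
    match = re.search(r'(\d+)\.(\d+)(?:\.(\d+))?', version)
    if match is None:
        raise ValueError(f"no version number found in {version!r}")
    major = int(match.group(1))
    # User namespace support is described above as stable since v2.0.
    return major >= 2


for candidate in ("v2.1.4", "1.7.27", "1.6.0"):
    print(candidate, supports_stable_userns(candidate))
```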

mueckinger avatar Sep 19 '25 14:09 mueckinger

@ctalledo when can we expect the main repo to be updated with the fixes? It would really help https://github.com/rancher/k3k/issues/582 to have a release that we can integrate into our solution. Thank you!

galal-hussein avatar Nov 28 '25 12:11 galal-hussein

@mueckinger I saw your comment about being able to run sysbox with containerd 2.1.4 in k8s v1.33.4. Does that mean you are setting spec.hostUsers=false for the sysbox-runc pods and setting idsPerPod in KubeletConfiguration? Do you also need to do anything for /etc/subuid and /etc/subgid according to https://kubernetes.io/docs/concepts/workloads/pods/user-namespaces/ ?

DekusDenial avatar Dec 06 '25 01:12 DekusDenial

@DekusDenial I think it depends on your node's OS. In my case (Ubuntu 24.04 LTS), everything related to subuid/subgid was already set up and it worked with the defaults. No changes to the kubelet config either. In the podSpec you have to set:

      runtimeClassName: sysbox-runc
      hostUsers: false

Don't forget to apply the RuntimeClass which creates the link between the runtime name in the podSpec and the runtime defined in containerd/config.toml:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: sysbox-runc
handler: sysbox-runc
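To show where the two podSpec fields sit relative to the rest of a manifest, here is a small Python sketch that builds a complete Pod object and prints it as JSON, which kubectl accepts as well as YAML. The pod name and image are placeholders chosen for illustration, not something from my cluster.

```python
import json


def sysbox_pod_manifest(name, image):
    """Build a minimal Pod manifest that requests the sysbox-runc runtime."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Must match the RuntimeClass name applied to the cluster.
            "runtimeClassName": "sysbox-runc",
            # Run the pod in its own user namespace.
            "hostUsers": False,
            "containers": [{"name": name, "image": image}],
        },
    }


# Placeholder name and image, for illustration only.
manifest = sysbox_pod_manifest("sysbox-demo", "ubuntu:24.04")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`.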

mueckinger avatar Dec 06 '25 20:12 mueckinger

@DekusDenial I think it depends on your node's OS. In my case (Ubuntu 24.04 LTS), everything related to subuid/subgid was already set up and it worked with the defaults. No changes to the kubelet config either. In the podSpec you have to set:

      runtimeClassName: sysbox-runc
      hostUsers: false

Don't forget to apply the RuntimeClass which creates the link between the runtime name in the podSpec and the runtime defined in containerd/config.toml:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: sysbox-runc
handler: sysbox-runc

Is your UID range always fixed at 65536? I believe a fix was going into kubelet in upstream k8s, but my vendor's k8s cluster still doesn't have it, hence my question about idsPerPod in KubeletConfiguration.

DekusDenial avatar Dec 06 '25 23:12 DekusDenial

I just checked. This is the content of my /etc/subuid and /etc/subgid:

kubelet:65536:7208960
...
sysbox:362144:65536

I have no idsPerPod parameter in my kubelet config.yaml.
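As a quick sanity check on those numbers, this Python sketch parses subuid-style entries and works out how many 65536-ID pod ranges fit into the kubelet range. The function name is hypothetical and the sample text only contains the two entries shown above (the elided lines are left out).

```python
IDS_PER_POD = 65536  # range size per pod discussed in this thread


def parse_subid(text):
    """Parse `name:start:count` lines into {name: (start, count)}."""
    entries = {}
    for line in text.splitlines():
        parts = line.strip().split(":")
        # Skip blank, elided ("...") or otherwise malformed lines.
        if len(parts) != 3 or not parts[1].isdigit() or not parts[2].isdigit():
            continue
        entries[parts[0]] = (int(parts[1]), int(parts[2]))
    return entries


sample = "kubelet:65536:7208960\n...\nsysbox:362144:65536\n"
entries = parse_subid(sample)
start, count = entries["kubelet"]
print(count // IDS_PER_POD)  # number of full pod-sized ranges
```

With the values above this comes out to 110 ranges of 65536 IDs each.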

mueckinger avatar Dec 09 '25 19:12 mueckinger

I just checked. This is the content of my /etc/subuid and /etc/subgid:

kubelet:65536:7208960
...
sysbox:362144:65536

I have no idsPerPod parameter in my kubelet config.yaml.

~~What’s the content of the /proc/self/uid_map inside the pod?~~ Actually, if idsPerPod is not set, the default is 65536 anyway.
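For anyone who does want to check the mapping from inside a pod, here is a minimal Python sketch that parses uid_map-formatted text (three columns: ID inside the namespace, ID outside, length) and totals the mapped IDs. The function names and the sample mapping are made up for illustration.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    mappings = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            mappings.append(tuple(int(field) for field in fields))
    return mappings


def mapped_ids(text):
    """Total number of IDs covered by all mapping lines."""
    return sum(length for _inside, _outside, length in parse_id_map(text))


# Hypothetical mapping: container IDs 0..65535 backed by host IDs starting at 131072.
sample_map = "         0     131072      65536\n"
print(parse_id_map(sample_map))
print(mapped_ids(sample_map))
```

Inside the pod, the real input would be the contents of /proc/self/uid_map, for example `open("/proc/self/uid_map").read()`.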

DekusDenial avatar Dec 09 '25 21:12 DekusDenial

@mueckinger when running sysbox 0.6.7 with containerd 2+, I have noticed that the mount /proc/sys/fs/binfmt_misc is somehow read-only, but that's not the case for sysbox 0.6.7 with cri-o 1.33, where the mount is rw. Do you see the same?

DekusDenial avatar Dec 19 '25 15:12 DekusDenial

@mueckinger when running sysbox 0.6.7 with containerd 2+, I have noticed that the mount /proc/sys/fs/binfmt_misc is somehow read-only, but that's not the case for sysbox 0.6.7 with cri-o 1.33, where the mount is rw. Do you see the same?

@DekusDenial, I can confirm the folder is read-only.
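In case others want to check this on their own setup, here is a minimal Python sketch that reads mount-table text in the /proc/self/mounts layout and reports whether a given mount point carries the `ro` option. The function names are hypothetical, and the sample lines are illustrative rather than copied from a real container.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list of the last mount at mount_point, or None if absent."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Layout: device mount_point fstype options dump pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later mount shadows an earlier one
    return options


def is_read_only(mounts_text, mount_point):
    """True if mount_point is present and mounted with the `ro` option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options


# Illustrative sample, not taken from a real container.
SAMPLE_MOUNTS = """\
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc ro,nosuid,nodev,noexec,relatime 0 0
"""
print(is_read_only(SAMPLE_MOUNTS, "/proc/sys/fs/binfmt_misc"))
```

Inside the container, pass `open("/proc/self/mounts").read()` instead of the sample text and compare the result between the containerd and cri-o setups.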

mueckinger avatar Dec 20 '25 17:12 mueckinger