How to run Sysbox in containerd > 2.0
Hi team,
I recently tried to use Sysbox in our k8s cluster with the containerd runtime. The CRI-O based setup already works, but I want to see whether I can use containerd instead, so that our cluster setup diverges as little as possible. I upgraded containerd to > 2.0.0 so that it has userns support, and I ran into the same issue as in https://github.com/nestybox/sysbox/issues/950.
How should I supply the same annotations for containerd? Thanks!
Ubuntu version: 22.04 Kernel version: 5.15
Having Sysbox running with containerd would be a great enhancement. At the moment the setup with the patched CRI-O is a bit cumbersome.
I have investigated this problem a bit from the Sysbox side. I compared how CRI-O and containerd send the container spec to sysbox-runc, specifically the Linux capabilities. Here are the results:
CRIO + SYSBOX
2025/08/15 13:10:01 SYSBOX DEBUG process cap spec: specs.LinuxCapabilities{Bounding:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}, Effective:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}, Inheritable:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", 
"CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}, Permitted:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}, Ambient:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}}
However, with containerd:
Containerd + SYSBOX
2025/08/15 13:09:29 SYSBOX DEBUG process cap spec: specs.LinuxCapabilities{Bounding:[]string{"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF", "CAP_CHECKPOINT_RESTORE"}, Effective:[]string{}, Inheritable:[]string{}, Permitted:[]string{}, Ambient:[]string{}}
As you can see, the Effective, Inheritable, Permitted and Ambient sets
Effective:[]string{}, Inheritable:[]string{}, Permitted:[]string{}, Ambient:[]string{}}
are all empty for containerd, which I suspect is what prevents the container from mounting /proc and /sys properly. I'm not sure whether that's useful, but any help will be appreciated, thanks!
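For anyone who wants to repeat the comparison without eyeballing the dumps, here is a minimal Python sketch that lists which capability sets are empty in a "SYSBOX DEBUG process cap spec" line. The function name and the shortened sample line are my own, not part of Sysbox; the regex only assumes the Go-style struct dump format shown above.

```python
import re

def empty_cap_sets(log_line: str) -> list[str]:
    """Return the names of the capability sets that are empty in a cap spec dump."""
    # Matches e.g. Bounding:[]string{"CAP_CHOWN", "CAP_KILL"} or Effective:[]string{}
    sets = re.findall(r'(\w+):\[\]string\{([^}]*)\}', log_line)
    return [name for name, body in sets if not body.strip()]

# Shortened sample in the same shape as the containerd dump (capability list abbreviated).
sample = ('specs.LinuxCapabilities{Bounding:[]string{"CAP_CHOWN", "CAP_KILL"}, '
          'Effective:[]string{}, Inheritable:[]string{}, '
          'Permitted:[]string{}, Ambient:[]string{}}')

print(empty_cap_sets(sample))
```

Running it against the full CRI-O line should print an empty list, while the containerd line reports the four empty sets.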
Note: related to: https://github.com/nestybox/sysbox/issues/465#issuecomment-1956632337
Hi @kangyanzhou, thanks for filing the issue.
I ran into the same issue as in https://github.com/nestybox/sysbox/issues/950.
That means you ran into an error such as:
level=error msg="container_linux.go:439: starting container process caused: process_linux.go:608: container init caused: rootfs_linux.go:71: setting up rootfs mounts caused: rootfs_linux.go:1220: mounting \"sysfs\" to rootfs \"/var/lib/sysbox/rootfs/8162946f0eae9098a353d16b61d3341472347f6762c0ada18cc48f8f2c91c872/overlay2/merged\" at \"sys\" caused: mount through procfd: operation not permitted"
That likely indicates an issue in the user-namespace setup by containerd. It's expected given that Sysbox does not yet support containerd user-namespaces.
This is a task we should look into now, so I'm moving it to the "Investigation" phase.
Update: seems like sysbox-runc PR 106, courtesy of @galal-hussein, may be sufficient to fix this.
Do we have a guide on running Sysbox with containerd?
Just update your /etc/containerd/config.toml as documented in PR 106 and restart containerd. That's all, assuming you have containerd v2 and an updated sysbox-runc binary in place. For now you have to build sysbox-runc from the master branch yourself. Hopefully new binaries will be released soon.
Is there a way to run sysbox with containerd v1.6-1.7?
As far as I understand, Sysbox relies on all Linux namespaces, including the user namespace. User namespace support is only experimental in containerd v1.7 and has been stable since v2.0. I believe that's why Sysbox had this crazy fallback with the k8s daemonset, which installs CRI-O through the back door on containerd-based clusters, since CRI-O has supported user namespaces all along. So the short answer is no. Jump on the containerd v2+ bandwagon, which was released about a year ago. BTW, v1.6 is EOL.
@ctalledo when can we expect the main repo to be updated with the fixes? Having a release that we can integrate into our solution would really help https://github.com/rancher/k3k/issues/582. Thank you!
@mueckinger I saw your comment about being able to run Sysbox with containerd 2.1.4 in k8s v1.33.4. Does that mean you are setting spec.hostUsers=false for the sysbox-runc pods and setting idsPerPod in KubeletConfiguration? Do you also need to do anything for /etc/subuid and /etc/subgid according to https://kubernetes.io/docs/concepts/workloads/pods/user-namespaces/ ?
@DekusDenial I think it depends on your node's OS. In my case (Ubuntu 24.04 LTS) everything related to subuid/subgid was already set and it worked with the defaults. No changes in the kubelet config either. In the podSpec you have to set:
runtimeClassName: sysbox-runc
hostUsers: false
Don't forget to apply the RuntimeClass which creates the link between the runtime name in the podSpec and the runtime defined in containerd/config.toml:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: sysbox-runc
handler: sysbox-runc
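If you prefer to generate both objects in one go, here is a rough Python sketch that emits the RuntimeClass plus a pod using runtimeClassName and hostUsers: false as JSON, which kubectl also accepts. The pod name and the image are hypothetical placeholders, and the helper names are mine. Only the runtimeClassName, hostUsers and RuntimeClass fields come from this thread.

```python
import json

RUNTIME = "sysbox-runc"  # must match the runtime name defined in containerd/config.toml

def runtime_class(name: str = RUNTIME) -> dict:
    """RuntimeClass linking the podSpec runtime name to the containerd handler."""
    return {
        "apiVersion": "node.k8s.io/v1",
        "kind": "RuntimeClass",
        "metadata": {"name": name},
        "handler": name,
    }

def sysbox_pod(pod_name: str, image: str, runtime: str = RUNTIME) -> dict:
    """Pod that requests the Sysbox runtime and a pod-level user namespace."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            "runtimeClassName": runtime,
            "hostUsers": False,  # serialized as hostUsers: false
            "containers": [{"name": pod_name, "image": image}],
        },
    }

# "sysbox-demo" and the image reference are placeholders, not tested values.
manifests = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [runtime_class(), sysbox_pod("sysbox-demo", "registry.example.com/demo:latest")],
}
print(json.dumps(manifests, indent=2))
```

Saved under a file name of your choice, the output can be piped straight into kubectl apply -f - once the image is replaced with a real one.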
Is your UID range always fixed at 65536? I believe a fix was going into kubelet in upstream k8s, but my vendor's k8s cluster still doesn't have it, hence my question about idsPerPod in KubeletConfiguration.
I just checked. This is the content of my /etc/subuid and /etc/subgid:
kubelet:65536:7208960
...
sysbox:362144:65536
I have no idsPerPod parameter in my kubelet config.yaml.
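As a quick sanity check on those numbers, here is a small Python sketch that parses subuid/subgid entries and works out how many pods the kubelet range can hold at 65536 IDs per pod. The helper name is mine, and the 65536 alignment check reflects my reading of the Kubernetes user-namespaces docs, so treat it as an assumption.

```python
IDS_PER_POD = 65536  # default when idsPerPod is not set, per this thread

def parse_subid(text: str) -> dict[str, tuple[int, int]]:
    """Parse name:start:count lines; malformed lines such as '...' are skipped."""
    entries = {}
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) == 3 and parts[1].isdigit() and parts[2].isdigit():
            entries[parts[0]] = (int(parts[1]), int(parts[2]))
    return entries

sample = "kubelet:65536:7208960\n...\nsysbox:362144:65536\n"
entries = parse_subid(sample)

start, count = entries["kubelet"]
# Assumption: start and count should both be multiples of 65536.
aligned = start % IDS_PER_POD == 0 and count % IDS_PER_POD == 0
print(f"kubelet range aligned: {aligned}, pods that fit: {count // IDS_PER_POD}")
```

With the values above, the kubelet range works out to exactly 110 pods, so the defaults look self-consistent on this node.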
~~What's the content of the /proc/self/uid_map inside the pod?~~ Actually, if idsPerPod is not set, the default is 65536 anyway.
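In case someone still wants to look at the mapping from inside a pod, here is a minimal sketch for reading /proc/self/uid_map. Each row has three fields: the first ID inside the namespace, the first ID outside, and the length of the range. The sample row is made up for illustration and is not taken from a real pod.

```python
def parse_uid_map(text: str) -> list[tuple[int, int, int]]:
    """Parse uid_map rows into (inside_start, outside_start, length) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            rows.append((int(fields[0]), int(fields[1]), int(fields[2])))
    return rows

def mapped_ids(text: str) -> int:
    """Total number of IDs mapped into the namespace."""
    return sum(length for _, _, length in parse_uid_map(text))

# Hypothetical mapping: container IDs 0..65535 mapped to host IDs starting at 131072.
sample = "         0     131072      65536\n"
print(parse_uid_map(sample), mapped_ids(sample))

# Inside a pod you would read the real file instead:
# with open("/proc/self/uid_map") as f:
#     print(mapped_ids(f.read()))
```

A total of 65536 would match the default discussed here, while a process in the initial user namespace typically shows a single row covering the whole ID space.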
@mueckinger when running Sysbox 0.6.7 with containerd 2+, I noticed that the /proc/sys/fs/binfmt_misc mount is somehow read-only, which is not the case for Sysbox 0.6.7 with CRI-O 1.33, where the mount is rw. Do you see the same?
@DekusDenial, I can confirm the folder is read-only.
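For others who want to compare the two runtimes, here is a rough Python sketch that checks whether a mount point is ro or rw based on /proc/mounts-style text (device, mount point, filesystem type, options, ...). The function names and both sample lines are illustrative only, not captured from a real Sysbox container.

```python
from typing import Optional

def mount_options(mounts_text: str, mount_point: str) -> Optional[list[str]]:
    """Return the options of the last mount at mount_point, or None if not mounted."""
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            result = fields[3].split(",")  # later entries are stacked on top of earlier ones
    return result

def is_read_only(mounts_text: str, mount_point: str) -> Optional[bool]:
    opts = mount_options(mounts_text, mount_point)
    return None if opts is None else "ro" in opts

TARGET = "/proc/sys/fs/binfmt_misc"
# Illustrative lines only; the device and option values are assumptions.
ro_sample = "binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc ro,nosuid,nodev,noexec,relatime 0 0\n"
rw_sample = "binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0\n"

print(is_read_only(ro_sample, TARGET), is_read_only(rw_sample, TARGET))

# Inside the container:
# with open("/proc/mounts") as f:
#     print(is_read_only(f.read(), TARGET))
```

Running the commented-out part in a containerd pod and in a CRI-O pod should make the difference easy to attach to this issue.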