finch pull images failed

finch pull --platform=amd64 xxx

FATA[1167] failed to extract layer sha256:9cc8d31519b533c03cd8347147f9ea0b9bfbda4650200d388a1495a34812283f: mount callback failed on /var/lib/containerd/tmpmounts/containerd-mount3705620677: failed to Lchown "/var/lib/containerd/tmpmounts/containerd-mount3705620677/kubeflow/src" for UID 29511686, GID 1085706827: lchown /var/lib/containerd/tmpmounts/containerd-mount3705620677/kubeflow/src: invalid argument (Hint: try increasing the number of subordinate IDs in /etc/subuid and /etc/subgid): unknown
FATA[1168] exit status 1

Dec 02 '22 03:12 haozi4263

Is this image public/shareable? This looks like an image that uses extremely large UIDs and/or GIDs, which when running rootless (or simply via a runtime with user namespaces enabled) means you have exhausted the (standard 2^16) ~65k range of UIDs/GIDs used to map filesystem ownership. I expect this image will not run on any rootless/user namespace-enabled container runtime, unless the /etc/sub{u,g}id files are created which allow a significant range of subordinate IDs to be used within containers.

I'm not quite sure what the value of using IDs in the very high range is (that UID is above 2^24, and the GID is larger still, above 2^30), but if you own the image, I would be curious why the owner and group need such extremely large integers.
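To make the range argument concrete, here is a minimal sketch that compares the two IDs from the error message against the default allocation of 2^16 subordinate IDs. The helper name exceeds_range is hypothetical, and the sketch assumes the common rootless layout in which container ID 0 maps to the host user and container IDs 1..count map onto the subordinate range.

```shell
#!/bin/sh
# Default number of subordinate IDs (2^16), as discussed above.
default_count=$((1 << 16))

# exceeds_range ID COUNT (hypothetical helper): succeeds when ID lies beyond
# what COUNT subordinate IDs can map.
# Assumption: container ID 0 -> host user, container IDs 1..COUNT -> subordinate range.
exceeds_range() {
  [ "$1" -gt "$2" ]
}

# The UID and GID reported in the lchown error.
for id in 29511686 1085706827; do
  if exceeds_range "$id" "$default_count"; then
    echo "ID $id cannot be mapped with $default_count subordinate IDs"
  fi
done
```

Under that assumption, both IDs fall far outside the default range, which matches the "invalid argument" returned by lchown.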

Dec 02 '22 05:12 estesp

The image is public: ccr.ccs.tencentyun.com/cube-studio/kubeflow-dashboard:2022.09.01. Pulling it with docker works: docker pull ccr.ccs.tencentyun.com/cube-studio/kubeflow-dashboard:2022.09.01

Dec 02 '22 09:12 haozi4263

Reproduced in Finch.

FATA[0125] failed to extract layer sha256:9cc8d31519b533c03cd8347147f9ea0b9bfbda4650200d388a1495a34812283f: mount callback failed on /var/lib/containerd/tmpmounts/containerd-mount3084210000: failed to Lchown "/var/lib/containerd/tmpmounts/containerd-mount3084210000/kubeflow/src" for UID 29511686, GID 1085706827: lchown /var/lib/containerd/tmpmounts/containerd-mount3084210000/kubeflow/src: invalid argument (Hint: try increasing the number of subordinate IDs in /etc/subuid and /etc/subgid): unknown
FATA[0114] exit status 1

However, it worked with nerdctl built from the v1.0.0 tag, which is the version we are using in Finch. I will continue the investigation.

Dec 05 '22 21:12 ningziwen

It's important to compare nerdctl (or any other runtime tool) running the same way it does inside Finch, which, based on the output, is inside a user namespace ("rootless" mode, specifically). The image shown will probably work on any container runtime that is not running the container within a user namespace (either "rootless" mode or simply a root-created user namespace with a specific range of subordinate UIDs and GIDs). If you use the nerdctl install that sets up rootless on a Linux system, you should be able to reproduce the same issue, unless you use an extremely large subordinate mapping for the ID ranges.

Dec 05 '22 22:12 estesp

Reproduced with nerdctl in the Finch VM shell.

FATA[0139] failed to extract layer sha256:9cc8d31519b533c03cd8347147f9ea0b9bfbda4650200d388a1495a34812283f: mount callback failed on /var/lib/containerd/tmpmounts/containerd-mount1146161846: failed to Lchown "/var/lib/containerd/tmpmounts/containerd-mount1146161846/kubeflow/src" for UID 29511686, GID 1085706827: lchown /var/lib/containerd/tmpmounts/containerd-mount1146161846/kubeflow/src: invalid argument (Hint: try increasing the number of subordinate IDs in /etc/subuid and /etc/subgid): unknown

Dec 06 '22 00:12 ningziwen

Validated that the pull works after extending the subuid and subgid ranges.

[ningziwe@lima-finch ningziwe]$ cat /etc/subuid
ningziwe:100000:29700000
[ningziwe@lima-finch ningziwe]$ cat /etc/subgid
ningziwe:100000:1085800000
[ningziwe@lima-finch ningziwe]$
logout
➜  ~ finch pull ccr.ccs.tencentyun.com/cube-studio/kubeflow-dashboard:2022.09.01
...
elapsed: 339.7s                                                                   total:  942.4  (2.8 MiB/s)
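A quick way to check whether an entry like the ones above is large enough is sketched below. The helper name covers_id is hypothetical, and the check assumes that container ID 0 maps to the host user while container IDs 1..count map onto the subordinate range, so an ID is usable when it does not exceed the count (the third field). The demo uses a temporary file holding the subuid line shown above rather than the live /etc/subuid.

```shell
#!/bin/sh
# covers_id FILE USER ID (hypothetical helper): succeeds if USER has an entry
# in FILE (format user:start:count) whose count is at least ID.
# Assumption: container IDs 1..count map onto the subordinate range.
covers_id() {
  awk -F: -v user="$2" -v id="$3" '
    $1 == user && (id + 0) <= ($3 + 0) { ok = 1 }
    END { exit(ok ? 0 : 1) }
  ' "$1"
}

# Demo against a temporary copy of the subuid entry from the transcript.
tmp=$(mktemp)
printf '%s\n' 'ningziwe:100000:29700000' > "$tmp"
if covers_id "$tmp" ningziwe 29511686; then
  echo "count 29700000 covers UID 29511686"
fi
rm -f "$tmp"
```

To check the real configuration, the same function can be pointed at /etc/subuid and /etc/subgid inside the VM shell.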

Workaround:

# Log in to the VM shell
LIMA_HOME=/Applications/Finch/lima/data /Applications/Finch/lima/bin/limactl shell finch

# In the VM shell, increase the ID count (the third field) in /etc/subuid and /etc/subgid
sudo vi /etc/subuid
sudo vi /etc/subgid

# Log out of the VM shell and restart the Finch VM
finch vm stop
finch vm start

# Try to pull the image again
finch pull ccr.ccs.tencentyun.com/cube-studio/kubeflow-dashboard:2022.09.01
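For anyone who prefers not to edit the files by hand in vi, the following is a rough sketch of previewing the change with awk first. The helper name with_count is hypothetical; it only prints the edited contents and writes nothing back, so the result still has to be applied manually (for example with sudo) after review. The counts are the ones used in the validation above.

```shell
#!/bin/sh
# with_count FILE USER COUNT (hypothetical helper): print FILE with the count
# (third field) of USER's entry replaced by COUNT. The file is not modified.
with_count() {
  awk -F: -v OFS=: -v user="$2" -v count="$3" '
    $1 == user { $3 = count }
    { print }
  ' "$1"
}

# Preview the new contents inside the VM shell, if the files are readable.
if [ -r /etc/subuid ]; then
  with_count /etc/subuid "$(id -un)" 29700000
fi
if [ -r /etc/subgid ]; then
  with_count /etc/subgid "$(id -un)" 1085800000
fi
```

Printing instead of editing in place keeps the sketch safe to run; entries for other users pass through unchanged.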

Dec 06 '22 20:12 ningziwen

As @estesp mentioned, the root cause is that the image uses extremely large UIDs/GIDs, while Finch allocates only 65536 subordinate IDs by default.

I found a relevant issue in k8s. According to that issue, 65536 is the default number of subordinate UIDs/GIDs on most distributions, and the issue itself is about fixing the extremely large UID/GID on the image side.

I suggest referring to that issue and checking whether the UID/GID in your image should or could be adjusted.

If you find it necessary to use images with extremely large UIDs/GIDs, please elaborate on the use case here. We can discuss making subuid/subgid configurable if the use case is justified.

Dec 06 '22 21:12 ningziwen

The large UID/GID issue was resolved by switching to rootful containers inside the VM. https://github.com/runfinch/finch/issues/196

Mar 03 '23 22:03 ningziwen