zombie containerd-shim processes
$ docker pull docker:20-dind
20-dind: Pulling from library/docker
Digest: sha256:4e1e22f471afc7ed5e024127396f56db392c1b6fc81fc0c05c0e072fb51909fe
Status: Image is up to date for docker:20-dind
docker.io/library/docker:20-dind
$ docker run -dit --privileged --name test docker:20-dind dockerd
1ee25dc98bf4bc5e232abe27a9e651b18cbfb8b3f6ca981c3ae64c894584e7b4
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 dockerd
33 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
154 root 0:00 ps faux
$ docker exec test docker run --rm tianon/true
Unable to find image 'tianon/true:latest' locally
latest: Pulling from tianon/true
c53fb220cbad: Pulling fs layer
c53fb220cbad: Verifying Checksum
c53fb220cbad: Download complete
c53fb220cbad: Pull complete
Digest: sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b
Status: Downloaded newer image for tianon/true:latest
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 dockerd
33 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
220 root 0:00 [containerd-shim]
294 root 0:00 ps faux
$ docker exec test docker run --rm tianon/true
$ docker exec test docker run --rm tianon/true
$ docker exec test docker run --rm tianon/true
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 dockerd
33 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
220 root 0:00 [containerd-shim]
331 root 0:00 [containerd-shim]
429 root 0:00 [containerd-shim]
529 root 0:00 [containerd-shim]
600 root 0:00 ps faux
If I run the same test with --init or ... docker:20-dind docker-init dockerd, there are no zombies.
I think this is technically a bug in containerd, because I can reproduce with bare containerd as pid1 as well, but it doesn't seem quite the same as https://github.com/containerd/containerd/issues/5708 (although perhaps related).
cc @thaJeztah @cpuguy83
$ docker run -dit --privileged --name test --volume /var/lib/containerd docker:20-dind containerd
2fa1f7a0b543808572a7a2da7ad28fd165d783f1ac8f3e9c59ebb30417f43b9f
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 containerd
44 root 0:00 ps faux
$ docker exec test ctr i pull docker.io/tianon/true:latest
...
$ docker exec test ctr run --rm docker.io/tianon/true:latest foo
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 containerd
110 root 0:00 [containerd-shim]
152 root 0:00 ps faux
The simplest "fix" (workaround) for this repository is something like adjusting ENTRYPOINT ["dockerd-entrypoint.sh"] to ENTRYPOINT ["docker-init", "dockerd-entrypoint.sh"].
(If you don't trust our entrypoint script [which, fair], you can also reproduce just the same with --entrypoint dockerd :sweat_smile:)
Temporary workaround is up in https://github.com/docker-library/docker/pull/319 (to just throw docker-init on top of dockerd).
Did you open a ticket in containerd as well (if the existing ones don't match this scenario)?
I didn't file an issue there yet, but I've commented at https://github.com/containerd/containerd/issues/5708#issuecomment-883780174 now (because it feels way too similar to be coincidence, IMO).
Quoting https://github.com/containerd/containerd/issues/5708#issuecomment-884998021 here for posterity:
I'm facing something that seems really closely related (and IMO it doesn't feel like it can be pure coincidence), although maybe not exactly the same? When running Docker in Docker (or even just raw containerd-in-Docker), I'm seeing 100% reliable behavior where every invocation of a container ends up in a containerd-shim zombie, and it goes away if I run the container with tini as pid1 instead:
$ docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
2fa1f7a0b543808572a7a2da7ad28fd165d783f1ac8f3e9c59ebb30417f43b9f
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 containerd
44 root 0:00 ps faux
$ docker exec test ctr i pull docker.io/tianon/true:latest
...
$ docker exec test ctr run --rm docker.io/tianon/true:latest foo
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 containerd
110 root 0:00 [containerd-shim]
152 root 0:00 ps faux
$ docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd --init docker:20-dind
5d2d6ac195d6fdbb0646b6df8d64de3ac00c4ae3fc0dce62bdd8eb59ac20a322
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 /sbin/docker-init -- containerd
8 root 0:00 containerd
32 root 0:00 ps faux
$ docker exec test ctr i pull docker.io/tianon/true:latest
...
$ docker exec test ctr run --rm docker.io/tianon/true:latest foo
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 /sbin/docker-init -- containerd
8 root 0:00 containerd
142 root 0:00 ps faux
(See also docker-library/docker#318.)
@tianon ctr currently uses containerd-shim-runc-v2 by default. The shim v2 binary re-execs itself to start the long-running shim server, so the parent PID of the running shim server ends up being 1. However, containerd does not act as a reaper for those exited child processes, which is why the shim is left as a zombie in dind.
And when io.containerd.runtime.v1.linux is used as the runtime, the runtime will call containerd to publish the exit event.
https://github.com/containerd/containerd/blob/a963242f78c8a05967dfe050cab1016ac7aeabee/cmd/containerd-shim/main_unix.go#L287-L318
But ctr run will delete the task once the task has stopped.
https://github.com/containerd/containerd/blob/a963242f78c8a05967dfe050cab1016ac7aeabee/runtime/v1/shim/service.go#L509-L541
The p.SetExited(e.Status) call notifies ctr that the task has exited, so both the task.Delete in ctr and the event publish action are handled at the same time. containerd then forcibly kills the shim, so the containerd process created by the shim is left as a zombie.
➜ vagrant docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
82f541cbb604077d99f76da45d9b866e03de577ffb209bf88b437e41ddca8440
➜ vagrant docker exec test ctr i pull docker.io/tianon/true:latest > /dev/null
➜ vagrant docker exec test ctr run --runtime io.containerd.runtime.v1.linux docker.io/tianon/true:latest foo
➜ vagrant docker exec test ps -ef
PID USER TIME COMMAND
1 root 0:00 containerd
107 root 0:00 [containerd]
122 root 0:00 ps -ef
If you run the foo container in detached mode, the shim will reap that containerd command.
➜ vagrant docker run -dit --privileged --name test --volume /var/lib/containerd --entrypoint containerd docker:20-dind
97243d2c9667a246827a07eca736f666dc9f0864744f532fb7bf16f7d80dda08
➜ vagrant docker exec test ctr i pull docker.io/tianon/true:latest > /dev/null
➜ vagrant docker exec test ctr run -d --runtime io.containerd.runtime.v1.linux docker.io/tianon/true:latest foo
➜ vagrant docker exec test ps -ef
PID USER TIME COMMAND
1 root 0:00 containerd
74 root 0:00 containerd-shim -namespace default -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/default/foo -address /run/containerd/containerd.sock -containerd-binary /usr/local/bin/containerd
112 root 0:00 ps -ef
➜ vagrant docker exec test ctr c rm foo
➜ vagrant docker exec test ps -ef
PID USER TIME COMMAND
1 root 0:00 containerd
140 root 0:00 ps -ef
FWIW, I can still reproduce (using --entrypoint this time to avoid #319): :disappointed:
$ docker run -dit --privileged --name test --entrypoint dockerd --pull=always docker:dind
dind: Pulling from library/docker
Digest: sha256:a7a9383d0631b5f6b59f0a8138912d20b63c9320127e3fb065cb9ca0257a58b2
Status: Downloaded newer image for docker:dind
41749ef585c457ff1e737f7ef2efc6ac8d3395219a6526c25f042c31bc43ca01
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 dockerd
22 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
138 root 0:00 ps faux
$ docker exec test docker run --rm tianon/true
Unable to find image 'tianon/true:latest' locally
latest: Pulling from tianon/true
c53fb220cbad: Pulling fs layer
c53fb220cbad: Download complete
c53fb220cbad: Pull complete
Digest: sha256:009cce421096698832595ce039aa13fa44327d96beedb84282a69d3dbcf5a81b
Status: Downloaded newer image for tianon/true:latest
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 dockerd
22 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
196 root 0:00 [containerd-shim]
270 root 0:00 ps faux
$ docker exec test docker run --rm tianon/true
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 dockerd
22 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml --log-level info
196 root 0:00 [containerd-shim]
303 root 0:00 [containerd-shim]
376 root 0:00 ps faux
$ docker exec test docker version
Client:
Version: 20.10.17
API version: 1.41
Go version: go1.17.11
Git commit: 100c701
Built: Mon Jun 6 22:56:42 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.17
API version: 1.41 (minimum version 1.12)
Go version: go1.17.11
Git commit: a89b842
Built: Mon Jun 6 23:01:45 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.6.6
GitCommit: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
runc:
Version: 1.1.2
GitCommit: v1.1.2-0-ga916309f
docker-init:
Version: 0.19.0
GitCommit: de40ad0
Coming back a year later to ring the bell again: :sob:
$ docker run -dit --privileged --name test --entrypoint dockerd --pull=always docker:dind
dind: Pulling from library/docker
Digest: sha256:87d892c14d2b755ac4e8268b21e8c8a7ff7f44b52753e265b7a300d2fa065d50
Status: Image is up to date for docker:dind
99217162d401fa0c9785053345702d946c7e5fb241be3a6faf84dfb4056a13ce
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 dockerd
23 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml
189 root 0:00 ps faux
$ docker exec test docker run --rm tianon/true
Unable to find image 'tianon/true:latest' locally
latest: Pulling from tianon/true
4e30b577f37b: Pulling fs layer
4e30b577f37b: Verifying Checksum
4e30b577f37b: Download complete
4e30b577f37b: Pull complete
Digest: sha256:45b95352fad44acee2c35a4ddc2205b61448b1daf2ba2c949b7136582446e682
Status: Downloaded newer image for tianon/true:latest
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 dockerd
23 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml
248 root 0:00 [containerd-shim]
316 root 0:00 ps faux
$ docker exec test docker run --rm tianon/true
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 dockerd
23 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml
248 root 0:00 [containerd-shim]
346 root 0:00 [containerd-shim]
411 root 0:00 ps faux
$ docker exec test docker version
Client:
Version: 27.0.2
API version: 1.46
Go version: go1.21.11
Git commit: 912c1dd
Built: Wed Jun 26 18:46:21 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 27.0.2
API version: 1.46 (minimum version 1.24)
Go version: go1.21.11
Git commit: e953d76
Built: Wed Jun 26 18:47:59 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.7.18
GitCommit: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
runc:
Version: 1.1.13
GitCommit: v1.1.13-0-g58aa920
docker-init:
Version: 0.19.0
GitCommit: de40ad0
@tianon does the same happen with docker 26.1 with the same containerd version, or only 27.0? (I know we updated to containerd 1.7, but I think the DIND image already had it?)
cc @dmcgowan
It's been a long time, so I'm not sure whether this is the same condition. I changed the host kernel from a realtime one to a generic one, and then the problem was solved.
Yes, 26 is also affected:
$ docker run -dit --privileged --name test --entrypoint dockerd --pull=always docker:26-dind
26-dind: Pulling from library/docker
Digest: sha256:dfaffff209798d9efe4ec07243d172ba8706918859c87869656a5d3091df44bb
Status: Image is up to date for docker:26-dind
94ddbbe9823bad23454556b690c854e6ac8b7e06adc71095676d7ccf2c7ef9d2
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 dockerd
26 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml
163 root 0:00 ps faux
$ docker exec test docker run --rm tianon/true
Unable to find image 'tianon/true:latest' locally
latest: Pulling from tianon/true
4e30b577f37b: Pulling fs layer
4e30b577f37b: Verifying Checksum
4e30b577f37b: Download complete
4e30b577f37b: Pull complete
Digest: sha256:45b95352fad44acee2c35a4ddc2205b61448b1daf2ba2c949b7136582446e682
Status: Downloaded newer image for tianon/true:latest
$ docker exec test docker run --rm tianon/true
$ docker exec test ps faux
PID USER TIME COMMAND
1 root 0:00 dockerd
26 root 0:00 containerd --config /var/run/docker/containerd/containerd.toml
197 root 0:00 [containerd-shim]
277 root 0:00 [containerd-shim]
336 root 0:00 ps faux
$ docker exec test docker version
Client:
Version: 26.1.4
API version: 1.45
Go version: go1.21.11
Git commit: 5650f9b
Built: Wed Jun 5 11:27:57 2024
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 26.1.4
API version: 1.45 (minimum version 1.24)
Go version: go1.21.11
Git commit: de5c9cf
Built: Wed Jun 5 11:29:25 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.7.18
GitCommit: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
runc:
Version: 1.1.12
GitCommit: v1.1.12-0-g51d5e94
docker-init:
Version: 0.19.0
GitCommit: de40ad0
This isn't specific to the way dockerd runs/supervises containerd either:
$ docker run -dit --rm --name test --privileged --pull=always tianon/containerd:rc
rc: Pulling from tianon/containerd
Digest: sha256:bc0d7e7f36b2963769c4924a11bf1da09f501cbccdc7cb8c2f5d011d0d066440
Status: Image is up to date for tianon/containerd:rc
9f2cb8622b6ac98c90a0d2fbe325993199d71f5c469941a7c2117492c1d8ad12
$ docker exec test ctr i pull docker.io/tianon/true:latest > /dev/null
$ docker exec test ctr run --rm docker.io/tianon/true:latest test
$ docker exec test ctr run --rm docker.io/tianon/true:latest test
$ docker exec test ctr run --rm docker.io/tianon/true:latest test
$ # "tianon/containerd" doesn't have "ps" and I can't convince "docker top" to show zombies 🙈
$ docker run --rm --pid container:test bash ps faux
PID USER TIME COMMAND
1 root 0:00 containerd
91 root 0:00 [containerd-shim]
166 root 0:00 [containerd-shim]
248 root 0:00 [containerd-shim]
299 root 0:00 ps faux
$ docker exec test ctr version
Client:
Version: v2.0.0-rc.3
Revision: 27de5fea738a38345aa1ac7569032261a6b1e562
Go version: go1.22.4
Server:
Version: v2.0.0-rc.3
Revision: 27de5fea738a38345aa1ac7569032261a6b1e562
UUID: 46bfcb40-716f-46fb-8887-6010373bed51
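Since "ps" is not always available in these images, a rough alternative is to read /proc directly. The following is a sketch under assumptions rather than a tested tool: list_zombies is a hypothetical helper name, and it assumes a POSIX sh plus a Linux /proc for the PID namespace you want to inspect (for example from a helper container started with --pid container:test, as in the transcript above).

```sh
#!/bin/sh
# Sketch only: hypothetical helper, assuming Linux /proc and a POSIX shell.

# list_zombies
# Prints "PID PPID COMM" for every process currently in state Z.
list_zombies() {
    for stat in /proc/[0-9]*/stat; do
        # A process may exit between the glob and the read; skip it if so.
        line=$(cat "$stat" 2>/dev/null) || continue
        pid=${line%% *}
        # comm sits between the first "(" and the last ")" and may hold spaces.
        comm=${line#*"("}
        comm=${comm%")"*}
        # After the last ") " the fields are: state ppid ...
        rest=${line##*") "}
        state=${rest%% *}
        rest=${rest#* }
        ppid=${rest%% *}
        if [ "$state" = "Z" ]; then
            printf '%s %s %s\n' "$pid" "$ppid" "$comm"
        fi
    done
}
```

Piping the output through something like `grep containerd-shim` or counting lines with `wc -l` would give a quick before/after comparison when testing a workaround such as --init.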