fluentd log driver as the cause: docker stop <image-container> returns on host A but hangs on host B, leaving docker in a corrupted state
- [x] This is a bug report
- [ ] This is a feature request
- [x] I searched existing issues before opening this one (#397 is not the best match even though inspect hangs there too)
Expected behavior
Given
- an image based on Alpine 3.15.0 (https://alpinelinux.org/posts/Alpine-3.15.0-released.html) and
- two hosts with the same kernel and the same docker:
- Linux q19 3.10.0-1160.21.1.el7.x86_64 #1 SMP Mon Feb 22 18:03:13 EST 2021 x86_64 x86_64 x86_64 GNU/Linux
- Docker version 19.03.5, build 633a0ea
- containerd.io-1.3.7-3.1.el7.x86_64
containers of that image should work on both hosts or on neither.
Actual behavior
Depending on the host, the exec succeeds (host A) or fails (host B).
Steps to reproduce the behavior
Execute
docker exec -it <container-of-the-image> /bin/echo Q
on both hosts.
I have no clue why host B fails, so I am sharing strace output from both A and B in the hope that it gives a clue.
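The traces were captured by wrapping the client call in strace, along these lines (the exact flags may have differed):
strace docker exec -it <container-of-the-image> /bin/echo Q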
strace reported:
host A
execve("/usr/bin/docker", ["docker", "exec", "-it", "32330080d832", "/bin/echo", "Q"], 0x7ffe0ea56e48 /* 23 vars */) = 0
brk(NULL) = 0x562b36219000
... snip, nothing particularly helpful here ...
ioctl(0, SNDCTL_TMR_START or TCSETS, {B38400 opost isig icanon echo ...}) = 0
futex(0xc000282148, FUTEX_WAKE_PRIVATE, 1) = 1
write(3, "GET /v1.40/exec/a99c357d2a264c1c"..., 155) = 155
futex(0xc0000ae848, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x562b34b27d88, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
+++ exited with 0 +++
host B
execve("/usr/bin/docker", ["docker", "exec", "-it", "43e0fdccad0e", "/bin/echo", "Q"], 0x7fff830cf818 /* 21 vars */) = 0
brk(NULL) = 0x5563f5382000
... snip ...
write(3, "GET /v1.40/containers/43e0fdccad"..., 109) = 109
epoll_pwait(4, [{EPOLLOUT, {u32=2668982024, u64=139687890349832}}], 128, 0, NULL, 824636239372) = 1
epoll_pwait(4,
^Cstrace: Process 27465 detached
<detached ...>
Output of docker version:
Docker version 19.03.5, build 633a0ea
Output of docker info:
host A
Client:
Debug Mode: false
Server:
Containers: 10
Running: 10
Paused: 0
Stopped: 0
Images: 66
Server Version: 19.03.5
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-1160.21.1.el7.x86_64
Operating System: Red Hat Enterprise Linux
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 23.37GiB
Name: t6
ID: NKHU:CQYD:PYYH:UNTL:CGCI:K4S4:LL24:DM73:3U6O:DFWQ:7ZXD:QAWF
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
host B
Client:
Debug Mode: false
Server:
Containers: 8
Running: 6
Paused: 0
Stopped: 2
Images: 20
Server Version: 19.03.5
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-1160.21.1.el7.x86_64
Operating System: Red Hat Enterprise Linux
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.51GiB
Name: q19
ID: CYVK:ASCG:MU6H:IU7C:YQ7P:KE6L:QX3N:2HPZ:ONNN:UKKE:65GY:NTHU
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.)
VMs
The container of the image started normally:
[root@q19 ~ ]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
43e0fdccad0e .../eesb-monitoring-agent-keep:3.0.1 "sh -c 'java ${JAVA_…" 5 hours ago Up 5 hours 0.0.0.0:32000->8080/tcp uat-itaag20-hmb01-eesb-monitoring-agent-keep-01
but subsequent docker stop / exec / inspect hung on one of the hosts (B) only.
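For reference, the kind of invocations that hang on host B (container ID taken from the docker ps output above):
docker stop 43e0fdccad0e
docker exec -it 43e0fdccad0e /bin/echo Q
docker inspect 43e0fdccad0e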
Another observation: on the problematic host the process tree shows
- 5 docker container processes,
- 6 docker-proxy processes, and
- 5 containerd-shim processes:
├─docker
│ ├─ff23c8be949c3ba2c14de3208b9388a629bec877ddb96645f6c190cddbf58051
│ │ └─14309 java ... -jar eesb-message-transfer-solution.jar ${@}
│ ├─0c5050bace8478454bc8b531615a8d23f11feabbb3e5ab2b0433aa80de7f5f79
│ │ └─14290 java -jar eesb-message-tester.jar ${@}
│ ├─e3562f77421677aad5985852375e614759367347f09367f76e04d4b9aff005c0
│ │ └─14257 java -jar eesb-consul-synchronizer.jar
│ ├─b38cdb30287cd597a17fe5616c1730f206d6301a03ce0d34de7f9a10bcada50d
│ │ └─14231 java ... -jar eesb-file-transfer-solution.jar
│ └─d48c42642090ae062427d81723a3701d45db7febd82d843c05f3e1096a379b16
│ ├─14146 /usr/bin/dumb-init /bin/sh /usr/local/bin/docker-entrypoint.sh agent -server
│ └─14308 consul agent -data-dir=/consul/data -config-dir=/consul/config -server
...
├─docker.service
│ ├─ 9740 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 32000 -container-ip 172.17.0.7 -container-port 8080
│ ├─13923 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
│ ├─14101 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8630 -container-ip 172.17.0.2 -container-port 8500
│ ├─14131 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 30100 -container-ip 172.17.0.3 -container-port 8080
│ ├─14156 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 30200 -container-ip 172.17.0.4 -container-port 8080
│ ├─14200 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 31400 -container-ip 172.17.0.5 -container-port 8080
│ └─14249 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 30000 -container-ip 172.17.0.6 -container-port 8080
....
├─containerd.service
│ ├─ 3193 /usr/bin/containerd
│ ├─14112 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/d48c42642090ae062427d81723a3701d45db7febd82d843c05f3e1096a379b16 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
│ ├─14157 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/b38cdb30287cd597a17fe5616c1730f206d6301a03ce0d34de7f9a10bcada50d -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
│ ├─14180 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/e3562f77421677aad5985852375e614759367347f09367f76e04d4b9aff005c0 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
│ ├─14212 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/0c5050bace8478454bc8b531615a8d23f11feabbb3e5ab2b0433aa80de7f5f79 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
│ └─14270 containerd-shim -namespace moby -workdir /var/lib/containerd/io.containerd.runtime.v1.linux/moby/ff23c8be949c3ba2c14de3208b9388a629bec877ddb96645f6c190cddbf58051 -address /run/containerd/containerd.sock -containerd-binary /usr/bin/containerd -runtime-root /var/run/docker/runtime-runc
....
The observed count inconsistency smells a bit, and indeed the missing container process is the one for the image in question.
The inconsistency between docker ps and reality, including the hanging docker exec/top/stop/inspect, points towards #555.
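A quick way to see the drift in numbers (only a sketch; it counts the same processes that appear in the tree above):
docker ps -q | wc -l        # containers dockerd reports as running
pgrep -c containerd-shim    # shim processes actually alive
pgrep -c docker-proxy       # port proxies still running
When dockerd's view and the actual process tree disagree like this, something has already gone wrong.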
While searching for the root cause I destroyed the repro case.
I discovered that the initially observed docker exec problem is only a follow-up problem: docker starts to misbehave on host B earlier, at the first docker stop <container> for a container of the image.
In the host B log there was:
Feb 1 14:46:26 q19 dockerd: time="2022-02-01T14:46:26.218331451+01:00" level=info msg="Container 442db04451ff10029d686d33770814b836bb01fdd9f22ef626559723e4d6e5d0 failed to exit within 61 seconds of signal 15 - using the force"
The docker stop <container> still had not returned (did the force fail too?), so after 5 minutes I killed it.
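The daemon log line above is from syslog; with docker running as a systemd unit (as here) it can also be pulled with:
journalctl -u docker.service --since "2022-02-01 14:40:00"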
In a further experiment, sending kill -15 <container-pid> directly makes the container terminate instantly. This feeds the hypothesis that the container does terminate, but docker somehow misses the exit event, so it escalates to a force stop and then keeps waiting for an exit event from the already-terminated container.
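The experiment itself, sketched (<container-pid> is the PID of the container's main process; it can be read from the process tree above, or via docker's standard State.Pid template while inspect still responds):
docker inspect -f '{{.State.Pid}}' <container>   # main process PID, if inspect is not hanging yet
kill -15 <container-pid>                         # SIGTERM sent directly, bypassing dockerd
ps -p <container-pid>                            # the process disappears almost immediately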
In a wider search comparing host A and host B for other possible differences/causes, I found that host B could not open a connection to the fluentd server while host A could. I switched the docker log driver from fluentd to json-file and that resolved the problem.
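For completeness, a sketch of both steps; the fluentd host is a placeholder (24224 is only fluentd's default forward port), and the daemon.json change affects only containers created afterwards:
# can this host reach the fluentd server at all?
nc -zv <fluentd-host> 24224
# switch the default log driver in /etc/docker/daemon.json:
#   { "log-driver": "json-file" }
# then restart the daemon and recreate the affected containers
systemctl restart docker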
So far it looks like the fluentd log driver can ruin container operations without leaving meaningful diagnostic traces.
See also https://github.com/moby/moby/issues/43482