Some ingress ports do not open on some nodes
Description
We have 9 workers and 3 managers. After we added a few new servers, they fail to open some ingress ports: not all, but many. Different nodes fail to open different ports. For example:
Port 8888 is open on some nodes and closed on others:
for node in manager-00.swarm manager-01.swarm manager-02.swarm node-0{1..9}.swarm; do echo ${node}; nc -vzw1 ${node} 8888; done
manager-00.swarm
nc: connect to manager-00.swarm (10.10.14.3) port 8888 (tcp) timed out: Operation now in progress
manager-01.swarm
Connection to manager-01.swarm (10.10.14.7) 8888 port [tcp/*] succeeded!
manager-02.swarm
nc: connect to manager-02.swarm (10.10.14.8) port 8888 (tcp) timed out: Operation now in progress
node-01.swarm
nc: connect to node-01.swarm (10.10.15.21) port 8888 (tcp) timed out: Operation now in progress
node-02.swarm
Connection to node-02.swarm (10.10.15.22) 8888 port [tcp/*] succeeded!
node-03.swarm
Connection to node-03.swarm (10.10.15.23) 8888 port [tcp/*] succeeded!
node-04.swarm
Connection to node-04.swarm (10.10.15.24) 8888 port [tcp/*] succeeded!
node-05.swarm
Connection to node-05.swarm (10.10.15.25) 8888 port [tcp/*] succeeded!
node-06.swarm
nc: connect to node-06.swarm (10.10.15.26) port 8888 (tcp) timed out: Operation now in progress
node-07.swarm
Connection to node-07.swarm (10.10.15.27) 8888 port [tcp/*] succeeded!
node-08.swarm
Connection to node-08.swarm (10.10.15.28) 8888 port [tcp/*] succeeded!
node-09.swarm
Connection to node-09.swarm (10.10.15.29) 8888 port [tcp/*] succeeded!
Port 8889 is open on some nodes and closed on others:
for node in manager-00.swarm manager-01.swarm manager-02.swarm node-0{1..9}.swarm; do echo ${node}; nc -vzw1 ${node} 8889; done
manager-00.swarm
nc: connect to manager-00.swarm (10.10.14.3) port 8889 (tcp) timed out: Operation now in progress
manager-01.swarm
Connection to manager-01.swarm (10.10.14.7) 8889 port [tcp/*] succeeded!
manager-02.swarm
Connection to manager-02.swarm (10.10.14.8) 8889 port [tcp/*] succeeded!
node-01.swarm
nc: connect to node-01.swarm (10.10.15.21) port 8889 (tcp) timed out: Operation now in progress
node-02.swarm
Connection to node-02.swarm (10.10.15.22) 8889 port [tcp/*] succeeded!
node-03.swarm
Connection to node-03.swarm (10.10.15.23) 8889 port [tcp/*] succeeded!
node-04.swarm
Connection to node-04.swarm (10.10.15.24) 8889 port [tcp/*] succeeded!
node-05.swarm
Connection to node-05.swarm (10.10.15.25) 8889 port [tcp/*] succeeded!
node-06.swarm
nc: connect to node-06.swarm (10.10.15.26) port 8889 (tcp) timed out: Operation now in progress
node-07.swarm
Connection to node-07.swarm (10.10.15.27) 8889 port [tcp/*] succeeded!
node-08.swarm
Connection to node-08.swarm (10.10.15.28) 8889 port [tcp/*] succeeded!
node-09.swarm
Connection to node-09.swarm (10.10.15.29) 8889 port [tcp/*] succeeded!
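To compare several ports at once, the two loops above can be condensed into one row per node. This is a minimal sketch, assuming the same hostnames and the same `nc` flags as above; `probe` and `port_matrix` are hypothetical helper names, not part of Docker.

```shell
# Hosts and ports taken from the checks above; extend PORTS as needed.
NODES="manager-00.swarm manager-01.swarm manager-02.swarm
node-01.swarm node-02.swarm node-03.swarm node-04.swarm node-05.swarm
node-06.swarm node-07.swarm node-08.swarm node-09.swarm"
PORTS="8888 8889"

# probe HOST PORT: succeeds if the TCP connect completes within 1 second
# (same nc flags as in the loops above, output suppressed).
probe() { nc -zw1 "$1" "$2" >/dev/null 2>&1; }

# port_matrix: print one row per node in the form
# "<host> <port>=open|closed <port>=open|closed ...".
port_matrix() {
  for node in $NODES; do
    row="$node"
    for port in $PORTS; do
      if probe "$node" "$port"; then
        row="$row $port=open"
      else
        row="$row $port=closed"
      fi
    done
    printf '%s\n' "$row"
  done
}
```

Calling `port_matrix` prints the table; rows that contain `closed` are the nodes where the ingress port did not answer.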
I found https://github.com/moby/moby/issues/41775 and tried adding a node with the older Debian 10 and the older Linux kernel 5.10.0-0.deb10.16-amd64, but no luck. I also tried the following kernels: 4.19.0-0.bpo.19-amd64, 6.0.0-0.deb11.6-amd64, 6.1.0-5-amd64.
docker service update --force ServiceName does not fix the problem.
docker service rm ServiceName followed by docker service create ServiceName fixes the problem, but if another new node is added, the already fixed port does not open on it.
Updating a service with docker stack deploy can break the port on all nodes, even where it worked before; this was fixed by running docker service rm and creating the service again.
I ran docker swarm ca --rotate, changed the leader, left only one leader, and tried many other things.
Reproduce
I tried creating a new cluster and deploying 3000 services, but the bug does not reproduce there. I also tried cloning the current swarm as follows: stop docker.service and docker.socket with systemctl on one manager, edit the files /var/lib/docker/swarm/docker-state.json and /var/lib/docker/swarm/state.json, and start dockerd again. Ports that do not open in the old swarm open fine on the clone.
Expected behavior
Ingress ports open correctly on all nodes in the cluster.
docker version
Client: Docker Engine - Community
Version: 23.0.1
API version: 1.42
Go version: go1.19.5
Git commit: a5ee5b1
Built: Thu Feb 9 19:46:54 2023
OS/Arch: linux/amd64
Context: default
Server: Docker Engine - Community
Engine:
Version: 23.0.1
API version: 1.42 (minimum version 1.12)
Go version: go1.19.5
Git commit: bc3805a
Built: Thu Feb 9 19:46:54 2023
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.18
GitCommit: 2456e983eb9e37e47538f59ea18f2043c9a73640
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0
docker info
Client:
Context: default
Debug Mode: false
Plugins:
scan: Docker Scan (Docker Inc.)
Version: v0.23.0
Path: /usr/libexec/docker/cli-plugins/docker-scan
Server:
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 23.0.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: 7wzpf1et0x8bsuc4788p3loxe
Is Manager: true
ClusterID: m0ltw0d0vzw528muqqzsavkcc
Managers: 3
Nodes: 12
Default Address Pool: 10.0.0.0/9 172.16.0.0/12
SubnetSize: 16
Data Path Port: 4789
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Force Rotate: 3
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 10.10.14.8
Manager Addresses:
10.10.14.3:2377
10.10.14.7:2377
10.10.14.8:2377
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 2456e983eb9e37e47538f59ea18f2043c9a73640
runc version: v1.1.4-0-g5fd4c4d
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 5.10.0-21-amd64
Operating System: Debian GNU/Linux 11 (bullseye)
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 1.937GiB
Name: manager-02
ID: JWYA:23DV:TOWZ:RHED:OKBT:UMQM:GXVQ:PMEQ:XTDP:NN2C:T4LW:NALL
Docker Root Dir: /var/lib/docker
Debug Mode: true
File Descriptors: 206
Goroutines: 140
System Time: 2023-03-01T14:54:09.930073165+03:00
EventsListeners: 0
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
registry.swarm:5000
127.0.0.0/8
Live Restore Enabled: false
Additional Info
docker node ls
12 nodes
HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
manager-00 Ready Pause Leader 23.0.1
manager-01 Ready Pause Reachable 23.0.1
manager-02 Ready Pause Reachable 23.0.1
node-01 Ready Active 23.0.1
node-02 Ready Active 20.10.23
node-03 Ready Active 20.10.14
node-04 Ready Active 20.10.14
node-05 Ready Active 20.10.14
node-06 Ready Active 23.0.1
node-07 Ready Active 20.10.17
node-08 Ready Active 20.10.22
node-09 Ready Active 20.10.22
Size of tasks.db on node-01 through node-09
3.0M /var/lib/docker/swarm/worker/tasks.db
6.5M /var/lib/docker/swarm/worker/tasks.db
8.2M /var/lib/docker/swarm/worker/tasks.db
7.7M /var/lib/docker/swarm/worker/tasks.db
8.5M /var/lib/docker/swarm/worker/tasks.db
4.1M /var/lib/docker/swarm/worker/tasks.db
12M /var/lib/docker/swarm/worker/tasks.db
6.7M /var/lib/docker/swarm/worker/tasks.db
7.9M /var/lib/docker/swarm/worker/tasks.db
Kernel on all 12 nodes
Linux manager-00 4.19.0-23-amd64 #1 SMP Debian 4.19.269-1 (2022-12-20) x86_64 GNU/Linux
Linux manager-01 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
Linux manager-02 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
Linux node-01 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
Linux node-02 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64 GNU/Linux
Linux node-03 5.10.0-13-amd64 #1 SMP Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux
Linux node-04 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64 GNU/Linux
Linux node-05 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64 GNU/Linux
Linux node-06 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
Linux node-07 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64 GNU/Linux
Linux node-08 5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux
Linux node-09 5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux
tcpdump on manager-02, port 8888
12:42:18.619660 eth0 In IP 10.10.3.2.53926 > manager-02.swarm.8888: Flags [S], seq 1781046931, win 64240, options [mss 1290,sackOK,TS val 548059484 ecr 0,nop,wscale 7], length 0
12:42:18.619740 docker_gwbridge Out IP 10.10.3.2.53926 > 172.18.0.2.8888: Flags [S], seq 1781046931, win 64240, options [mss 1290,sackOK,TS val 548059484 ecr 0,nop,wscale 7], length 0
12:42:18.619746 veth338361a Out IP 10.10.3.2.53926 > 172.18.0.2.8888: Flags [S], seq 1781046931, win 64240, options [mss 1290,sackOK,TS val 548059484 ecr 0,nop,wscale 7], length 0
tcpdump on manager-02, port 8889
15:25:40.867276 eth0 In IP 10.10.3.2.50536 > manager-02.swarm.8889: Flags [S], seq 4066144061, win 64240, options [mss 1290,sackOK,TS val 557861758 ecr 0,nop,wscale 7], length 0
15:25:40.867371 docker_gwbridge Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [S], seq 4066144061, win 64240, options [mss 1290,sackOK,TS val 557861758 ecr 0,nop,wscale 7], length 0
15:25:40.867378 veth338361a Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [S], seq 4066144061, win 64240, options [mss 1290,sackOK,TS val 557861758 ecr 0,nop,wscale 7], length 0
15:25:40.867887 veth338361a P IP 172.18.0.2.8889 > 10.10.3.2.50536: Flags [S.], seq 3584521344, ack 4066144062, win 43338, options [mss 1410,sackOK,TS val 4078806526 ecr 557861758,nop,wscale 10], length 0
15:25:40.867889 docker_gwbridge In IP 172.18.0.2.8889 > 10.10.3.2.50536: Flags [S.], seq 3584521344, ack 4066144062, win 43338, options [mss 1410,sackOK,TS val 4078806526 ecr 557861758,nop,wscale 10], length 0
15:25:40.867916 eth0 Out IP manager-02.swarm.8889 > 10.10.3.2.50536: Flags [S.], seq 3584521344, ack 4066144062, win 43338, options [mss 1410,sackOK,TS val 4078806526 ecr 557861758,nop,wscale 10], length 0
15:25:40.896284 eth0 In IP 10.10.3.2.50536 > manager-02.swarm.8889: Flags [.], ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896329 docker_gwbridge Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [.], ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896334 veth338361a Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [.], ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896412 eth0 In IP 10.10.3.2.50536 > manager-02.swarm.8889: Flags [F.], seq 1, ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896426 docker_gwbridge Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [F.], seq 1, ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896426 veth338361a Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [F.], seq 1, ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896588 veth338361a P IP 172.18.0.2.8889 > 10.10.3.2.50536: Flags [F.], seq 1, ack 2, win 43, options [nop,nop,TS val 4078806555 ecr 557861787], length 0
15:25:40.896590 docker_gwbridge In IP 172.18.0.2.8889 > 10.10.3.2.50536: Flags [F.], seq 1, ack 2, win 43, options [nop,nop,TS val 4078806555 ecr 557861787], length 0
15:25:40.896597 eth0 Out IP manager-02.swarm.8889 > 10.10.3.2.50536: Flags [F.], seq 1, ack 2, win 43, options [nop,nop,TS val 4078806555 ecr 557861787], length 0
15:25:40.924833 eth0 In IP 10.10.3.2.50536 > manager-02.swarm.8889: Flags [.], ack 2, win 502, options [nop,nop,TS val 557861816 ecr 4078806555], length 0
15:25:40.924892 docker_gwbridge Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [.], ack 2, win 502, options [nop,nop,TS val 557861816 ecr 4078806555], length 0
15:25:40.924897 veth338361a Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [.], ack 2, win 502, options [nop,nop,TS val 557861816 ecr 4078806555], length 0
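In the two traces above, the SYN for 8888 is forwarded to 172.18.0.2 and never answered, while the SYN for 8889 gets a SYN-ACK back. For longer captures, a small filter can list the clients whose SYN got no SYN-ACK. This is only a sketch: `unanswered_syns` is a hypothetical helper name, and it assumes the text format shown above (an `IP` token followed by source, `>`, destination and `Flags [...]`).

```shell
# unanswered_syns: read tcpdump text output on stdin and print every client
# endpoint (ip.port) that sent a SYN but never received a SYN-ACK.
unanswered_syns() {
  awk '
    {
      # Locate the "IP" token; after it come: src ">" dst: "Flags" [flags],
      for (i = 1; i <= NF; i++) if ($i == "IP") break
      if (i > NF) next
      src = $(i + 1); dst = $(i + 3); flags = $(i + 5)
      sub(/:$/, "", dst)     # drop the colon after the destination
      sub(/,$/, "", flags)   # drop the comma after the flags
      if (flags == "[S]")  syn[src] = 1      # client sent a SYN
      if (flags == "[S.]") synack[dst] = 1   # a SYN-ACK went back to the client
    }
    END {
      for (c in syn) if (!(c in synack)) print c
    }
  '
}
```

Usage would look like `unanswered_syns < capture.txt`, where `capture.txt` is a placeholder for saved tcpdump text output. The helper keys on the client endpoint because the destination address changes between interfaces (manager-02.swarm on eth0, 172.18.0.2 on docker_gwbridge).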
What other information should I add? More tcpdump output? Or strace?