
Some ingress ports do not open on some nodes

Open · IgorOhrimenko opened this issue 2 years ago · 1 comment

Description

We have 9 workers and 3 managers. After adding a few new servers, those nodes cannot open some ingress ports; not all of them, but many. Different nodes fail to open different ports. For example:

Port 8888 is open on some nodes but not on others:
for node in manager-00.swarm manager-01.swarm manager-02.swarm node-0{1..9}.swarm; do echo ${node}; nc -vzw1 ${node} 8888; done
manager-00.swarm
nc: connect to manager-00.swarm (10.10.14.3) port 8888 (tcp) timed out: Operation now in progress
manager-01.swarm
Connection to manager-01.swarm (10.10.14.7) 8888 port [tcp/*] succeeded!
manager-02.swarm
nc: connect to manager-02.swarm (10.10.14.8) port 8888 (tcp) timed out: Operation now in progress
node-01.swarm
nc: connect to node-01.swarm (10.10.15.21) port 8888 (tcp) timed out: Operation now in progress
node-02.swarm
Connection to node-02.swarm (10.10.15.22) 8888 port [tcp/*] succeeded!
node-03.swarm
Connection to node-03.swarm (10.10.15.23) 8888 port [tcp/*] succeeded!
node-04.swarm
Connection to node-04.swarm (10.10.15.24) 8888 port [tcp/*] succeeded!
node-05.swarm
Connection to node-05.swarm (10.10.15.25) 8888 port [tcp/*] succeeded!
node-06.swarm
nc: connect to node-06.swarm (10.10.15.26) port 8888 (tcp) timed out: Operation now in progress
node-07.swarm
Connection to node-07.swarm (10.10.15.27) 8888 port [tcp/*] succeeded!
node-08.swarm
Connection to node-08.swarm (10.10.15.28) 8888 port [tcp/*] succeeded!
node-09.swarm
Connection to node-09.swarm (10.10.15.29) 8888 port [tcp/*] succeeded!
Port 8889 is likewise open on some nodes but not on others:
for node in manager-00.swarm manager-01.swarm manager-02.swarm node-0{1..9}.swarm; do echo ${node}; nc -vzw1 ${node} 8889; done
manager-00.swarm
nc: connect to manager-00.swarm (10.10.14.3) port 8889 (tcp) timed out: Operation now in progress
manager-01.swarm
Connection to manager-01.swarm (10.10.14.7) 8889 port [tcp/*] succeeded!
manager-02.swarm
Connection to manager-02.swarm (10.10.14.8) 8889 port [tcp/*] succeeded!
node-01.swarm
nc: connect to node-01.swarm (10.10.15.21) port 8889 (tcp) timed out: Operation now in progress
node-02.swarm
Connection to node-02.swarm (10.10.15.22) 8889 port [tcp/*] succeeded!
node-03.swarm
Connection to node-03.swarm (10.10.15.23) 8889 port [tcp/*] succeeded!
node-04.swarm
Connection to node-04.swarm (10.10.15.24) 8889 port [tcp/*] succeeded!
node-05.swarm
Connection to node-05.swarm (10.10.15.25) 8889 port [tcp/*] succeeded!
node-06.swarm
nc: connect to node-06.swarm (10.10.15.26) port 8889 (tcp) timed out: Operation now in progress
node-07.swarm
Connection to node-07.swarm (10.10.15.27) 8889 port [tcp/*] succeeded!
node-08.swarm
Connection to node-08.swarm (10.10.15.28) 8889 port [tcp/*] succeeded!
node-09.swarm
Connection to node-09.swarm (10.10.15.29) 8889 port [tcp/*] succeeded!

I found https://github.com/moby/moby/issues/41775 and tried adding a node with old Debian 10 and the old Linux kernel 5.10.0-0.deb10.16-amd64, but no luck. I also tried the following kernels: 4.19.0-0.bpo.19-amd64, 6.0.0-0.deb11.6-amd64, 6.1.0-5-amd64.

docker service update --force ServiceName does not fix the problem. docker service rm ServiceName followed by docker service create ServiceName does fix it, but if another new node is added afterwards, that node does not open the already-fixed port. Updating a service via docker stack deploy can break a port on all nodes even if it worked before; that too was fixed by docker service rm and docker service create again.
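For clarity, the recreate workaround looks roughly like this (ServiceName, the port, and IMAGE are placeholders standing in for the real service definition):

# capture the current definition first, e.g. with: docker service inspect ServiceName
docker service rm ServiceName
docker service create --name ServiceName --publish published=8888,target=8888 IMAGE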

I have also done docker swarm ca --rotate, changed the leader, demoted managers down to a single leader, and tried many other things.
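Approximately, those attempts were (node names are examples; demoting the other managers is how the cluster was reduced to a single leader):

docker swarm ca --rotate                    # rotate the swarm root CA
docker node demote manager-01 manager-02    # leave a single manager/leader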

Reproduce

I tried creating a new cluster and deploying 3000 services, but the bug did not reproduce. I also tried cloning the current swarm: stop docker.service and docker.socket with systemctl on one manager, edit the files /var/lib/docker/swarm/docker-state.json and /var/lib/docker/swarm/state.json, then start dockerd again. Ports that do not open in the old swarm open fine on the clone.
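A rough sketch of that clone procedure (the exact edits to the two state files are not reproduced here):

systemctl stop docker.service docker.socket
# edit /var/lib/docker/swarm/docker-state.json and /var/lib/docker/swarm/state.json
systemctl start docker.service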

Expected behavior

Ingress ports open well on all the nodes in the cluster.

docker version

Client: Docker Engine - Community
 Version:           23.0.1
 API version:       1.42
 Go version:        go1.19.5
 Git commit:        a5ee5b1
 Built:             Thu Feb  9 19:46:54 2023
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          23.0.1
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.19.5
  Git commit:       bc3805a
  Built:            Thu Feb  9 19:46:54 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.18
  GitCommit:        2456e983eb9e37e47538f59ea18f2043c9a73640
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 23.0.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: 7wzpf1et0x8bsuc4788p3loxe
  Is Manager: true
  ClusterID: m0ltw0d0vzw528muqqzsavkcc
  Managers: 3
  Nodes: 12
  Default Address Pool: 10.0.0.0/9  172.16.0.0/12
  SubnetSize: 16
  Data Path Port: 4789
  Orchestration:
   Task History Retention Limit: 5
  Raft:
   Snapshot Interval: 10000
   Number of Old Snapshots to Retain: 0
   Heartbeat Tick: 1
   Election Tick: 10
  Dispatcher:
   Heartbeat Period: 5 seconds
  CA Configuration:
   Expiry Duration: 3 months
   Force Rotate: 3
  Autolock Managers: false
  Root Rotation In Progress: false
  Node Address: 10.10.14.8
  Manager Addresses:
   10.10.14.3:2377
   10.10.14.7:2377
   10.10.14.8:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 2456e983eb9e37e47538f59ea18f2043c9a73640
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-21-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 1
 Total Memory: 1.937GiB
 Name: manager-02
 ID: JWYA:23DV:TOWZ:RHED:OKBT:UMQM:GXVQ:PMEQ:XTDP:NN2C:T4LW:NALL
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 206
  Goroutines: 140
  System Time: 2023-03-01T14:54:09.930073165+03:00
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Experimental: false
 Insecure Registries:
  registry.swarm:5000
  127.0.0.0/8
 Live Restore Enabled: false

Additional Info

docker node ls

12 nodes
HOSTNAME        STATUS    AVAILABILITY   MANAGER STATUS   ENGINE VERSION
manager-00      Ready     Pause          Leader           23.0.1
manager-01      Ready     Pause          Reachable        23.0.1
manager-02      Ready     Pause          Reachable        23.0.1
node-01         Ready     Active                          23.0.1
node-02         Ready     Active                          20.10.23
node-03         Ready     Active                          20.10.14
node-04         Ready     Active                          20.10.14
node-05         Ready     Active                          20.10.14
node-06         Ready     Active                          23.0.1
node-07         Ready     Active                          20.10.17
node-08         Ready     Active                          20.10.22
node-09         Ready     Active                          20.10.22

Maybe the db is too heavy?
Size of tasks.db on node-01 through node-09:
3.0M    /var/lib/docker/swarm/worker/tasks.db
6.5M    /var/lib/docker/swarm/worker/tasks.db
8.2M    /var/lib/docker/swarm/worker/tasks.db
7.7M    /var/lib/docker/swarm/worker/tasks.db
8.5M    /var/lib/docker/swarm/worker/tasks.db
4.1M    /var/lib/docker/swarm/worker/tasks.db
12M     /var/lib/docker/swarm/worker/tasks.db
6.7M    /var/lib/docker/swarm/worker/tasks.db
7.9M    /var/lib/docker/swarm/worker/tasks.db
But I created a new swarm, grew tasks.db up to 40M, and the ports still worked.
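For reference, the sizes above can be collected with something like this (hostnames as in the nc loops earlier; assumes root SSH access to each node):

for node in node-0{1..9}.swarm; do ssh ${node} du -sh /var/lib/docker/swarm/worker/tasks.db; done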

uname -a for all 12 nodes:
Linux manager-00 4.19.0-23-amd64 #1 SMP Debian 4.19.269-1 (2022-12-20) x86_64 GNU/Linux
Linux manager-01 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
Linux manager-02 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
Linux node-01 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
Linux node-02 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64 GNU/Linux
Linux node-03 5.10.0-13-amd64 #1 SMP Debian 5.10.106-1 (2022-03-17) x86_64 GNU/Linux
Linux node-04 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64 GNU/Linux
Linux node-05 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64 GNU/Linux
Linux node-06 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
Linux node-07 5.10.0-14-amd64 #1 SMP Debian 5.10.113-1 (2022-04-29) x86_64 GNU/Linux
Linux node-08 5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux
Linux node-09 5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64 GNU/Linux
Only manager-00 runs Debian GNU/Linux 10 (buster); the other nodes run Debian GNU/Linux 11 (bullseye).

tcpdump for the broken port (8888). The SYN is forwarded through docker_gwbridge to 172.18.0.2, but no SYN-ACK ever comes back:
12:42:18.619660 eth0  In  IP 10.10.3.2.53926 > manager-02.swarm.8888: Flags [S], seq 1781046931, win 64240, options [mss 1290,sackOK,TS val 548059484 ecr 0,nop,wscale 7], length 0
12:42:18.619740 docker_gwbridge Out IP 10.10.3.2.53926 > 172.18.0.2.8888: Flags [S], seq 1781046931, win 64240, options [mss 1290,sackOK,TS val 548059484 ecr 0,nop,wscale 7], length 0
12:42:18.619746 veth338361a Out IP 10.10.3.2.53926 > 172.18.0.2.8888: Flags [S], seq 1781046931, win 64240, options [mss 1290,sackOK,TS val 548059484 ecr 0,nop,wscale 7], length 0

tcpdump for the working port (8889). Here the handshake completes normally:
15:25:40.867276 eth0  In  IP 10.10.3.2.50536 > manager-02.swarm.8889: Flags [S], seq 4066144061, win 64240, options [mss 1290,sackOK,TS val 557861758 ecr 0,nop,wscale 7], length 0
15:25:40.867371 docker_gwbridge Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [S], seq 4066144061, win 64240, options [mss 1290,sackOK,TS val 557861758 ecr 0,nop,wscale 7], length 0
15:25:40.867378 veth338361a Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [S], seq 4066144061, win 64240, options [mss 1290,sackOK,TS val 557861758 ecr 0,nop,wscale 7], length 0
15:25:40.867887 veth338361a P   IP 172.18.0.2.8889 > 10.10.3.2.50536: Flags [S.], seq 3584521344, ack 4066144062, win 43338, options [mss 1410,sackOK,TS val 4078806526 ecr 557861758,nop,wscale 10], length 0
15:25:40.867889 docker_gwbridge In  IP 172.18.0.2.8889 > 10.10.3.2.50536: Flags [S.], seq 3584521344, ack 4066144062, win 43338, options [mss 1410,sackOK,TS val 4078806526 ecr 557861758,nop,wscale 10], length 0
15:25:40.867916 eth0  Out IP manager-02.swarm.8889 > 10.10.3.2.50536: Flags [S.], seq 3584521344, ack 4066144062, win 43338, options [mss 1410,sackOK,TS val 4078806526 ecr 557861758,nop,wscale 10], length 0
15:25:40.896284 eth0  In  IP 10.10.3.2.50536 > manager-02.swarm.8889: Flags [.], ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896329 docker_gwbridge Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [.], ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896334 veth338361a Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [.], ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896412 eth0  In  IP 10.10.3.2.50536 > manager-02.swarm.8889: Flags [F.], seq 1, ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896426 docker_gwbridge Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [F.], seq 1, ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896426 veth338361a Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [F.], seq 1, ack 1, win 502, options [nop,nop,TS val 557861787 ecr 4078806526], length 0
15:25:40.896588 veth338361a P   IP 172.18.0.2.8889 > 10.10.3.2.50536: Flags [F.], seq 1, ack 2, win 43, options [nop,nop,TS val 4078806555 ecr 557861787], length 0
15:25:40.896590 docker_gwbridge In  IP 172.18.0.2.8889 > 10.10.3.2.50536: Flags [F.], seq 1, ack 2, win 43, options [nop,nop,TS val 4078806555 ecr 557861787], length 0
15:25:40.896597 eth0  Out IP manager-02.swarm.8889 > 10.10.3.2.50536: Flags [F.], seq 1, ack 2, win 43, options [nop,nop,TS val 4078806555 ecr 557861787], length 0
15:25:40.924833 eth0  In  IP 10.10.3.2.50536 > manager-02.swarm.8889: Flags [.], ack 2, win 502, options [nop,nop,TS val 557861816 ecr 4078806555], length 0
15:25:40.924892 docker_gwbridge Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [.], ack 2, win 502, options [nop,nop,TS val 557861816 ecr 4078806555], length 0
15:25:40.924897 veth338361a Out IP 10.10.3.2.50536 > 172.18.0.2.8889: Flags [.], ack 2, win 502, options [nop,nop,TS val 557861816 ecr 4078806555], length 0
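
The captures above can be reproduced with something like the following (the filter port is whichever published port is under test):

tcpdump -ni any "tcp port 8888"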

IgorOhrimenko · Mar 01 '23

What info should I add? More tcpdump? Or strace?

IgorOhrimenko · Mar 06 '23