Docker Swarm dnsrr mode DNS resolution error
On a 4-node Docker Swarm cluster, I deployed a two-replica service using dnsrr endpoint mode and found that one of the DNS resolutions was wrong: across repeated pings, the service name sometimes resolved to a container belonging to a different service.
Steps to reproduce the issue:
- Deploy a service with replicas: 2 and endpoint_mode: dnsrr (see the stack file sketch below).
- Ping the Docker service name.
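For reference, a minimal stack file matching this setup might look like the following sketch; the service name and image are placeholders, and the stack name prod is only assumed from the prod_ prefix in the outputs below:

version: "3.7"
services:
  example-service:
    image: nginx:alpine        # placeholder image; any long-running workload works
    deploy:
      replicas: 2
      endpoint_mode: dnsrr     # DNS round-robin instead of a virtual IP

Deployed with something like: docker stack deploy -c stack.yml prod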
Describe the results you received:
ping example-service
64 bytes from example-service.1.xxxxx ...
ping example-service
64 bytes from another-service.2.xxxx ...
Describe the results you expected:
ping example-service
64 bytes from example-service.1.xxxxx ...
ping example-service
64 bytes from example-service.2.xxxx ...
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version:
Client: Docker Engine - Community
Version: 19.03.9
API version: 1.40
Go version: go1.13.10
Git commit: 9d988398e7
Built: Fri May 15 00:25:27 2020
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.9
API version: 1.40 (minimum version 1.12)
Go version: go1.13.10
Git commit: 9d988398e7
Built: Fri May 15 00:24:05 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.6
GitCommit: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc:
Version: 1.0.0-rc8
GitCommit: 425e105d5a03fabd737a126ad93d62a9eeede87f
docker-init:
Version: 0.18.0
GitCommit: fec3683
Output of docker info:
Client:
Debug Mode: false
Server:
Containers: 27
Running: 8
Paused: 0
Stopped: 19
Images: 43
Server Version: 19.03.9
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
NodeID: w76ek2fk07o15jn61pqdpter7
Is Manager: true
ClusterID: kvs9ffq7ndnxavaz8ydbypdb9
Managers: 4
Nodes: 4
Default Address Pool: 172.29.0.0/16
SubnetSize: 24
Data Path Port: 4789
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 10
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 10 years
Force Rotate: 0
Autolock Managers: false
Root Rotation In Progress: false
Node Address: 10.200.117.9
Manager Addresses:
10.200.117.10:2377
10.200.117.11:2377
10.200.117.8:2377
10.200.117.9:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-1127.8.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 31.5GiB
Name: wlapp-2.novalocal
ID: 3ALL:WBWE:UFCX:4DW3:Q2HJ:3BO4:F445:VR6O:HSUE:RDO2:Q4C4:LXWD
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
user@fffaeb1e7c29:~$ ping example-service
PING example-service (172.29.4.4) 56(84) bytes of data.
64 bytes from prod_example-service.1.9b707n20gzwnxdayr1j8csls5.prod_default (172.29.4.4): icmp_seq=1 ttl=64 time=0.605 ms
64 bytes from prod_example-service.1.9b707n20gzwnxdayr1j8csls5.prod_default (172.29.4.4): icmp_seq=2 ttl=64 time=0.617 ms
^C
--- example-service ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1ms
rtt min/avg/max/mdev = 0.605/0.611/0.617/0.006 ms
user@fffaeb1e7c29:~$ ping example-service
PING example-service (172.29.4.77) 56(84) bytes of data.
64 bytes from prod_gateway.2.xi951q7xw10r8lbddud4rtxia.prod_default (172.29.4.77): icmp_seq=1 ttl=64 time=0.109 ms
64 bytes from prod_gateway.2.xi951q7xw10r8lbddud4rtxia.prod_default (172.29.4.77): icmp_seq=2 ttl=64 time=0.081 ms
^C
--- example-service ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1ms
rtt min/avg/max/mdev = 0.081/0.095/0.109/0.014 ms
user@fffaeb1e7c29:~$ cat /etc/resolv.conf
search openstacklocal novalocal
nameserver 127.0.0.11
options ndots:0
user@fffaeb1e7c29:~$ exit
Additional environment details (AWS, VirtualBox, physical, etc.):
Can you try dig example-service to see the actual DNS entry on the Docker engine DNS? (You may need to use something like an Ubuntu container and then apt install dnsutils to install dig.) This seems similar to the issue I am having, where the DNS entry on the Docker engine DNS server is incorrect: #41766
In the linked issue, the A record is always off by minus one, i.e. the actual container IP 10.0.4.8 is listed as 10.0.4.7 in the DNS record. Is the DNS record constantly changing for you?
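For example, from inside any container already attached to the stack's overlay network (a sketch; it assumes a Debian/Ubuntu-based image, and 127.0.0.11 is the embedded DNS address shown in the resolv.conf output above):

# install dig inside the container, then query the engine's embedded DNS directly
apt-get update && apt-get install -y dnsutils
dig example-service @127.0.0.11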
It may be that when the number of Docker Swarm service instances was 3, three DNS records were recorded, and when I changed the number of instances to 2 the extra record was not deleted; that IP was later taken by another service, resulting in the wrong DNS resolution.
dig example-service
; <<>> DiG 9.11.5-P4-5.1+deb10u2-Debian <<>> example-service
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49236
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;example-service. IN A
;; ANSWER SECTION:
example-service. 600 IN A 172.29.4.155
example-service. 600 IN A 172.29.4.77
example-service. 600 IN A 172.29.4.154
;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Tue Dec 15 17:10:19 CST 2020
;; MSG SIZE rcvd: 110
Update: when a Docker Swarm service is in dnsrr mode, the number of DNS records is the number of instances plus one, and one of the DNS records is wrong.
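For reference, one way to compare the task count with the record count (a sketch; the stack and service names are taken from the outputs above, and dig must be run inside a container on the overlay network):

# on a manager node: number of running tasks for the service
docker service ps prod_example-service --filter desired-state=running -q | wc -l

# inside a container on the overlay network: number of A records returned by the embedded DNS
dig +short example-service @127.0.0.11 | wc -l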
@Livenux I discovered that the IP address of the service itself will be one less than the container IP. You can see the service virtual IP by using docker network inspect -v; the -v flag is required for verbose mode to see the service VIP. Does the DNS record from dig match up with the service IP?
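For example (the network name prod_default is taken from the ping output above):

# -v (verbose) adds a Services section to the output, which includes each service's VIP
docker network inspect -v prod_default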
It is not a service VIP; it is a stale DNS cache entry. (In VIP (IPVS) mode there should be a single DNS record, while in dnsrr mode the number of records should equal the number of service instances, right?) I removed the dnsrr service and recreated a new service with the same name in VIP mode, and the wrong IP is still resolved for the newly created service.
Hello,
I have a similar issue with Docker 24.0.6 (not the latest, but I haven't found anything related to this in recent changelogs).
I have many backend services using DNS RR endpoint mode, each with 2 replicas. The services are updated from time to time (so containers are created and destroyed).
We have connection issues from other services (a reverse proxy) to some of these backends: delays are added because the proxy times out trying to connect to some container IPs before falling back to another IP.
While debugging (docker exec into the reverse proxy), we found that the internal resolver returns more IP addresses than there are containers (sometimes even 4 addresses are returned):
getent ahosts a-problematic-backend-service
172.31.158.180 STREAM a-problematic-backend-service
172.31.158.180 DGRAM
172.31.158.180 RAW
172.31.156.25 STREAM
172.31.156.25 DGRAM
172.31.156.25 RAW
172.31.159.84 STREAM
172.31.159.84 DGRAM
172.31.159.84 RAW
Backend services without the issue return only 2 IP addresses.
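One way to spot which of those addresses is stale is to compare the resolver's answer with the task IPs Swarm knows about (a sketch; the service name is the one above, and the format string assumes the default task JSON layout):

# on a manager node: the IPs that Swarm has assigned to the running tasks
for task in $(docker service ps -q --filter desired-state=running a-problematic-backend-service); do
  docker inspect --format '{{range .NetworksAttachments}}{{.Network.Spec.Name}} {{.Addresses}} {{end}}' "$task"
done

# inside a container on the same network: the IPs the embedded resolver hands out
getent ahosts a-problematic-backend-service | awk '{print $1}' | sort -u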
To resolve the issue, I tried (without success):
- scaling up then down (or down to 0, then up)
- removing the stopped containers of these services
Is there a way to force the Docker internal DNS resolver to re-synchronize, or another workaround to make it forget the wrong IP addresses (without deleting and recreating the services)?
I have a similar issue with Docker 26.1.4. Nothing helps except restarting the Docker service.