LCOW: Intermittent DNS resolution failures with Alpine containers
Preface - I haven't yet debugged this enough to know precisely where the problem lies. I do know that I can trivially reproduce it and wanted to at least get the ticket filed and the conversation going. It may be related to some combination of:
- LCOW (or LCOW image / kernel / opengcs / etc)
- Alpine 3.9
- Environment - containers are running inside a Server 2019 Hyper-V VM that has nested virtualization enabled
- Docker version / some nuance of the Docker DNS resolver
I'm pretty sure this has something to do with Alpine in particular, since running the same scenario with Ubuntu containers does not fail.
docker info
```
Client:
 Debug Mode: false
 Plugins:
  app: Docker Application (Docker Inc., v0.8.0-beta2)
  buildx: Build with BuildKit (Docker Inc., v0.2.0-6-g509c4b6-tp)

Server:
 Containers: 2
  Running: 0
  Paused: 0
  Stopped: 2
 Images: 138
 Server Version: master-dockerproject-2019-04-28
 Storage Driver: windowsfilter (windows) lcow (linux)
  Windows:
  LCOW:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics l2bridge l2tunnel nat null overlay transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: hyperv
 Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
 Operating System: Windows 10 Enterprise Version 1809 (OS Build 17763.437)
 OSType: windows
 Architecture: x86_64
 CPUs: 2
 Total Memory: 16GiB
 Name: ci-lcow-prod-1
 ID: 0ac02c9d-aaba-42f4-8749-5a64af3068d8
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: true
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
```
The LCOW image is built from https://github.com/linuxkit/lcow/commit/d5dfdbc7d754f17b2c9ba05e867dc96aa192200b - it includes kernel 4.19.27, among other bits. A kernel image PR containing newer versions of OpenGCS, Alpine, the kernel, and runc was merged, but when I built it, it didn't launch containers and I had to revert (more info in https://github.com/linuxkit/lcow/pull/45#issuecomment-487786154).
compose file to demonstrate the problem
```yaml
version: '3'
services:
  foo:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup bar.internal && sleep 1s; done"
    networks:
      default:
        aliases:
          - foo.internal
  bar:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup foo.internal && sleep 1s; done"
    networks:
      default:
        aliases:
          - bar.internal
```
Output from compose up
The problem is that DNS resolution failures occur pretty regularly - i.e. foo fails to resolve bar.internal and vice versa. While the log also shows some successes, there are a number of failures as well (how many varies from run to run).
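A quick way to quantify the failure rate from a captured run is to count the real resolver errors while ignoring the harmless reverse-lookup complaint about '(null)' that busybox nslookup prints on every call - a small sh sketch (the `count_dns_failures` helper and the saved-log-file idea are mine, not part of the original repro):

```shell
# count_dns_failures FILE: count real lookup failures in a saved compose log.
# Lines mentioning '(null)' are the reverse-lookup noise busybox nslookup
# emits on every invocation, so they are excluded from the count.
count_dns_failures() {
  grep "Name does not resolve" "$1" | grep -vc "(null)"
}

# Usage (hypothetical): docker-compose up | tee compose.log, then:
#   count_dns_failures compose.log
```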
PS C:\source\alpine-test> docker-compose -f .\docker-compose-bad.yml up
Creating network "alpine-test_default" with the default driver
Creating alpine-test_bar_1 ... done
Creating alpine-test_foo_1 ... done
Attaching to alpine-test_foo_1, alpine-test_bar_1
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | nslookup: can't resolve 'bar.internal': Name does not resolve
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 |
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 |
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 |
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 |
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 |
bar_1 | nslookup: can't resolve 'foo.internal': Name does not resolve
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 |
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 |
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 |
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | nslookup: can't resolve 'foo.internal': Name does not resolve
bar_1 |
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | nslookup: can't resolve 'bar.internal': Name does not resolve
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 |
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 |
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | nslookup: can't resolve 'bar.internal': Name does not resolve
foo_1 |
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25 alpine-test_foo_1.alpine-test_default
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | nslookup: can't resolve 'bar.internal': Name does not resolve
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25
foo_1 |
foo_1 | nslookup: can't resolve '(null)': Name does not resolve
foo_1 | Name: bar.internal
foo_1 | Address 1: 172.18.76.19 alpine-test_bar_1.alpine-test_default
bar_1 |
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
Gracefully stopping... (press Ctrl+C again to force)
Workaround
One way to work around the problem is to have the Alpine container first perform a dig against the host, which presumably caches the DNS record for future nslookup calls.
compose file
```yaml
version: '3'
services:
  foo:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "apk add bind-tools; dig bar.internal; while true; do nslookup bar.internal; sleep 2s; done"
    networks:
      default:
        aliases:
          - foo.internal
  bar:
    image: alpine:latest
    dns_search: internal
    entrypoint: sh -c "apk add bind-tools; dig foo.internal; while true; do nslookup foo.internal; sleep 2s; done"
    networks:
      default:
        aliases:
          - bar.internal
```
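An alternative to warming the cache with dig would be to retry the lookup a bounded number of times inside the entrypoint, so a transient failure doesn't abort the startup command - a minimal sh sketch (the `retry` helper is hypothetical, not part of the compose files above):

```shell
# retry N CMD...: run CMD up to N times, sleeping 1s between failed attempts.
# Returns 0 on the first success, 1 if all N attempts fail.
retry() {
  attempts=$1
  shift
  n=0
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge "$attempts" ] && return 1
    sleep 1
  done
  return 0
}

# Usage inside an entrypoint (hypothetical):
#   retry 10 nslookup bar.internal || echo "bar.internal never resolved"
```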
Output from compose up
The nslookup results have changed quite a bit from:
bar_1 |
bar_1 | nslookup: can't resolve '(null)': Name does not resolve
bar_1 | Name: foo.internal
bar_1 | Address 1: 172.18.67.25
to:
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
Here's a longer run from the above compose file showing that nslookup no longer fails intermittently.
PS C:\source\alpine-test> docker-compose up
Creating network "alpine-test_default" with the default driver
Creating alpine-test_bar_1 ... done
Creating alpine-test_foo_1 ... done
Attaching to alpine-test_foo_1, alpine-test_bar_1
foo_1 | fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
bar_1 | fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
foo_1 | fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
bar_1 | fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
foo_1 | (1/10) Installing libgcc (8.3.0-r0)
bar_1 | (1/10) Installing libgcc (8.3.0-r0)
bar_1 | (2/10) Installing krb5-conf (1.0-r1)
foo_1 | (2/10) Installing krb5-conf (1.0-r1)
bar_1 | (3/10) Installing libcom_err (1.44.5-r0)
foo_1 | (3/10) Installing libcom_err (1.44.5-r0)
bar_1 | (4/10) Installing keyutils-libs (1.6-r0)
foo_1 | (4/10) Installing keyutils-libs (1.6-r0)
bar_1 | (5/10) Installing libverto (0.3.0-r1)
bar_1 | (6/10) Installing krb5-libs (1.15.5-r0)
foo_1 | (5/10) Installing libverto (0.3.0-r1)
foo_1 | (6/10) Installing krb5-libs (1.15.5-r0)
bar_1 | (7/10) Installing json-c (0.13.1-r0)
bar_1 | (8/10) Installing libxml2 (2.9.9-r1)
foo_1 | (7/10) Installing json-c (0.13.1-r0)
foo_1 | (8/10) Installing libxml2 (2.9.9-r1)
bar_1 | (9/10) Installing bind-libs (9.12.4_p1-r1)
foo_1 | (9/10) Installing bind-libs (9.12.4_p1-r1)
foo_1 | (10/10) Installing bind-tools (9.12.4_p1-r1)
bar_1 | (10/10) Installing bind-tools (9.12.4_p1-r1)
foo_1 | Executing busybox-1.29.3-r10.trigger
bar_1 | Executing busybox-1.29.3-r10.trigger
bar_1 | OK: 12 MiB in 24 packages
foo_1 | OK: 12 MiB in 24 packages
foo_1 |
foo_1 | ; <<>> DiG 9.12.4-P1 <<>> bar.internal
foo_1 | ;; global options: +cmd
foo_1 | ;; Got answer:
foo_1 | ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62166
foo_1 | ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
foo_1 |
foo_1 | ;; QUESTION SECTION:
foo_1 | ;bar.internal. IN A
foo_1 |
foo_1 | ;; ANSWER SECTION:
foo_1 | bar.internal. 600 IN A 172.25.137.174
foo_1 |
foo_1 | ;; Query time: 0 msec
foo_1 | ;; SERVER: 172.25.128.1#53(172.25.128.1)
foo_1 | ;; WHEN: Fri May 03 18:26:29 UTC 2019
foo_1 | ;; MSG SIZE rcvd: 58
foo_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 |
bar_1 | ; <<>> DiG 9.12.4-P1 <<>> foo.internal
bar_1 | ;; global options: +cmd
bar_1 | ;; Got answer:
bar_1 | ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34929
bar_1 | ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
bar_1 |
bar_1 | ;; QUESTION SECTION:
bar_1 | ;foo.internal. IN A
bar_1 |
bar_1 | ;; ANSWER SECTION:
bar_1 | foo.internal. 600 IN A 172.25.139.149
bar_1 |
bar_1 | ;; Query time: 0 msec
bar_1 | ;; SERVER: 172.25.128.1#53(172.25.128.1)
bar_1 | ;; WHEN: Fri May 03 18:26:29 UTC 2019
bar_1 | ;; MSG SIZE rcvd: 58
bar_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
foo_1 | Server: 172.25.128.1
foo_1 | Address: 172.25.128.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.25.137.174
foo_1 |
bar_1 | Server: 172.25.128.1
bar_1 | Address: 172.25.128.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.25.139.149
bar_1 |
Ubuntu results
Compose file
```yaml
version: '3'
services:
  foo:
    image: ubuntu:latest
    dns_search: internal
    entrypoint: sh -c "apt-get update && apt-get install -y dnsutils; while true; do nslookup 'bar.internal'; sleep 2s; done"
    networks:
      default:
        aliases:
          - foo.internal
  bar:
    image: ubuntu:latest
    dns_search: internal
    entrypoint: sh -c "apt-get update && apt-get install -y dnsutils; while true; do nslookup 'foo.internal'; sleep 2s; done"
    networks:
      default:
        aliases:
          - bar.internal
```
I'll spare the full log here, but after switching to Ubuntu containers, nslookup succeeds from the outset:
foo_1 | Server: 172.30.16.1
foo_1 | Address: 172.30.16.1#53
foo_1 |
foo_1 | Non-authoritative answer:
foo_1 | Name: bar.internal
foo_1 | Address: 172.30.18.190
foo_1 |
bar_1 | Server: 172.30.16.1
bar_1 | Address: 172.30.16.1#53
bar_1 |
bar_1 | Non-authoritative answer:
bar_1 | Name: foo.internal
bar_1 | Address: 172.30.28.25
bar_1 |
I just verified that I'm not seeing the same behavior with Alpine 3.9 on Docker for Mac (I still get the reverse-pointer failure nslookup: can't resolve '(null)': Name does not resolve, but I'm always able to resolve queries).
Not sure who the right MS folks are to contact - @jhowardmsft or @jterry75?
This might end up being a ticket to file in opengcs project.
docker info
```
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 1
Server Version: 18.09.2
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9754871865f7fe2f4e74d43e2fc7ccd237edcbce
runc version: 09c8266bf2fcf9519a651b04ae54c967b9ab86ec
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.125-linuxkit
Operating System: Docker for Mac
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.952GiB
Name: linuxkit-025000000001
ID: WOHB:ZTHF:LEYI:UJM6:XU5Y:KRRI:2TLV:Z352:WLSD:HYPI:IUBB:K27H
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 26
 Goroutines: 53
 System Time: 2019-05-03T18:48:28.685819469Z
 EventsListeners: 2
HTTP Proxy: gateway.docker.internal:3128
HTTPS Proxy: gateway.docker.internal:3129
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
```
@Iristyle Thank you very much for this detailed description of the problem and the workaround. In case the following is useful context for anyone looking to fix this issue, I see it while running:
- Windows Server 2019 (1809)
- Docker EE 17.10.0-ee-preview-3
- LCOW containers
- docker-compose using a nat network it spins up automatically
(I am running this antique docker version because to the best of my knowledge it is the only one that supports LCOW on Win Server.)
I have a sneaking suspicion that the culprit is busybox, and that using version 1.28 will work; 1.29's nslookup is broken.
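If busybox really is the culprit, a minimal pair along these lines should isolate it - a hypothetical compose file modeled on @Iristyle's repro, where the 1.28-based service would be expected to resolve reliably while the 1.29-based one fails intermittently (service names and aliases are mine):

```yaml
version: '3'
services:
  bb128:
    image: busybox:1.28
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup bb129.internal && sleep 1s; done"
    networks:
      default:
        aliases:
          - bb128.internal
  bb129:
    image: busybox:1.29
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup bb128.internal && sleep 1s; done"
    networks:
      default:
        aliases:
          - bb129.internal
```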
Can you PTAL @pradipd?
@daschott - Would you mind triaging?
@mamezgeb what is the latest supported Docker version with LCOW on Server? @3dbrows is running the old 17.10 preview version. I know we have https://docs.docker.com/docker-for-windows/wsl-tech-preview/ for Desktop and an experimental feature on Docker CE, but what is the current recommendation for Server?
@3dbrows did trying an older busybox image work? Are other container images affected as well?
@daschott Busybox 1.28 works (wider discussion here: https://github.com/docker-library/busybox/issues/48). I've seen this nslookup problem in any image containing that version of busybox. My workaround is, on container startup, to run a command that installs dig, uses it to find the other containers I'm interested in, and writes their IPs out to a hosts file. I'm aware this is brittle if the target IPs change, but it's all I can think of for now. Example docker-compose script to achieve this:
```yaml
nsq_create_topic:
  image: nsqio/nsq:v1.2.0
  dns: "8.8.8.8"
  command: >
    sh -c "
    apk add bind-tools; echo \"$$(dig nsqd +short) nsqd\" >> /etc/hosts; cat /etc/hosts;
    wget -qO- --post-data='' 'nsqd:4151/topic/create?topic=worker'"
```
I specify my DNS resolver (for apk to use) because my target machine is an Azure VM with Private DNS running on 127.7.7.7 (the resolver that DHCP in Azure specifies), which the container cannot access, so without this setting it can't resolve much of anything. (I have no control over the Azure environment.)
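For what it's worth, the dig-into-hosts trick can be made slightly more defensive by refusing to write a blank entry when the lookup returns nothing - a small sh sketch (`pin_host` is a name I made up for illustration, not part of the script above):

```shell
# pin_host NAME IP [FILE]: append "IP NAME" to FILE (default /etc/hosts),
# failing instead of writing a blank entry when IP is empty -- which is
# what the inline echo "$(dig nsqd +short) nsqd" would silently do when
# the lookup fails.
pin_host() {
  name=$1
  ip=$2
  file=${3:-/etc/hosts}
  [ -n "$ip" ] || return 1
  printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Usage (hypothetical): pin_host nsqd "$(dig +short nsqd | head -n1)"
```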
Thanks @3dbrows for confirming. Is it possible at all to rebuild on top of busybox 1.28?
@Iristyle the point that this appears to work reliably on the Ubuntu image, and that on Alpine dig works while only nslookup fails, is interesting. Can you confirm which busybox version Alpine is using? Does dig work reliably?
@daschott Could try that; the best way might be to pick up the latest busybox by upgrading the Alpine base image - the version numbers are as follows:
- Alpine 3.10 has busybox 1.30.1-r3: https://pkgs.alpinelinux.org/packages?name=busybox&branch=v3.10
- Alpine 3.9 has busybox 1.29.3-r10: https://pkgs.alpinelinux.org/packages?name=busybox&branch=v3.9
I don't have access to my LCOW/WinServer box right this minute, but I imagine a good test would be to take @Iristyle's script above and modify it like this:
Expected to fail:
```yaml
services:
  foo:
    image: alpine:3.9
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup bar.internal && sleep 1s; done"
    networks:
      default:
        aliases:
          - foo.internal
  bar:
    image: alpine:3.9
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup foo.internal && sleep 1s; done"
    networks:
      default:
        aliases:
          - bar.internal
```
Expected to work:
Copy-paste the above but change 3.9 to 3.10. Bear in mind that at the time of @Iristyle's initial post, latest meant 3.9; so, in fact, the repro script he wrote might now work as-is, given that latest now means 3.11.
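Spelled out, the expected-to-work variant is the same file with the image tag bumped:

```yaml
services:
  foo:
    image: alpine:3.10
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup bar.internal && sleep 1s; done"
    networks:
      default:
        aliases:
          - foo.internal
  bar:
    image: alpine:3.10
    dns_search: internal
    entrypoint: sh -c "while true; do nslookup foo.internal && sleep 1s; done"
    networks:
      default:
        aliases:
          - bar.internal
```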