Docker network connection timeouts to host over time
- [x] I have tried with the latest version of my channel (Stable or Edge)
- [x] I have uploaded Diagnostics
- Diagnostics ID: 7E746511-651C-4A74-8C84-91189E8962C1/20201006161122
Expected behavior
I would expect services running inside Docker containers on the WSL2 backend to be able to communicate reliably with applications running on the host, even with frequent polling
Actual behavior
Due to https://github.com/docker/for-win/issues/8590, I have to run some applications that require high download speeds on the host. Multiple applications inside Docker containers on a Docker bridge network poll this host application every few seconds. When WSL is first launched, the applications communicate reliably, but the connection deteriorates over time, and after 1-2 days I notice frequent "connection timed out" responses from the application running on the host. Running wsl --shutdown and restarting the Docker daemon fixes the issue temporarily. Shifting applications out of Docker and onto the host fixes their communication issues as well. It may be related to the overall network issues linked above.
To be clear, it can still connect. It just starts timing out more and more often the longer the network/containers have been up.
Information
- Windows Version: 2004 (OS Build 19041.508)
- Docker Desktop Version: 2.4.1.0 (48583)
- Are you running inside a virtualized Windows e.g. on a cloud server or on a mac VM: No
I have had this problem ever since starting to use Docker for Windows with the WSL2 backend.
Steps to reproduce the behavior
- Run an application on the Windows host. I tried with NZBGet (host IP: 192.168.1.2)
- Poll this application from within a Docker container on a Docker bridge network inside WSL2. I polled 192.168.1.2:6789 every few seconds (a minimal polling loop is sketched below)
- Check back in a day to see if the connection is timing out more frequently.
- Restart WSL/the Docker daemon and notice that the connection is suddenly more reliable, though it will begin to deteriorate over time again
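For reference, the polling loop in step 2 can be as simple as the following; the image is arbitrary (anything with curl works) and the IP/port match my setup above:

```sh
# Sketch only: poll the host-side service (NZBGet at 192.168.1.2:6789 in my case)
# every 5 seconds from a container on the default bridge network, and log any
# request that takes longer than 5 seconds.
docker run --rm --entrypoint sh curlimages/curl -c '
  while true; do
    curl -s -o /dev/null --max-time 5 http://192.168.1.2:6789/ || echo "$(date) request timed out"
    sleep 5
  done'
```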
This seems to improve if you use the recommended host.docker.internal option instead of using the IP of the host machine directly
Further update on this: while the above does delay the deterioration, it still eventually happens. After 4-5 days, timeouts start occurring with increasing frequency, eventually reaching a point where nearly every call times out, requiring a full restart of WSL and Docker to get things working again.
We have the same issue
- Using 2.4.0.0
- We use host.docker.internal
We have a service running on the host.
If I try to hit host.docker.internal from within a Linux container, I can always get it to trip up eventually, say after 5000 curl requests to http://host.docker.internal/service (it times out for one request)
If I try http://host.docker.internal/service from the host, it works flawlessly even after 10000 curl requests
Sometimes, intermittently, and we can't figure out why, it starts to fail much more frequently (maybe every 100 curl requests)
Something is up with the networking...
Here is a very simple test to show what's going on:
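In essence it is just a tight curl loop from a container against the host, counting how many requests time out; something along these lines, where the /service path, the request count, and the 2-second per-request timeout are all placeholders:

```sh
# Placeholder sketch: hammer the host-side service via host.docker.internal and
# count how many of the requests time out.
docker run --rm --entrypoint sh curlimages/curl -c '
  fail=0
  for i in $(seq 1 5000); do
    curl -s -o /dev/null --max-time 2 http://host.docker.internal/service || fail=$((fail+1))
  done
  echo "timed out: $fail of 5000 requests"'
```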

In my limited testing, I created a loopback adapter, gave it the IP 10.0.75.2, and used that instead of host.docker.internal. It's much more reliable. It's an ugly workaround, but it might at least help show where the issue is.
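For anyone who wants to try the same workaround: the Microsoft KM-TEST Loopback Adapter has to be added first (Device Manager > Add legacy hardware), and then the address can be assigned roughly like this, where "Loopback" stands in for whatever name Windows gave the new interface:

```
:: "Loopback" is a placeholder for the interface name shown in ncpa.cpl.
netsh interface ipv4 add address "Loopback" 10.0.75.2 255.255.255.0
```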
Hey guys, this is still happening pretty consistently. Is anyone looking at the reliability/performance of these things? Is this the wrong place to post this?
I was able to report this via their support and have them reproduce the issue. They diagnosed the cause, but said fixing it would involve some major refactoring, so they didn't have a target fix date. Below is the issue as described by them:
I can reproduce the bug now. If I query the vpnkit diagnostics with this program https://github.com/djs55/bug-repros/tree/main/tools/vpnkit-diagnostics while the connection is stuck, I observe the following (for my particular repro the port number was 51580; I discovered this using Wireshark to explore the trace):
$ tcpdump -r capture\all.pcap port 51580
15:57:03.021934 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195077730 ecr 0,nop,wscale 7], length 0
15:57:04.064094 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195078771 ecr 0,nop,wscale 7], length 0
15:57:06.111633 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195080819 ecr 0,nop,wscale 7], length 0
15:57:10.143908 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195084851 ecr 0,nop,wscale 7], length 0
15:57:18.464142 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195093171 ecr 0,nop,wscale 7], length 0
15:57:34.848536 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195109555 ecr 0,nop,wscale 7], length 0
15:58:07.103411 IP 192.168.65.3.51580 > 192.168.65.2.6789: Flags [S], seq 609899732, win 64240, options [mss 1460,sackOK,TS val 2195141811 ecr 0,nop,wscale 7], length 0
which is a stuck TCP handshake from the Linux point of view. The same thing is probably visible in a live trace from
docker run -it --privileged --net=host djs55/tcpdump -n -i eth0
Using Sysinternals Process Explorer to examine the vpnkit.exe process, I only see 1 TCP connection at a time (although a larger than ideal number of UDP connections, which are DNS-related I think). There's no sign of a resource leak.
When this manifests I can still establish other TCP connections and run the test again -- the impact seems limited to the 1 handshake failure.
The vpnkit diagnostics has a single TCP flow registered:
> cat .\flows
TCP 192.168.65.3:51580 > 192.168.65.2:6789 socket = open last_active_time = 1605023899.0
which means that vpnkit itself thinks the flow is connected, although the handshake never completed.
Woah, thanks for this update @rg9400. Glad you got it on their radar. So your workaround is to restart Docker and run wsl --shutdown? I've been trying to use another IP (a loopback adapter) instead of host.docker.internal, or whatever host.docker.internal points to. But I'm not 100% sure that solves the problem permanently. Maybe it's just a new IP, so it will work for a little while and then deteriorate again over time. Based on your explanation of the root cause, that might indeed be the case.
Yeah, for now I am just living with it and restarting WSL/Docker every now and then when the connection timeouts become too frequent and unbearable.
What can we do to get this worked on? Is there work happening on it, or a ticket we can follow? This still bugs us quite consistently.
I want to keep this thread alive, as this is a massive pain for folks, especially because they don't know it's happening. This needs to become more reliable.
Here is a newer diagnostic id: F4D29FA0-6778-40B8-B312-BADEA278BB3B/20210521171355
Also discovered that just killing vpnkit.exe in Task Manager mitigates the problem. It restarts almost instantly and connections resume much more reliably, without having to restart containers or anything. But the problem eventually recurs.
We have about 15 services in our docker-compose file, and all of them do an npm install. A cacheless build is impossible because it tries to build all the services at once, and the npm install steps time out since downloading that many packages in parallel kills our bandwidth.
I'm not using the --parallel flag
I've set the following environment variables:
- COMPOSE_HTTP_TIMEOUT=240
- COMPOSE_PARALLEL_LIMIT=2
But none of this seems to change the behavior.
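For reference, this is how the variables are being set (in the shell environment before running compose); as a cruder fallback, building the services one at a time at least keeps only one npm install downloading at once (the service names below are placeholders):

```sh
# Compose settings as described above (can also live in a .env file).
export COMPOSE_HTTP_TIMEOUT=240
export COMPOSE_PARALLEL_LIMIT=2

# Fallback: build each service serially so only one npm install runs at a time.
# "web api worker" stands in for the real service names in the compose file.
for svc in web api worker; do
  docker-compose build --no-cache "$svc"
done
```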
This happens on macOS too, in fact quite reliably after ~7 minutes and ~13,000 requests when hitting an HTTP server:
Server:
$ python3 -mhttp.server 8015
Client (siege):
$ cat <<EOF > siegerc
timeout = 1
failures = 1
EOF
$ docker run --rm -v $(pwd)/siegerc:/tmp/siegerc -t funkygibbon/siege --rc=/tmp/siegerc -t2000s -c2 -d0.1 http://host.docker.internal:8015/api/foo
Output:
New configuration template added to /root/.siege
Run siege -C to view the current settings in that file
** SIEGE 4.0.4
** Preparing 2 concurrent users for battle.
The server is now under siege...[alert] socket: select and discovered it's not ready sock.c:351: Connection timed out
[alert] socket: read check timed out(1) sock.c:240: Connection timed out
siege aborted due to excessive socket failure; you
can change the failure threshold in $HOME/.siegerc
Transactions: 13949 hits
Availability: 99.99 %
Elapsed time: 378.89 secs
Data transferred: 6.24 MB
Response time: 0.00 secs
Transaction rate: 36.82 trans/sec
Throughput: 0.02 MB/sec
Concurrency: 0.10
Successful transactions: 0
Failed transactions: 1
Longest transaction: 0.05
Shortest transaction: 0.00
What's interesting is that it gets progressively worse from there: the timeouts happen more and more frequently. Restarting the HTTP server doesn't help, but restarting it on another port does (e.g. from 8019 -> 8020). From there you get another 7 minutes of 100% success before it starts degrading again.
I tried adding an IP alias to my loopback adapter and hitting that instead of host.docker.internal but it had the same behavior (i.e. degraded after 7 minutes). The same goes for using the IP (192.168.65.2) and skipping the DNS resolution.
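For reference, the loopback alias was added with something like this; the address is arbitrary, it just has to be one the host answers on:

```sh
# macOS: add an extra address to the loopback interface so containers can target
# the host directly instead of via host.docker.internal. The address is arbitrary.
sudo ifconfig lo0 alias 10.254.254.254
```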
This issue remains unresolved. The devs indicated it required major rework, but I haven't heard back from them in 6 months on the progress.
Issues go stale after 90 days of inactivity.
Mark the issue as fresh with /remove-lifecycle stale comment.
Stale issues will be closed after an additional 30 days of inactivity.
Prevent issues from auto-closing with an /lifecycle frozen comment.
If this issue is safe to close now please do so.
Send feedback to Docker Community Slack channels #docker-for-mac or #docker-for-windows. /lifecycle stale
/remove-lifecycle stale
I am also affected by this issue. I thought at one point it was caused by TCP keepalive on sockets, with sockets not being closed as fast as they are opened, thus exhausting the maximum number of available sockets. But the problem doesn't go away even if my containers stop opening connections for a while; only a restart of Docker and WSL seems to fix this. This issue should be high priority...
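For anyone who wants to rule that theory out themselves, a rough check from inside the Docker VM's network namespace (the same host-networking trick as the tcpdump command quoted above) is enough; if socket exhaustion were the cause, this count would keep climbing even while the containers sit idle:

```sh
# Count TCP connections stuck in TIME_WAIT inside the Docker Desktop VM.
# busybox netstat and grep ship with the stock alpine image.
docker run --rm --net=host alpine sh -c 'netstat -tn | grep -c TIME_WAIT'
```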
I cannot connect from a container to a host port, even using telnet. Network mode is bridge, which is the default, but "host" mode doesn't work either.
I tried guessing the host IP, and I also tried this:
extra_hosts:
- "host.docker.internal:host-gateway"
Neither option worked.
A telnet connection from the host machine to the same host port works fine.
It worked fine in previous Docker versions! It seems to have broken with some update, maybe from 2021-2022.
Update: it was my Ubuntu UFW that was blocking containers from connecting to host ports.
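In case it saves someone else the debugging time: rather than disabling UFW entirely, a rule allowing the container subnet through is enough; a sketch assuming the default bridge range and a placeholder port:

```sh
# Allow containers on the default bridge network (172.17.0.0/16) to reach the
# host-side service; adjust the subnet and the port (8080 here is a placeholder).
sudo ufw allow from 172.17.0.0/16 to any port 8080 proto tcp
```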
Having this exact problem on macOS. Restarting Docker fixes the problem (for a while).
We have reports of this occurring across teams on Windows and macOS as well. We have no reports of this issue occurring on Linux.
Someone noticed that on macOS, simply waiting ~15mins often alleviates the problem.
We're also experiencing this (using host.docker.internal) on Docker Desktop for Windows. Strangely enough, Docker versions up to 4.5.1 seem to work fine, but versions 4.6.x and 4.7.x instantly bring up the problem. Connections work for some time, but then the timeouts start. All of the checks run by "C:\Program Files\Docker\Docker\resources\com.docker.diagnose.exe" check pass.
I'm experiencing the same problem, with an increasing number of timeouts over time while using host.docker.internal.
I'm also experiencing the same problem. Downgrading to 4.5.1 seems to solve the issue.
Any update on this issue? I'm experiencing the same. Restarting the container does not fix it. Only restarting the daemon/host resolves it.
We seem to have resolved the issue on Windows (but not Mac)
We previously had the following configuration in our compose file to allow containers to reach the host using "host.docker.internal" on Windows, Mac and Linux hosts:
extra_hosts:
- "host.docker.internal:host-gateway"
Removing this configuration resolved the timeout issue on Windows (but can obviously cause other problems). Mac users still have timeout issues, though.
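One way to keep the mapping only on the platforms that need it is to move it into an override file that Windows users simply don't include; a sketch, with hypothetical file and service names:

```yaml
# docker-compose.hostgw.yml -- applied only where host-gateway is wanted:
#   docker compose -f docker-compose.yml -f docker-compose.hostgw.yml up
services:
  app:
    extra_hosts:
      - "host.docker.internal:host-gateway"
```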
We are encountering issues with this on macOS 12.0. We determined that our developers using Docker Desktop 4.3.0 have not encountered the issue, so we are currently testing a downgrade to 4.3.0; this seems to have resolved the problem so far. We have not yet tested going all the way back up to 4.5.1 as noted earlier in this thread. We also have not observed this issue in Docker on our x86 Ubuntu environments.
@rg9400 this was SUPER helpful... I started running into the same issue. I use dockerized Jupyter on Docker for Windows for a significant amount of my day-to-day work and have been getting CONSTANT timeout errors when I run notebooks from the beginning. I was also restarting Docker a ton, but after finding this comment, I found a way to pretty consistently "unstick" things (though it's definitely still annoying):
C:\Users\blah\blah> tasklist | findstr vpnkit.exe
C:\Users\blah\blah> taskkill /F /pid <pid of vpnkit>
And then I give it just a sec and when the cell tries again to reestablish connections, it's good.
I don't really think this is a viable solution for folks running processes that are constantly establishing connections, but it works for me currently for Jupyter (once I get data, I don't really need more connections for the notebook though).
Any update from the folks working on this issue? I found something that sounds similar from 2019, but from a quick search of the issues it doesn't look like anyone is making an effort to resolve it on the vpnkit side.
> Having this exact problem on macOS. Restarting Docker fixes the problem (for a while).
I'm having the same issue running Docker 4.9.1 on Mac, and I'm hitting it very often. After restarting Docker it works again, but that's not a long-term solution, as you mentioned...
I'm also affected by this issue. It seems to hang after just a few minutes and causes network timeouts. Restarting Docker fixes it for a few more minutes. It's practically unusable...
I tried using the internal IP of the Docker host (instead of "host.docker.internal"), but the problem still occurs: within a few minutes, the network connection timeouts start again. Just stopping and starting the container doesn't fix the issue; only recreating the container does. I'm working with Windows Docker Desktop v4.9.1, updated today!