backend: increase connection count before connect
There is a race between the time we connect to a backend and the time we increase the connection count. This race is not normally observed, but when the .max_connections property is used we could end up creating more connections to the backend than this limit permits. This is problematic, as we end up consuming more system resources, e.g. file descriptors, than reserved for such backends.
The problem is that we release the backend lock and wait for the connect to finish. This can take up to connect_timeout, which defaults to 3.5 seconds, and only then do we count the connection as consumed towards the .max_connections property.
This is a recipe for bursty behavior, where a backend that is already struggling could be overwhelmed by a significant number of connections. Then, as the connection count goes above the ceiling, we either fail fetches or queue them up. However, as soon as we drop below the limit again we over-commit once more, and the cycle repeats.
The test case uses a blackhole backend to simulate hitting connect_timeout; this exercises the race, but under normal circumstances the likelihood of the race occurring depends on the traffic (fetches), networking conditions, and the performance of the backend.
The solution is to increase the connection count before connecting to the backend, and if we fail to get a connection to the backend we revert the count before failing the fetch. This makes any fetches past the .max_connections limit either fail outright or observe the presence of a queue to the backend.
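For illustration, here is a simplified sketch of the new ordering. None of the names below (struct backend, n_conn, max_connections, try_connect(), get_backend_fd()) are the real identifiers; they are hypothetical stand-ins for the counters and locking in the backend code, and this is not the actual diff.

#include <pthread.h>

/* Hypothetical stand-in for the real backend object. */
struct backend {
	pthread_mutex_t	mtx;
	unsigned	n_conn;
	unsigned	max_connections;
};

/* Placeholder for the real connect path; may block up to connect_timeout. */
int try_connect(struct backend *bp);

int
get_backend_fd(struct backend *bp)
{
	int fd;

	pthread_mutex_lock(&bp->mtx);
	if (bp->max_connections > 0 && bp->n_conn >= bp->max_connections) {
		pthread_mutex_unlock(&bp->mtx);
		return (-1);	/* over the limit: fail or queue the fetch */
	}
	bp->n_conn++;		/* reserve the slot before connecting */
	pthread_mutex_unlock(&bp->mtx);

	fd = try_connect(bp);
	if (fd < 0) {
		/* Connect failed: give the slot back before failing the fetch. */
		pthread_mutex_lock(&bp->mtx);
		bp->n_conn--;
		pthread_mutex_unlock(&bp->mtx);
	}
	return (fd);
}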
bugwash:
- skipped, was too close to the bugwash this week.
There are some actual concerns, but in general, I am not happy with the amount of changes this introduces left and right of the pretty basic feature itself. #4030 already contains incremental changes to some regions by different authors, and now a third author comes along and edits it all over again? Um no, my take here is that the PR should aim for a small diff unless there are good reasons not to.
I understand your concerns. I initially only implemented the bug fix, but when I did that I saw the opportunity to make this function simpler. If you do not like the extra work I did here, I can propose that we split this PR into two chunks: one containing a few commits and the bug fix, the other containing the rest of the commits.
Sorry for being late to the #4030 party; this is not an effort to re-implement #4030. As explained here: https://github.com/varnishcache/varnish-cache/pull/4154#discussion_r1715131791 the bug fix allows us to make the connection queue implementation more compact, and I think we should go the extra mile to keep it simple.
I can propose that we split this PR into two chunks.
Yes. I think it would be helpful to have a minimal patch with just the connection count fix and maybe a rework of the CW feature in a second step. Regarding the latter, there are things I do not like, yes, but also from the first look I would think it's wrong. As mentioned before, I would need to look again.
Cool, then I will do that. This includes the first 4 commits: https://github.com/varnishcache/varnish-cache/compare/master...varnishcache:varnish-cache:fe747ac16ef15ef228a5f245fab5bd6372c01f30?w=1 If you are OK with this, then I can go ahead and rebase. I will wait with the rest of the commits until after these 4 commits have been merged.
I ended up reducing the fix to only one commit with a small diff. @nigoroll Please go ahead and let me know if you are happy with it now. Once merged, I will create a new PR with the remaining commits.
As commenting on unchanged lines is not possible: In addition to the issues pointed out, if this was the right approach, an n_conn decrement is missing for the bo->htc == NULL case and the vbe_connwait_dequeue_locked() is still missing. But at any rate, I think the simplification attempted here is not possible.
Oh, this was a miss on my part: when I rebased the original PR to remove the other commits, I forgot to change this. I will push the fix as a new commit.
BTW, this is not a simplification attempt, this is fixing the bug outlined in the description of the commit. The simplification will come later, but let's handle this separately in another PR.
As of edfeb73ebbfbea7db487105a0e7ae12f6706ced8, b68 fails for me locally.
looks like this was because the "blackhole" connection does not time out as intended, but rather fails immediately:
**** v1 vsl| 1002 BereqMethod b GET
**** v1 vsl| 1002 BereqURL b /blackhole
**** v1 vsl| 1002 BereqProtocol b HTTP/1.1
**** v1 vsl| 1002 BereqHeader b Host: 127.0.0.1
**** v1 vsl| 1002 BereqHeader b User-Agent: c1
**** v1 vsl| 1002 BereqHeader b X-Forwarded-For: 127.0.0.1
**** v1 vsl| 1002 BereqHeader b Via: 1.1 v1 (Varnish/trunk)
**** v1 vsl| 1002 BereqHeader b Accept-Encoding: gzip
**** v1 vsl| 1002 BereqHeader b X-Varnish: 1002
**** v1 vsl| 1002 VCL_call b BACKEND_FETCH
**** v1 vsl| 1002 VCL_return b fetch
**** v1 vsl| 1002 Timestamp b Fetch: 1724257240.120347 0.000023 0.000023
**** v1 vsl| 1002 FetchError b backend default: fail errno 101 (Network is unreachable)
**** v1 vsl| 1002 Timestamp b Beresp: 1724257240.130065 0.009741 0.009718
**** v1 vsl| 1002 Timestamp b Error: 1724257240.130068 0.009744 0.000003
...
**** dT 7.624
---- v1 Not true: MAIN.backend_busy (0) == 1 (1)
Maybe it's better to use a backend with a delay?
Full output: https://gist.github.com/nigoroll/1dea1f48af137a0341f54e373cc5aee0
Interesting.
Thanks for testing, the original test case had this part:
backend default {
	# Non-routeable IPv4 address to simulate connect_timeout
	# XXX: ${blackhole_backend}
	.host = "10.255.255.1 80";
	[...]
}
That's when I found ${bad_ip}, something I could use instead. So I rewrote it.
Could you try to change the host in the test case to what I wrote here and try again?
The original idea was to introduce a macro for it, as the IP is non-routeable over the Internet.
A backend with a delay would be too late (I think); you want VTCP_Connect() to hang until the timeout.
I do not think that we will be able to find some magic IP which times out. Maybe we need a facility in varnishtest which listens on a socket, but just never accepts, or something similar?
Nope, this would exercise the first_byte_timeout and not the connect_timeout. That is too late. When a backend listens for connections, a connect will not fail unless the backlog overflows. The kernel could decide to refuse the connection for other reasons, but this is mostly outside the control of the application.
A connect syscall makes the TCP kernel stack initiate the three-way handshake (SYN, SYN-ACK, ACK); a successful call to connect does not require the other end to "accept" the connection. Once the handshake completes, the connection is added to the listen backlog. It stays there until the server accepts the connection.
In order to make VTCP_Connect() fail, the backend needs to either not be in the LISTEN state, or we need to initiate the handshake with a peer that will not respond to the handshake with a SYN-ACK (a blackhole). The problem with the former approach is that it yields an immediate rejection, which is not what we want to test. We could also do something smart with an eBPF program, but that solution would not be portable enough for our liking. That's why I landed on using a "blackhole" IPv4 address.
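For illustration only, here is a rough standalone sketch of what a connect against a blackhole peer looks like outside of varnish. Everything in it is an assumption: 192.0.2.255 is the TEST-NET-1 address used for ${bad_ip}, the 3500 ms budget mirrors the default connect_timeout, and the address must actually be blackholed rather than rejected locally (in the b68 failure above it was rejected immediately with ENETUNREACH).

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sa;
	struct pollfd pfd;
	int fd, r;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	fcntl(fd, F_SETFL, O_NONBLOCK);

	memset(&sa, 0, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_port = htons(80);
	inet_pton(AF_INET, "192.0.2.255", &sa.sin_addr);

	/* The SYN goes out, but a blackhole peer never answers with a SYN-ACK. */
	r = connect(fd, (struct sockaddr *)&sa, sizeof sa);
	if (r < 0 && errno != EINPROGRESS) {
		/* e.g. ENETUNREACH: an immediate failure, not a timeout. */
		perror("connect");
	} else {
		pfd.fd = fd;
		pfd.events = POLLOUT;
		if (poll(&pfd, 1, 3500) == 0)
			printf("still pending after 3500 ms"
			    " (simulated connect_timeout)\n");
	}
	close(fd);
	return (0);
}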
Can't we do that with raw sockets maybe?
I am not sure I understand; you normally use raw sockets if you want to implement your own transport over IP. How is this relevant here? Varnish only supports TCP & UDS backends, and I do not think we should roll our own protocol for this.
What I meant is that we are currently limited by the kernel taking care of connection establishment and doing the handshake for us, and that is what we want to prevent. I'm not suggesting implementing a protocol here; we only want to receive the TCP packet containing the SYN flag from the client and simply ignore it, so that we can emulate a connection timeout. That's why I was suggesting raw sockets, but I'm not sure whether this can be achieved.
what about this?
Maybe we need a facility in varnishtest which listens on a socket, but just never accepts, or something similar?
can't we already do this with a vtc_server? if not, I would think the adjustments should be fairly minimal.
This wouldn't work with a regular TCP socket: as soon as you have a listening socket, all connect calls to that socket will succeed (regardless of whether you call accept or not), or fail immediately if the backlog is full.
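As a small standalone illustration of that point (not varnish code; plain BSD sockets on the loopback address, all names local to this sketch), the client's connect() succeeds as soon as the kernel completes the handshake, even though the server never calls accept():

#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sa;
	socklen_t len = sizeof sa;
	int srv, cli;

	/* Server side: bind and listen, but never call accept(). */
	srv = socket(AF_INET, SOCK_STREAM, 0);
	assert(srv >= 0);
	memset(&sa, 0, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sa.sin_port = 0;	/* let the kernel pick a free port */
	assert(bind(srv, (struct sockaddr *)&sa, sizeof sa) == 0);
	assert(listen(srv, 1) == 0);
	assert(getsockname(srv, (struct sockaddr *)&sa, &len) == 0);

	/* Client side: connect completes without the server ever accepting. */
	cli = socket(AF_INET, SOCK_STREAM, 0);
	assert(cli >= 0);
	if (connect(cli, (struct sockaddr *)&sa, sizeof sa) == 0)
		printf("connected, yet accept() was never called\n");

	close(cli);
	close(srv);
	return (0);
}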
also with uds?
This is not how the connect syscall works (at least not on Linux). The connect syscall associates the socket's file descriptor (fd) with a peer address. You can send off your request before your peer has accepted you, but the response will not return before you have been dequeued from the backlog and a response is written to the socket. You wait for your response in read/recv... It is true that the connect syscall can hang, but that is usually due to a failing handshake at the transport level.
Patch to add a 5s sleep before we accept the client in the server block:
$ git diff
diff --git a/bin/varnishtest/vtc_server.c b/bin/varnishtest/vtc_server.c
index 45913aaa7..66e2bf9f1 100644
--- a/bin/varnishtest/vtc_server.c
+++ b/bin/varnishtest/vtc_server.c
@@ -244,6 +244,7 @@ server_conn(void *priv, struct vtclog *vl)
 	CAST_OBJ_NOTNULL(s, priv, SERVER_MAGIC);
+	sleep(5);
 	addr = (void*)&addr_s;
 	l = sizeof addr_s;
 	fd = accept(s->sock, addr, &l);
Here is the VTC:
varnishtest "UDS connect"
server s1 -listen "${tmpdir}/uds" {
rxreq
txresp
} -start
client c1 -connect "${s1_sock}" {
timeout 10
txreq
rxresp
expect resp.status == 200
} -run
Logs:
$ bin/varnishtest/varnishtest -i uds_connect.vtc -v
**** dT 0.000
* top TEST uds_connect.vtc starting
**** top extmacro def pkg_version=trunk
**** top extmacro def pkg_branch=trunk
**** top extmacro def pwd=/home/asadsa/varnish-cache
**** top extmacro def date(...)
**** top extmacro def string(...)
**** top extmacro def localhost=127.0.0.1
**** top extmacro def bad_backend=127.0.0.1:35537
**** top extmacro def listen_addr=127.0.0.1:0
**** top extmacro def bad_ip=192.0.2.255
**** top extmacro def topbuild=/home/asadsa/varnish-cache
**** top extmacro def topsrc=/home/asadsa/varnish-cache
**** top macro def testdir=/home/asadsa/varnish-cache
**** top macro def tmpdir=/tmp/vtc.2188850.221ec1e4
**** top macro def vtcid=vtc.2188850.221ec1e4
** top === varnishtest "UDS connect"
* top VTEST UDS connect
** top === server s1 -listen "${tmpdir}/uds" {
** s1 Starting server
**** s1 macro def s1_addr=0.0.0.0
**** s1 macro def s1_port=0
**** s1 macro def s1_sock=/tmp/vtc.2188850.221ec1e4/uds
* s1 Listen on /tmp/vtc.2188850.221ec1e4/uds
** top === client c1 -connect "${s1_sock}" {
** c1 Starting client
** c1 Waiting for client
** s1 Started on /tmp/vtc.2188850.221ec1e4/uds (1 iterations)
** c1 Started on /tmp/vtc.2188850.221ec1e4/uds (1 iterations)
*** c1 Connect to /tmp/vtc.2188850.221ec1e4/uds
*** c1 connected fd 4 to /tmp/vtc.2188850.221ec1e4/uds
** c1 === timeout 10
** c1 === txreq
**** dT 0.001
**** c1 txreq|GET / HTTP/1.1\r
**** c1 txreq|Host: 127.0.0.1\r
**** c1 txreq|User-Agent: c1\r
**** c1 txreq|\r
** c1 === rxresp
**** dT 5.001
*** s1 accepted fd 5 0.0.0.0 0
** s1 === rxreq
**** s1 rxhdr|GET / HTTP/1.1\r
**** s1 rxhdr|Host: 127.0.0.1\r
**** s1 rxhdr|User-Agent: c1\r
**** s1 rxhdr|\r
**** s1 rxhdrlen = 51
**** s1 http[ 0] |GET
**** s1 http[ 1] |/
**** s1 http[ 2] |HTTP/1.1
**** s1 http[ 3] |Host: 127.0.0.1
**** s1 http[ 4] |User-Agent: c1
**** s1 bodylen = 0
** s1 === txresp
**** s1 txresp|HTTP/1.1 200 OK\r
**** s1 txresp|Date: Mon, 02 Sep 2024 11:46:09 GMT\r
**** s1 txresp|Server: s1\r
**** s1 txresp|Content-Length: 0\r
**** s1 txresp|\r
**** dT 5.002
*** s1 shutting fd 5
** s1 Ending
**** c1 rxhdr|HTTP/1.1 200 OK\r
**** c1 rxhdr|Date: Mon, 02 Sep 2024 11:46:09 GMT\r
**** c1 rxhdr|Server: s1\r
**** c1 rxhdr|Content-Length: 0\r
**** c1 rxhdr|\r
**** c1 rxhdrlen = 87
**** c1 http[ 0] |HTTP/1.1
**** c1 http[ 1] |200
**** c1 http[ 2] |OK
**** c1 http[ 3] |Date: Mon, 02 Sep 2024 11:46:09 GMT
**** c1 http[ 4] |Server: s1
**** c1 http[ 5] |Content-Length: 0
**** c1 bodylen = 0
** c1 === expect resp.status == 200
**** c1 EXPECT resp.status (200) == "200" match
*** c1 closing fd 4
** c1 Ending
* top RESETTING after uds_connect.vtc
** s1 Waiting for server (3/-1)
* top TEST uds_connect.vtc completed
# top TEST uds_connect.vtc passed (5.004)
Note that the client connected before accept(); the 5s delay was observed during rxresp.
bugwash: add usage instructions to the vtc and skip it; other than the merge conflicts, I think this is ready.
Done.
bugwash: merge