backend: increase connection count before connect
There is a race between the time we connect to a backend and the time we increase the connection count. This race is not normally observed, but when the .max_connections property is used we could end up creating more connections to the backend than this limit permits. This is problematic, as we end up consuming more system resources, e.g. file descriptors, than reserved for such backends.
The problem is that we release the backend lock and wait for the connect to finish. This can take up to connect_timeout, which defaults to 3.5 seconds, and only then do we count the connection as consumed towards the .max_connections property.
This is a recipe for bursty behavior, where a backend that is already struggling could be overwhelmed by a significant number of connections. Then, as the connection count goes above the ceiling, we either fail fetches or queue them up. However, as soon as we drop below the limit again we over-commit once more, and the cycle repeats.
The test case uses a blackhole backend to simulate hitting connect_timeout; this exercises the race, but under normal circumstances the likelihood of the race occurring depends on the traffic (fetches), networking conditions, and the performance of the backend.
The solution is to increase the connection count before connecting to the backend, and if we fail to get a connection to the backend we revert the count before failing the fetch. This makes any fetches past the .max_connections limit either fail outright or observe the presence of a queue to the backend.
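For illustration, here is a simplified sketch of the new ordering. None of the names below (struct backend, n_conn, max_connections, try_connect(), get_backend_fd()) are the real identifiers; they are hypothetical stand-ins for the counters and locking in the backend code, and this is not the actual diff.

#include <pthread.h>

/* Hypothetical stand-in for the real backend object. */
struct backend {
	pthread_mutex_t	mtx;
	unsigned	n_conn;
	unsigned	max_connections;
};

/* Placeholder for the real connect path; may block up to connect_timeout. */
int try_connect(struct backend *bp);

int
get_backend_fd(struct backend *bp)
{
	int fd;

	pthread_mutex_lock(&bp->mtx);
	if (bp->max_connections > 0 && bp->n_conn >= bp->max_connections) {
		pthread_mutex_unlock(&bp->mtx);
		return (-1);	/* over the limit: fail or queue the fetch */
	}
	bp->n_conn++;		/* reserve the slot before connecting */
	pthread_mutex_unlock(&bp->mtx);

	fd = try_connect(bp);
	if (fd < 0) {
		/* Connect failed: give the slot back before failing the fetch. */
		pthread_mutex_lock(&bp->mtx);
		bp->n_conn--;
		pthread_mutex_unlock(&bp->mtx);
	}
	return (fd);
}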
bugwash:
- skipped, was too close to the bugwash this week.
There are some actual concerns, but in general, I am not happy with the amount of changes this introduces left and right of the pretty basic feature itself. #4030 already contains incremental changes to some regions by different authors, and now a third author comes along and edits it all over again? Um no, my take here is that the PR should aim for a small diff unless there are good reasons not to.
I understand your concerns. I initially only implemented the bug fix, but when I did that I saw the opportunity to make this function simpler. If you do not like the extra work I did here, I can propose that we split this PR into two chunks: one containing a few commits and the bug fix, the other containing the rest of the commits.
Sorry for being late to the #4030 party; this is not an effort to re-implement #4030. As explained here: https://github.com/varnishcache/varnish-cache/pull/4154#discussion_r1715131791 the bug fix allows us to make the connection queue implementation more compact, and I think we should go the extra mile to keep it simple.
I can propose that we split this PR into two chunks.
Yes. I think it would be helpful to have a minimal patch with just the connection count fix and maybe a rework of the CW feature in a second step. Regarding the latter, there are things I do not like, yes, but also from the first look I would think it's wrong. As mentioned before, I would need to look again.
Cool, then I will do that. This includes the first 4 commits: https://github.com/varnishcache/varnish-cache/compare/master...varnishcache:varnish-cache:fe747ac16ef15ef228a5f245fab5bd6372c01f30?w=1 If you are OK with this, then I can go ahead and rebase. I will wait with the rest of the commits until after these 4 commits have been merged.
I ended up reducing the fix to only one commit with a small diff. @nigoroll Please go ahead and let me know if you are happy with it now. Once merged, I will create a new PR with the remaining commits.
As commenting on unchanged lines is not possible: In addition to the issues pointed out, if this was the right approach, an n_conn decrement is missing for the bo->htc == NULL case and the vbe_connwait_dequeue_locked() is still missing. But at any rate, I think the simplification attempted here is not possible.
Oh, this was a miss on my part: when I rebased the original PR to remove the other commits, I forgot to change this. I will push the fix as a new commit.
BTW, this is not a simplification attempt, this is fixing the bug outlined in the description of the commit. The simplification will come later, but let's handle this separately in another PR.
As of edfeb73ebbfbea7db487105a0e7ae12f6706ced8, b68 fails for me locally.
looks like this was because the "blackhole" connection does not time out as intended, but rather fails immediately:
**** v1 vsl| 1002 BereqMethod b GET
**** v1 vsl| 1002 BereqURL b /blackhole
**** v1 vsl| 1002 BereqProtocol b HTTP/1.1
**** v1 vsl| 1002 BereqHeader b Host: 127.0.0.1
**** v1 vsl| 1002 BereqHeader b User-Agent: c1
**** v1 vsl| 1002 BereqHeader b X-Forwarded-For: 127.0.0.1
**** v1 vsl| 1002 BereqHeader b Via: 1.1 v1 (Varnish/trunk)
**** v1 vsl| 1002 BereqHeader b Accept-Encoding: gzip
**** v1 vsl| 1002 BereqHeader b X-Varnish: 1002
**** v1 vsl| 1002 VCL_call b BACKEND_FETCH
**** v1 vsl| 1002 VCL_return b fetch
**** v1 vsl| 1002 Timestamp b Fetch: 1724257240.120347 0.000023 0.000023
**** v1 vsl| 1002 FetchError b backend default: fail errno 101 (Network is unreachable)
**** v1 vsl| 1002 Timestamp b Beresp: 1724257240.130065 0.009741 0.009718
**** v1 vsl| 1002 Timestamp b Error: 1724257240.130068 0.009744 0.000003
...
**** dT 7.624
---- v1 Not true: MAIN.backend_busy (0) == 1 (1)
Maybe it's better to use a backend with a delay?
Full output: https://gist.github.com/nigoroll/1dea1f48af137a0341f54e373cc5aee0
Interesting.
Thanks for testing, the original test case had this part:
backend default {
	# Non-routeable IPv4 address to simulate connect_timeout
	# XXX: ${blackhole_backend}
	.host = "10.255.255.1 80";
	[...]
}
That's when I found ${bad_ip}, something I could use instead. So I rewrote it.
Could you try to change the host in the test case to what I wrote here and try again?
The original idea was to introduce a macro for it, as the IP is non-routeable over the Internet.
A backend with a delay would be too late (I think); you want VTCP_Connect() to hang until the timeout.
I do not think that we will be able to find some magic IP which times out. Maybe we need a facility in varnishtest which listens on a socket, but just never accepts, or something similar?
Nope, this would exercise the first_byte_timeout and not the connect_timeout. That is too late. When a backend listens for connections, a connect will not fail unless the backlog overflows. The kernel could decide to refuse the connection for other reasons, but this is mostly outside the control of the application.
A connect syscall makes the TCP kernel stack initiate the three-way handshake (SYN, SYN-ACK, ACK); a successful call to connect does not require the other end to "accept" the connection. Once the handshake completes, the connection is added to the listen backlog. It stays there until the server accepts the connection.
In order to make VTCP_Connect() fail, the backend needs to either not be in the LISTEN state, or we need to initiate the handshake with a peer that will not respond to the handshake with a SYN-ACK (a blackhole). The problem with the former approach is that it yields an immediate rejection, which is not what we want to test. We could also do something smart with an eBPF program, but that solution would not be portable enough for our liking. That's why I landed on using a "blackhole" IPv4 address.
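For illustration only, here is a rough standalone sketch of what a connect against a blackhole peer looks like outside of varnish. Everything in it is an assumption: 192.0.2.255 is the TEST-NET-1 address used for ${bad_ip}, the 3500 ms budget mirrors the default connect_timeout, and the address must actually be blackholed rather than rejected locally (in the b68 failure above it was rejected immediately with ENETUNREACH).

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sa;
	struct pollfd pfd;
	int fd, r;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	fcntl(fd, F_SETFL, O_NONBLOCK);

	memset(&sa, 0, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_port = htons(80);
	inet_pton(AF_INET, "192.0.2.255", &sa.sin_addr);

	/* The SYN goes out, but a blackhole peer never answers with a SYN-ACK. */
	r = connect(fd, (struct sockaddr *)&sa, sizeof sa);
	if (r < 0 && errno != EINPROGRESS) {
		/* e.g. ENETUNREACH: an immediate failure, not a timeout. */
		perror("connect");
	} else {
		pfd.fd = fd;
		pfd.events = POLLOUT;
		if (poll(&pfd, 1, 3500) == 0)
			printf("still pending after 3500 ms"
			    " (simulated connect_timeout)\n");
	}
	close(fd);
	return (0);
}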
Can't we do that with raw sockets maybe?
I am not sure I understand; you normally use raw sockets if you want to implement your own transport over IP. How is this relevant here? Varnish only supports TCP & UDS backends, and I do not think we should roll our own protocol for this.
What I meant is that we are currently limited by the kernel taking care of connection establishment and doing the handshake for us, and that is what we want to prevent. I'm not suggesting implementing a protocol here; we only want to receive the TCP packet containing the SYN flag from the client and simply ignore it, so that we can emulate a connection timeout. That's why I was suggesting raw sockets, but I'm not sure whether this can be achieved.
what about this?
Maybe we need a facility in varnishtest which listens on a socket, but just never accepts, or something similar?
can't we already do this with a vtc_server? if not, I would think the adjustments should be fairly minimal.
This wouldn't work with a regular TCP socket: as soon as you have a listening socket, all connect calls to that socket will succeed (regardless of whether you call accept or not), or fail immediately if the backlog is full.
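As a small standalone illustration of that point (not varnish code; plain BSD sockets on the loopback address, all names local to this sketch), the client's connect() succeeds as soon as the kernel completes the handshake, even though the server never calls accept():

#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sa;
	socklen_t len = sizeof sa;
	int srv, cli;

	/* Server side: bind and listen, but never call accept(). */
	srv = socket(AF_INET, SOCK_STREAM, 0);
	assert(srv >= 0);
	memset(&sa, 0, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sa.sin_port = 0;	/* let the kernel pick a free port */
	assert(bind(srv, (struct sockaddr *)&sa, sizeof sa) == 0);
	assert(listen(srv, 1) == 0);
	assert(getsockname(srv, (struct sockaddr *)&sa, &len) == 0);

	/* Client side: connect completes without the server ever accepting. */
	cli = socket(AF_INET, SOCK_STREAM, 0);
	assert(cli >= 0);
	if (connect(cli, (struct sockaddr *)&sa, sizeof sa) == 0)
		printf("connected, yet accept() was never called\n");

	close(cli);
	close(srv);
	return (0);
}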
also with uds?
This is not how the connect syscall works (at least not on Linux). The connect syscall associates the socket's file descriptor (fd) with a peer address. You can send off your request before your peer has accepted you, but the response will not return before you have been dequeued from the backlog and a response is written to the socket. You wait for your response in read/recv... It is true that the connect syscall can hang, but that is usually due to a failing handshake at the transport level.
Patch to add a 5s sleep before we accept the client in the server block:
$ git diff
diff --git a/bin/varnishtest/vtc_server.c b/bin/varnishtest/vtc_server.c
index 45913aaa7..66e2bf9f1 100644
--- a/bin/varnishtest/vtc_server.c
+++ b/bin/varnishtest/vtc_server.c
@@ -244,6 +244,7 @@ server_conn(void *priv, struct vtclog *vl)
 	CAST_OBJ_NOTNULL(s, priv, SERVER_MAGIC);
+	sleep(5);
 	addr = (void*)&addr_s;
 	l = sizeof addr_s;
 	fd = accept(s->sock, addr, &l);
Here is the VTC:
varnishtest "UDS connect"
server s1 -listen "${tmpdir}/uds" {
rxreq
txresp
} -start
client c1 -connect "${s1_sock}" {
timeout 10
txreq
rxresp
expect resp.status == 200
} -run
Logs:
$ bin/varnishtest/varnishtest -i uds_connect.vtc -v
**** dT 0.000
* top TEST uds_connect.vtc starting
**** top extmacro def pkg_version=trunk
**** top extmacro def pkg_branch=trunk
**** top extmacro def pwd=/home/asadsa/varnish-cache
**** top extmacro def date(...)
**** top extmacro def string(...)
**** top extmacro def localhost=127.0.0.1
**** top extmacro def bad_backend=127.0.0.1:35537
**** top extmacro def listen_addr=127.0.0.1:0
**** top extmacro def bad_ip=192.0.2.255
**** top extmacro def topbuild=/home/asadsa/varnish-cache
**** top extmacro def topsrc=/home/asadsa/varnish-cache
**** top macro def testdir=/home/asadsa/varnish-cache
**** top macro def tmpdir=/tmp/vtc.2188850.221ec1e4
**** top macro def vtcid=vtc.2188850.221ec1e4
** top === varnishtest "UDS connect"
* top VTEST UDS connect
** top === server s1 -listen "${tmpdir}/uds" {
** s1 Starting server
**** s1 macro def s1_addr=0.0.0.0
**** s1 macro def s1_port=0
**** s1 macro def s1_sock=/tmp/vtc.2188850.221ec1e4/uds
* s1 Listen on /tmp/vtc.2188850.221ec1e4/uds
** top === client c1 -connect "${s1_sock}" {
** c1 Starting client
** c1 Waiting for client
** s1 Started on /tmp/vtc.2188850.221ec1e4/uds (1 iterations)
** c1 Started on /tmp/vtc.2188850.221ec1e4/uds (1 iterations)
*** c1 Connect to /tmp/vtc.2188850.221ec1e4/uds
*** c1 connected fd 4 to /tmp/vtc.2188850.221ec1e4/uds
** c1 === timeout 10
** c1 === txreq
**** dT 0.001
**** c1 txreq|GET / HTTP/1.1\r
**** c1 txreq|Host: 127.0.0.1\r
**** c1 txreq|User-Agent: c1\r
**** c1 txreq|\r
** c1 === rxresp
**** dT 5.001
*** s1 accepted fd 5 0.0.0.0 0
** s1 === rxreq
**** s1 rxhdr|GET / HTTP/1.1\r
**** s1 rxhdr|Host: 127.0.0.1\r
**** s1 rxhdr|User-Agent: c1\r
**** s1 rxhdr|\r
**** s1 rxhdrlen = 51
**** s1 http[ 0] |GET
**** s1 http[ 1] |/
**** s1 http[ 2] |HTTP/1.1
**** s1 http[ 3] |Host: 127.0.0.1
**** s1 http[ 4] |User-Agent: c1
**** s1 bodylen = 0
** s1 === txresp
**** s1 txresp|HTTP/1.1 200 OK\r
**** s1 txresp|Date: Mon, 02 Sep 2024 11:46:09 GMT\r
**** s1 txresp|Server: s1\r
**** s1 txresp|Content-Length: 0\r
**** s1 txresp|\r
**** dT 5.002
*** s1 shutting fd 5
** s1 Ending
**** c1 rxhdr|HTTP/1.1 200 OK\r
**** c1 rxhdr|Date: Mon, 02 Sep 2024 11:46:09 GMT\r
**** c1 rxhdr|Server: s1\r
**** c1 rxhdr|Content-Length: 0\r
**** c1 rxhdr|\r
**** c1 rxhdrlen = 87
**** c1 http[ 0] |HTTP/1.1
**** c1 http[ 1] |200
**** c1 http[ 2] |OK
**** c1 http[ 3] |Date: Mon, 02 Sep 2024 11:46:09 GMT
**** c1 http[ 4] |Server: s1
**** c1 http[ 5] |Content-Length: 0
**** c1 bodylen = 0
** c1 === expect resp.status == 200
**** c1 EXPECT resp.status (200) == "200" match
*** c1 closing fd 4
** c1 Ending
* top RESETTING after uds_connect.vtc
** s1 Waiting for server (3/-1)
* top TEST uds_connect.vtc completed
# top TEST uds_connect.vtc passed (5.004)
Note that the client connected before accept(); the 5s delay was observed during rxresp.
bugwash: add usage instructions to the vtc and skip it; other than the merge conflicts, I think this is ready.
Done.
bugwash: merge