Unexpected VCL subroutine flow
Expected Behavior
Requests served off the waiting list should not be reported as cache hits.
Current Behavior
After a purge request to a URL, the next request gets a MISS and a backend fetch as expected, but subsequent requests get a HIT VCL_call even though they are on the waiting list.
Possible Solution
No response
Steps to Reproduce (for bugs)
No response
Context
We are trying to troubleshoot a scenario where varnishlog shows that an object was served from cache, but in reality the fetch time was quite long because the request sat on the waiting list.
We set a special response header called X-Cache-Status:
sub vcl_deliver {
    if (obj.hits > 0) {
        set resp.http.X-UPS = "V";
        set resp.http.X-Cache-Status = "HIT";
        unset resp.http.X-UPS-RESTIME;
    } else {
        set resp.http.X-Cache-Status = "MISS";
        set resp.http.X-UPS = "M";
    }
}
The varnishlog output below shows a cache purge, then a miss on the first call, then a hit on the second call while that request is on the waiting list. In reality it is not a HIT, so my question is why VCL_HIT is called here.
Cache purge request
* << Request >> 862691
- Begin req 862690 rxreq
- Timestamp Start: 1723626741.517823 0.000000 0.000000
- Timestamp Req: 1723626741.517823 0.000000 0.000000
- VCL_use varnish_1723557604
- ReqStart 10.26.31.38 54314 a0
- ReqMethod PURGE
- ReqURL /
- ReqProtocol HTTP/1.1
- ReqHeader X-Magento-Tags-Pattern: ((^|,)cat_c_p_22(,|$))
- ReqHeader Host: XXX
- ReqHeader X-Forwarded-For: XXX
- VCL_call RECV
- ReqHeader disable_http_basic: 1
- VCL_return synth
- VCL_call HASH
- VCL_return lookup
- RespProtocol HTTP/1.1
- RespStatus 200
- RespReason Purged
- RespHeader Date: Wed, 14 Aug 2024 09:12:21 GMT
- RespHeader Server: Varnish
- RespHeader X-Varnish: 862691
- VCL_call SYNTH
- RespHeader Content-Type: text/html; charset=utf-8
- RespHeader Retry-After: 5
- VCL_return deliver
- Timestamp Process: 1723626741.517868 0.000045 0.000045
- RespHeader Content-Length: 241
- Storage malloc Transient
- Filters
- RespHeader Connection: keep-alive
- Timestamp Resp: 1723626741.517892 0.000069 0.000024
- ReqAcct 95 0 95 197 241 438
- End
Fresh call to /napoje - this is a MISS as expected
* << Request >> 2387316
- Begin req 862673 rxreq
- Timestamp Start: 1723626735.944309 0.000000 0.000000
- Timestamp Req: 1723626735.944309 0.000000 0.000000
- VCL_use varnish_1723557604
- ReqStart 10.26.19.1 34820 a0
- ReqMethod GET
- ReqURL /napoje
- ReqProtocol HTTP/1.1
- ReqHeader Host: XXX
- ReqHeader X-Request-ID: 5b2a2082bd338f875428ed54a37213d4
- ReqHeader X-Real-IP: XXX
- ReqHeader X-Forwarded-For: XXX
- ReqHeader X-Forwarded-Host: XXX
- ReqHeader X-Forwarded-Port: 443
- ReqHeader X-Forwarded-Proto: https
- ReqHeader X-Forwarded-Scheme: https
- ReqHeader X-Scheme: https
- ReqHeader X-Original-Forwarded-For: XXX
- ReqHeader Sec-CH-UA: "Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"
- ReqHeader sec-ch-ua-mobile: ?0
- ReqHeader sec-ch-ua-platform: "Linux"
- ReqHeader upgrade-insecure-requests: 1
- ReqHeader User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
- ReqHeader Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
- ReqHeader sec-fetch-site: same-origin
- ReqHeader sec-fetch-mode: navigate
- ReqHeader sec-fetch-user: ?1
- ReqHeader sec-fetch-dest: document
- ReqHeader Accept-Language: pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7
- ReqHeader priority: u=0, i
- ReqHeader Accept-Encoding: gzip
- ReqHeader Cache-Control: no-cache, max-age=0
- VCL_call RECV
- ReqUnset X-Forwarded-For: XXX
- ReqHeader X-Forwarded-For: XXX
- ReqUnset X-Real-IP: XXX
- ReqHeader X-Real-Ip: XXX
- ReqURL /napoje
- ReqUnset Accept-Encoding: gzip
- ReqHeader Accept-Encoding: gzip
- ReqURL /napoje
- VCL_return hash
- VCL_call HASH
- VCL_return lookup
- VCL_call MISS
- VCL_return fetch
- Link bereq 2387317 fetch
- Timestamp Fetch: 1723626742.356807 6.412498 6.412498
- RespProtocol HTTP/1.1
- RespStatus 200
- RespReason OK
- RespHeader Server: nginx
- RespHeader Date: Wed, 14 Aug 2024 09:12:22 GMT
- RespHeader Content-Type: text/html; charset=UTF-8
- RespHeader Edge-Control: bypass-cache
- RespHeader X-Content-Type-Options: nosniff
- RespHeader X-XSS-Protection: 1; mode=block
- RespHeader X-Frame-Options: SAMEORIGIN
- RespHeader Vary: Accept-Encoding
- RespHeader X-UPS-RESTIME: 6.354
- RespHeader X-UPS: M
- RespHeader Pragma: no-cache
- RespHeader Expires: -1
- RespHeader Cache-Control: no-store, no-cache, must-revalidate, max-age=0
- RespHeader backend: default
- RespHeader Content-Encoding: gzip
- RespHeader X-Varnish: 2387316
- RespHeader Age: 0
- RespHeader Accept-Ranges: bytes
- VCL_call DELIVER
- RespUnset Age: 0
- RespHeader X-Cache-Status: MISS
- RespUnset X-UPS: M
- RespHeader X-UPS: M
- RespUnset Server: nginx
- RespUnset X-Varnish: 2387316
- VCL_return deliver
- Timestamp Process: 1723626742.356844 6.412534 0.000036
- Filters
- RespHeader Content-Length: 114887
- RespHeader Connection: keep-alive
- Timestamp Resp: 1723626742.356931 6.412621 0.000087
- ReqAcct 2416 0 2416 6231 114887 121118
- End
Second call to /napoje (meanwhile the previous request is still fetching the page from the backend, ~6 seconds)
* << Request >> 1538196
- Begin req 1538181 rxreq
- Timestamp Start: 1723626736.342492 0.000000 0.000000
- Timestamp Req: 1723626736.342492 0.000000 0.000000
- VCL_use varnish_1723557604
- ReqStart 10.26.11.78 54734 a0
- ReqMethod GET
- ReqURL /napoje
- ReqProtocol HTTP/1.1
- ReqHeader Host: XXX
- ReqHeader X-Request-ID: cf20f0e2e9ef40211ae49485149abe50
- ReqHeader X-Real-IP: XXX
- ReqHeader X-Forwarded-For: XXX
- ReqHeader X-Forwarded-Host: XXX
- ReqHeader X-Forwarded-Port: 443
- ReqHeader X-Forwarded-Proto: https
- ReqHeader X-Forwarded-Scheme: https
- ReqHeader X-Scheme: https
- ReqHeader X-Original-Forwarded-For: XXX
- ReqHeader Sec-CH-UA: "Not)A;Brand";v="99", "Android WebView";v="127", "Chromium";v="127"
- ReqHeader sec-ch-ua-mobile: ?1
- ReqHeader sec-ch-ua-platform: "Android"
- ReqHeader upgrade-insecure-requests: 1
- ReqHeader User-Agent: Mozilla/5.0 (Linux; Android 11; SM-A202F Build/RP1A.200720.012; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/127.0.6533.97 Mobile Safari/537.36 [FB_IAB/FB4A;FBAV/476.0.0.49.74;]
- ReqHeader Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
- ReqHeader Accept-Language: cs-CZ,cs;q=0.9,en-US;q=0.8,en;q=0.7
- ReqHeader priority: u=0, i
- VCL_call RECV
- ReqUnset X-Forwarded-For: XXX
- ReqHeader X-Forwarded-For:XXX
- ReqUnset X-Real-IP: XXX
- ReqHeader X-Real-Ip: XXX
- ReqURL /napoje
- ReqUnset Accept-Encoding: gzip
- ReqHeader Accept-Encoding: gzip
- ReqURL /napoje
- VCL_return hash
- VCL_call HASH
- VCL_return lookup
- Hit 2387317 86405.956800 60.000000 0.000000 114887
- Timestamp Waitinglist: 1723626742.356823 6.014330 6.014330
- VCL_call HIT
- VCL_return deliver
- RespProtocol HTTP/1.1
- RespStatus 200
- RespReason OK
- RespHeader Server: nginx
- RespHeader Date: Wed, 14 Aug 2024 09:12:22 GMT
- RespHeader Content-Type: text/html; charset=UTF-8
- RespHeader Edge-Control: bypass-cache
- RespHeader X-Content-Type-Options: nosniff
- RespHeader X-XSS-Protection: 1; mode=block
- RespHeader X-Frame-Options: SAMEORIGIN
- RespHeader Vary: Accept-Encoding
- RespHeader X-UPS-RESTIME: 6.354
- RespHeader X-UPS: M
- RespHeader Pragma: no-cache
- RespHeader Expires: -1
- RespHeader Cache-Control: no-store, no-cache, must-revalidate, max-age=0
- RespHeader backend: default
- RespHeader Content-Encoding: gzip
- RespHeader X-Varnish: 1538196 2387317
- RespHeader Age: 0
- RespHeader Accept-Ranges: bytes
- VCL_call DELIVER
- RespUnset Age: 0
- RespUnset X-UPS: M
- RespHeader X-UPS: V
- RespHeader X-Cache-Status: HIT
- RespUnset X-UPS-RESTIME: 6.354
- RespHeader Referrer-Policy: no-referrer-when-downgrade
- RespUnset Server: nginx
- RespUnset X-Varnish: 1538196 2387317
- VCL_return deliver
- Timestamp Process: 1723626742.356853 6.014360 0.000030
- Filters
- RespHeader Content-Length: 114887
- RespHeader Connection: keep-alive
- Timestamp Resp: 1723626742.356943 6.014450 0.000089
- ReqAcct 3634 0 3634 6208 114887 121095
- End
Varnish Cache version
7.5.0
Operating system
Debian 12
Source of binary packages used (if any)
No response
A slightly relevant blog post: https://info.varnish-software.com/blog/using-obj-hits
It is a hit though: the object is found in the cache, albeit in a currently incomplete form. One could argue that vcl_hit could offer an obj.still_downloading boolean property, though.
side note: the bereq (for example via varnishlog -g request) would have been helpful.
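For reference, one way to capture both the client side and the backend fetch of such a transaction group in one place; the URL filter below is only an example based on this report:

```shell
# Group client requests with their backend fetches so the bereq timing
# (and the Timestamp:Waitinglist record) appears in one transaction tree.
# The -q VSL query restricting output to /napoje is just for this example.
varnishlog -g request -q 'ReqURL eq "/napoje"'
```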
Test case (written by @AlveElde , slightly modified by myself):
varnishtest "Cache HIT on shortlived error object with waitinglist"

server s1 {
    rxreq
    delay 10
} -start

varnish v1 -vcl+backend {
    sub vcl_backend_fetch {
        set bereq.first_byte_timeout = 1s;
    }

    sub vcl_deliver {
        set resp.http.hits = obj.hits;
        set resp.http.uncacheable = obj.uncacheable;
        set resp.http.x-cache = req.http.x-cache;
    }

    sub vcl_hit {
        set req.http.x-cache = "HIT";
    }

    sub vcl_miss {
        set req.http.x-cache = "MISS";
    }
} -start

client c1 {
    txreq
    rxresp
    expect resp.status == 503
    expect resp.http.hits == 0
    expect resp.http.uncacheable == false
    expect resp.http.x-cache == "MISS"
} -start

delay 0.5

client c2 {
    txreq
    rxresp
    expect resp.status == 503
    expect resp.http.hits == 1
    expect resp.http.uncacheable == false
    expect resp.http.x-cache == "HIT"
} -start

client c1 -wait
client c2 -wait

varnish v1 -expect MAIN.cache_hitmiss == 0
varnish v1 -expect MAIN.cache_miss == 1
varnish v1 -expect MAIN.cache_hit == 1
varnish v1 -expect MAIN.cache_hit_grace == 0
varnish v1 -expect MAIN.busy_sleep == 1
Bugwash: Things seem to work as they should, but we may be calling something a hit (in the VSC counters) which is not what users would think of as a "hit", since delivery isn't immediate.
Return to this after thinking about / looking at the statistics.
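For context, the counters under discussion can be inspected with varnishstat; the field list below is simply the subset relevant to this thread (names as they exist in Varnish 7.x):

```shell
# One-shot dump of the hit/miss counters plus the waiting-list related
# "busy" counters mentioned in this issue.
varnishstat -1 -f MAIN.cache_hit -f MAIN.cache_miss \
    -f MAIN.cache_hit_grace -f MAIN.busy_sleep -f MAIN.busy_wakeup
```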
I would suggest the following:
- rename `MAIN.busy_sleep` to `MAIN.cache_hit_coalesce`
- add `obj.was_coalescing` analogous to `obj.is_hitmiss` etc.
But regarding the OP's main point, I think that a coalescing (aka waitinglist) hit will need to continue ending up in vcl_hit, because that is what it is, even if it has to wait.
I would need to dig up my 242-slides presentation from a couple of VDDs ago, but I'm pretty sure something like that was part of the plan laid out.
I couldn't find such a flag in my presentation.
I would suggest the following:
* rename `MAIN.busy_sleep` to `MAIN.cache_hit_coalesce`
* add `obj.was_coalescing` analogous to `obj.is_hitmiss` etc.
If we rename a busy counter, we should rename all of them to reflect their relation to the waiting list so I would prefer something like cache_hit_waitlist and other counters renamed to waitlist_something. See #4073 for the gory details, but "busy" was conflated between busyobj and waiting list and created a confusion in this area.
Regarding obj.was_coalescing I think this is taking the problem from the wrong end and I would prefer something like req.was_waiting to better reflect what went on.
I am also ok with @dridi's suggestion to consistently rename to waitlist instead of coalesce, if that was thought to be more descriptive. My personal opinion is that the waiting list is only the implementation, but the mechanism is coalescing, so I prefer the latter.
Regarding whether the flag should reside under obj or req, I think it is related: req.was_waiting somehow makes sense but sounds very general ("and what was the request waiting for?"), while req.was_coalescing in my mind also takes the problem from the wrong end, because it was the object that was the result of coalescing backend requests. On a side note, for restarts I think we would need to clear a request flag to signal exactly during which iteration the waitinglist was entered, and that, again, seems un-POLA to me, because then the request might report was_waiting = false when in fact it did wait before.
So all in all, obj.was_coalescing actually seems to be more descriptive in my mind still, but I can go with whatever.
I think I'm actually against cache_hit_waitlist because <counter>_<specialization> is supposed to represent a subset of <counter> and a waiting list hit is just a hit, so it could very well be a grace hit, and it shouldn't be presented as a sibling of cache_hit_grace.
If we touch one of the busy_* counters, they should probably all use the same waitlist_ or waitinglist_ prefix. For now we should probably leave them alone to avoid breaking setups for no good reason (we don't have VSC aliases for compatibility).
Regarding the original problem of long hit transactions, the same can happen without involving a waiting list. A hit on an object in the BOS_STREAM state (ongoing fetch) can also appear to take a long time in varnishlog. Imagine a backend serving the response headers fast, but taking a long time to produce the response body, or simply a slow or long fetch in general.
Now regarding req.<new_waitinglist_flag_tbd>, something potentially more useful could instead be a new variable req.waiting_list_time telling you the amount of time spent on a waiting list. You could probably already craft this by storing now in vcl_recv and computing the difference in vcl_{hit,miss,pass}. Keeping track of a duration also translates fairly easily to a boolean:
if (req.waiting_list_time > 0) {
    # ...
}
I'm aware that a DURATION called "something_time" can be confusing, I'm just not making an effort to find a name for the concept.
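The "craft it yourself" idea above could look roughly like this in VCL. This is a sketch, not a tested implementation: the header names are made up, and it measures total time from vcl_recv to vcl_hit (which includes more than just the waiting list, but a multi-second value on a hit is a strong waiting-list signal):

```vcl
vcl 4.1;
import std;

sub vcl_recv {
    # Remember when the request entered VCL (epoch seconds as a string).
    # x-recv-epoch is a hypothetical internal header name.
    set req.http.x-recv-epoch = std.time2real(now, 0.0);
}

sub vcl_hit {
    # TIME - TIME yields a DURATION; a "hit" that slept on the waiting
    # list will show seconds here instead of microseconds.
    set req.http.x-lookup-took =
        now - std.real2time(std.real(req.http.x-recv-epoch, 0.0), now);
}
```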