ban-lurker kills child due to Missing errorhandling code in ban_evaluate()
Expected Behavior
The child process should not die while the ban-lurker is processing bans.
Current Behavior
The child process dies, and from the user's perspective it looks as if the whole cache has been flushed.
The bans contained these URL regexps:
- "^\/(v[^\/]\/)(books|articles)(\.json)?\?(.+&)(parameter_list%5B%5D=value1)(&.+)*$",
- "^\/(v[^\/]\/)books\/2001-a-space-oddissey\/data(\.json)?(\?.)?$",
- "^\/(v,[^\/]\/)lists\/(--all|great_of_all_time)\/contents(\.json)?\?(.+&)(element_type=(author|authors))(&.+)*$"
These regexps were the ones that made the child process die. We have reproduced it successfully, including on Varnish 7.7, so the issue has not been fixed.
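To make it easier to see which URLs these bans target, here is a minimal sketch that runs the three patterns through Python's `re` module. This is only an approximation: Varnish 7 matches with PCRE2, not Python's engine, so it says nothing about the error seen in the panic. The sample URLs and the helper name `matching_patterns` are made up for illustration.

```python
import re

# The three ban regexps, copied verbatim from the report above.
BAN_PATTERNS = [
    r"^\/(v[^\/]\/)(books|articles)(\.json)?\?(.+&)(parameter_list%5B%5D=value1)(&.+)*$",
    r"^\/(v[^\/]\/)books\/2001-a-space-oddissey\/data(\.json)?(\?.)?$",
    r"^\/(v,[^\/]\/)lists\/(--all|great_of_all_time)\/contents(\.json)?\?(.+&)(element_type=(author|authors))(&.+)*$",
]


def matching_patterns(url):
    """Return the indexes of the ban patterns that match `url` (hypothetical helper)."""
    return [i for i, pattern in enumerate(BAN_PATTERNS) if re.search(pattern, url)]


# Hypothetical sample URLs, not taken from the reporter's traffic.
print(matching_patterns("/v1/books.json?page=2&parameter_list%5B%5D=value1"))
print(matching_patterns("/v1/books/2001-a-space-oddissey/data.json"))
print(matching_patterns("/v1/unrelated"))
```

Note that the third pattern, as quoted, expects a literal comma after `v`, so a path such as `/v1/lists/...` would not match it.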
Child (13697) Panic at: Wed, 30 Jul 2025 12:34:28 GMT
Missing errorhandling code in ban_evaluate(), cache/cache_ban.c line 574:
Condition(rv >= -1) not true.
version = varnish-7.5.0 revision eef25264e5ca5f96a77129308edb83ccf84cb1b1, vrt api = 19.0
ident = Linux,6.1.0-37-cloud-amd64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 1053305.025056 (mono), 1753878868.729832 (real)
Backtrace:
0x55c49755bcfe: /usr/sbin/varnishd(+0x5ccfe) [0x55c49755bcfe]
0x55c4975dc2c5: /usr/sbin/varnishd(VAS_Fail+0x45) [0x55c4975dc2c5]
0x55c497535867: /usr/sbin/varnishd(+0x36867) [0x55c497535867]
0x55c497537949: /usr/sbin/varnishd(ban_lurker+0x4b9) [0x55c497537949]
0x55c497583d01: /usr/sbin/varnishd(+0x84d01) [0x55c497583d01]
0x7ffb934a81f5: /lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x7ffb934a81f5]
0x7ffb9352889c: /lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x7ffb9352889c]
errno = 110 (Connection timed out)
argv = {
[0] = \"/usr/sbin/varnishd\",
[1] = \"-F\",
[2] = \"-a\",
[3] = \"0.0.0.0:7000\",
[4] = \"-T\",
[5] = \"127.0.0.1:7001\",
[6] = \"-f\",
[7] = \"/etc/varnish/varnish.vcl\",
[8] = \"-S\",
[9] = \"/etc/varnish/secret\",
[10] = \"-p\",
[11] = \"vcc_allow_inline_c=on\",
[12] = \"-p\",
[13] = \"http_req_hdr_len=20000\",
[14] = \"-p\",
[15] = \"http_resp_hdr_len=20000\",
[16] = \"-p\",
[17] = \"feature=+esi_disable_xml_check\",
[18] = \"-p\",
[19] = \"max_esi_depth=10\",
[20] = \"-p\",
[21] = \"feature=+esi_ignore_other_elements\",
[22] = \"-p\",
[23] = \"thread_pool_stack=192k\",
[24] = \"-p\",
[25] = \"ban_lurker_sleep=0.005\",
[26] = \"-p\",
[27] = \"ban_lurker_batch=2000\",
[28] = \"-s\",
[29] = \"malloc,24576m\",
}
pthread.self = 0x7ffb84dff6c0
pthread.name = (ban-lurker)
pthread.attr = {
guard = 4096,
stack_bottom = 0x7ffb84600000,
stack_top = 0x7ffb84e00000,
stack_size = 8388608,
}
thr.req = NULL
thr.busyobj = NULL
thr.worker = NULL
vmods = {
var = {0x7ffb92ed4150, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 19.0},
querystring = {0x7ffb92ed41c0, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 19.0},
std = {0x7ffb92ed4230, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},
},
pools = {
pool = 0x7ffb8bdfd000 {
nidle = 91,
nthr = 100,
lqueue = 0
},
pool = 0x7ffb8bdfd640 {
nidle = 93,
nthr = 100,
lqueue = 0
},
},
Possible Solution
Understand why VRE_match returns an error, and fix it.
We are using Varnish 6 in production without issues, so we suspect it's related to the new PCRE engine.
Steps to Reproduce (for bugs)
- Load the cache with 4M objects
- Send the mentioned bans
Sorry I can't provide more information for security reasons.
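For anyone trying to reproduce this, the cache-loading step could be scripted roughly as below. This is a sketch under assumptions, not the reporter's setup: the listen port comes from the `-a 0.0.0.0:7000` argument in the panic, while the URL shape, the `warmup_urls` and `warm_cache` names, and a backend that answers these URLs with cacheable responses are all hypothetical.

```python
import itertools
import urllib.request


def warmup_urls(count, base="http://127.0.0.1:7000"):
    """Yield `count` distinct URLs shaped like the ones the first ban targets."""
    kinds = itertools.cycle(["books", "articles"])
    for i, kind in zip(range(count), kinds):
        # The `page` parameter only exists to make every URL a separate cache object.
        yield f"{base}/v1/{kind}.json?page={i}&parameter_list%5B%5D=value1"


def warm_cache(count):
    """Fetch every generated URL once so that Varnish caches it."""
    for url in warmup_urls(count):
        with urllib.request.urlopen(url) as response:
            response.read()


# Example call (needs a running Varnish and backend): warm_cache(4_000_000)
```

Fetching 4M objects one at a time is slow, so a real run would probably parallelise the requests; sending the bans afterwards is left out because the report does not say how they were issued.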
Context
The cache was loaded with 4M objects. We tried to reproduce it with just 10k objects and it did not happen. We could only reproduce it after loading 4M objects and then sending the mentioned bans.
Other bans were being sent the whole time; only those seem to break VRE_match.
Varnish Cache version
varnish-7.5.0 revision eef25264e5ca5f96a77129308edb83ccf84cb1b1
Operating system
Debian 12
Source of binary packages used (if any)
No response
We should pick up #4167 again to add an assert telling us the actual return value.
@beltrachi could you check if your machine is running out of memory?
The machine had 64GB available, and metrics showed 18GB in use at that point in time. It did not look like a memory issue. Let me know if I can provide more details.
We have had Varnish 6.x running with the same amount of memory and without issues for 2 years on an m5.4xlarge. Varnish 7.7 was running on an m7i.4xlarge for some days.
We have reproduced it successfully.
@beltrachi can you provide more details on how to reproduce this? I am trying to understand the exact reason for the pcre2_match failure, but I have been unable to reproduce it so far.
bugwash has discussed options to address this, and the safest option seems to be to handle all pcre errors < -1 as "ban the object" (match for ~, no match for !~). But before we add such a potentially impactful change, I^Wwe really want to understand which error we run into, so we will make the necessary changes to make the panic more helpful and then ask you, @beltrachi, to re-run with that change.
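The option described above can be sketched in a few lines. This is only an illustration of the proposed policy, written in Python rather than Varnish's C, and the function name `regex_test_passes` is hypothetical. It assumes the PCRE2 convention that a return value >= 0 is a match, -1 (`PCRE2_ERROR_NOMATCH`) is no match, and anything below -1 is an error such as `PCRE2_ERROR_MATCHLIMIT` (-47).

```python
PCRE2_ERROR_NOMATCH = -1      # regular "no match" result
PCRE2_ERROR_MATCHLIMIT = -47  # one example of a real error code


def regex_test_passes(rv, negated):
    """Decide whether a `~` (negated=False) or `!~` (negated=True) ban test is true.

    `rv` stands for the return value of the regex match.
    """
    if rv < PCRE2_ERROR_NOMATCH:
        # Proposed handling: treat any error as "ban the object",
        # i.e. a match for `~` and no match for `!~`, so the test is true.
        return True
    matched = rv >= 0
    return matched != negated


print(regex_test_passes(PCRE2_ERROR_MATCHLIMIT, negated=False))
print(regex_test_passes(PCRE2_ERROR_MATCHLIMIT, negated=True))
```

With this policy an object whose ban test hits a regex error is removed rather than kept, which costs a cache miss but avoids both the panic and serving content that should have been banned.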
Hi!
@nigoroll that sounds great.
@walid-git we've been thinking about how to help you reproduce it without giving away too much internal information, but filling 4GB of cache is a bit of a challenge. If we make any progress I'll let you know.
Is there any example of cache load script that we could use as template in this case?
I use this for burn-in testing.
Hi,
We did not manage to put together a test script that reproduces it, so we decided to focus on migrating to Varnish 7.
Right now we are not using bans any more to expire cache objects, so from our side this is no longer an issue. I'll close it.
Thank you all for your support ❤️