varnish-cache ban-lurker kills child due to Missing errorhandling code in ban_evaluate()

Expected Behavior

The child process should not die while the ban lurker is processing bans.

Current Behavior

The child process dies, and from the user's point of view it looks as if the whole cache has been flushed.

The bans contained these URL regexps:

"^\/(v[^\/]\/)(books|articles)(\.json)?\?(.+&)(parameter_list%5B%5D=value1)(&.+)*$",
"^\/(v[^\/]\/)books\/2001-a-space-oddissey\/data(\.json)?(\?.)?$",
"^\/(v,[^\/]\/)lists\/(--all|great_of_all_time)\/contents(\.json)?\?(.+&)(element_type=(author|authors))(&.+)*$"

These regexps were the ones that made the child process die. We have reproduced it successfully. It also reproduces in Varnish 7.7, so the issue has not been fixed.
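
As a rough sanity check of what these patterns target, here is a minimal sketch. It uses Python's `re` module rather than the PCRE2 engine behind VRE_match, so it only illustrates the matching semantics and cannot reproduce the panic. The sample URLs and the helper name `matching_patterns` are made up for illustration.

```python
import re

# The three ban patterns, copied verbatim from the list above.
PATTERNS = [
    r"^\/(v[^\/]\/)(books|articles)(\.json)?\?(.+&)(parameter_list%5B%5D=value1)(&.+)*$",
    r"^\/(v[^\/]\/)books\/2001-a-space-oddissey\/data(\.json)?(\?.)?$",
    r"^\/(v,[^\/]\/)lists\/(--all|great_of_all_time)\/contents(\.json)?\?(.+&)(element_type=(author|authors))(&.+)*$",
]

# Hypothetical sample URLs, invented for illustration only.
SAMPLES = [
    "/v1/books.json?page=2&parameter_list%5B%5D=value1",
    "/v1/books?parameter_list%5B%5D=value1",
    "/v1/books/2001-a-space-oddissey/data.json",
    "/v1/lists/--all/contents?page=2&element_type=author",
    "/v,1/lists/--all/contents?page=2&element_type=author",
]


def matching_patterns(url):
    """Return the indexes of the patterns that match `url`."""
    return [i for i, p in enumerate(PATTERNS) if re.search(p, url)]


for url in SAMPLES:
    print(url, "->", matching_patterns(url))
```

Under these assumptions, the first and third patterns only match when at least one query parameter precedes the one being tested, because `(.+&)` is mandatory. As written, the third pattern also requires a literal comma after `v`, so a path starting with `/v1/lists/` does not match it.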

Child (13697) Panic at: Wed, 30 Jul 2025 12:34:28 GMT
Missing errorhandling code in ban_evaluate(), cache/cache_ban.c line 574:
  Condition(rv >= -1) not true.
version = varnish-7.5.0 revision eef25264e5ca5f96a77129308edb83ccf84cb1b1, vrt api = 19.0
ident = Linux,6.1.0-37-cloud-amd64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 1053305.025056 (mono), 1753878868.729832 (real)
Backtrace:
  0x55c49755bcfe: /usr/sbin/varnishd(+0x5ccfe) [0x55c49755bcfe]
  0x55c4975dc2c5: /usr/sbin/varnishd(VAS_Fail+0x45) [0x55c4975dc2c5]
  0x55c497535867: /usr/sbin/varnishd(+0x36867) [0x55c497535867]
  0x55c497537949: /usr/sbin/varnishd(ban_lurker+0x4b9) [0x55c497537949]
  0x55c497583d01: /usr/sbin/varnishd(+0x84d01) [0x55c497583d01]
  0x7ffb934a81f5: /lib/x86_64-linux-gnu/libc.so.6(+0x891f5) [0x7ffb934a81f5]
  0x7ffb9352889c: /lib/x86_64-linux-gnu/libc.so.6(+0x10989c) [0x7ffb9352889c]
errno = 110 (Connection timed out)
argv = {
  [0] = \"/usr/sbin/varnishd\",
  [1] = \"-F\",
  [2] = \"-a\",
  [3] = \"0.0.0.0:7000\",
  [4] = \"-T\",
  [5] = \"127.0.0.1:7001\",
  [6] = \"-f\",
  [7] = \"/etc/varnish/varnish.vcl\",
  [8] = \"-S\",
  [9] = \"/etc/varnish/secret\",
  [10] = \"-p\",
  [11] = \"vcc_allow_inline_c=on\",
  [12] = \"-p\",
  [13] = \"http_req_hdr_len=20000\",
  [14] = \"-p\",
  [15] = \"http_resp_hdr_len=20000\",
  [16] = \"-p\",
  [17] = \"feature=+esi_disable_xml_check\",
  [18] = \"-p\",
  [19] = \"max_esi_depth=10\",
  [20] = \"-p\",
  [21] = \"feature=+esi_ignore_other_elements\",
  [22] = \"-p\",
  [23] = \"thread_pool_stack=192k\",
  [24] = \"-p\",
  [25] = \"ban_lurker_sleep=0.005\",
  [26] = \"-p\",
  [27] = \"ban_lurker_batch=2000\",
  [28] = \"-s\",
  [29] = \"malloc,24576m\",
}
pthread.self = 0x7ffb84dff6c0
pthread.name = (ban-lurker)
pthread.attr = {
  guard = 4096,
  stack_bottom = 0x7ffb84600000,
  stack_top = 0x7ffb84e00000,
  stack_size = 8388608,
}
thr.req = NULL
thr.busyobj = NULL
thr.worker = NULL
vmods = {
  var = {0x7ffb92ed4150, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 19.0},
  querystring = {0x7ffb92ed41c0, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 19.0},
  std = {0x7ffb92ed4230, Varnish 7.5.0 eef25264e5ca5f96a77129308edb83ccf84cb1b1, 0.0},
},
pools = {
  pool = 0x7ffb8bdfd000 {
    nidle = 91,
    nthr = 100,
    lqueue = 0
  },
  pool = 0x7ffb8bdfd640 {
    nidle = 93,
    nthr = 100,
    lqueue = 0
  },
},

Possible Solution

Understand why VRE_match returns an error, and fix it.

We are using Varnish 6 in production without issues, so we suspect it's related to the new PCRE engine.

Steps to Reproduce (for bugs)

Load the cache with 4M objects
Send the mentioned bans

Sorry, I can't provide more information for security reasons.

Context

The cache was loaded with 4M objects. We tried to reproduce it with just 10k objects and it did not happen. We could only reproduce it after loading 4M objects and then sending the mentioned bans.

Other bans were being sent the whole time; only these seem to break VRE_match.
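
One way to approximate this load without production data is to request a large number of distinct synthetic URLs. The following is a minimal sketch and is not known to reproduce the panic. It assumes a backend that answers any path with a cacheable response; the URL shape and the function names `synthetic_urls` and `warm_cache` are hypothetical.

```python
import urllib.request


def synthetic_urls(count, version="v1"):
    """Yield `count` distinct request paths (hypothetical URL shape)."""
    for i in range(count):
        kind = "books" if i % 2 == 0 else "articles"
        yield f"/{version}/{kind}.json?page={i}&parameter_list%5B%5D=value1"


def warm_cache(base_url, count):
    """Request every synthetic path once so Varnish can cache the response."""
    for path in synthetic_urls(count):
        with urllib.request.urlopen(base_url + path, timeout=5) as resp:
            resp.read()  # drain the body so the fetch completes
```

Calling `warm_cache("http://127.0.0.1:7000", 4_000_000)` would target the listen port from the argv in the panic above. The requests run sequentially, so a real load test would likely want some concurrency.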

Varnish Cache version

varnish-7.5.0 revision eef25264e5ca5f96a77129308edb83ccf84cb1b1

Operating system

Debian 12

Source of binary packages used (if any)

No response

Aug 06 '25 11:08 beltrachi

we should pick up #4167 again to add an assert telling us the actual return value.

@beltrachi could you check if your machine is running out of memory?

Aug 12 '25 09:08 nigoroll

> @beltrachi could you check if your machine is running out of memory?

The machine had 64GB available, and metrics showed 18GB in use at that point in time, so it did not look like a memory issue. Let me know if I can provide more details.

We have had Varnish 6.x running with the same amount of memory and no issues for 2 years on an m5.4xlarge. Varnish 7.7 ran on an m7i.4xlarge for some days.

Aug 12 '25 10:08 beltrachi

> We have reproduced it successfully

@beltrachi can you provide more details on how to reproduce this? I am trying to understand the exact reason for the pcre2_match failure, but I have been unable to reproduce it so far.

Aug 19 '25 08:08 walid-git

bugwash has discussed options to address this, and the safest option seems to be to handle all pcre errors < -1 as "ban the object" (match for ~, no match for !~). But before we add such a potentially impactful change, I^Wwe really want to understand which error we run into, so we will make the necessary changes to make the panic more helpful and then ask you, @beltrachi, to re-run with that change.
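
A minimal sketch of that fallback, written in Python for illustration and not taken from the Varnish source. It assumes the match return value follows the pcre2_match() convention (>= 0 for a match, -1 for no match, anything lower for an error); the names `ban_test_satisfied` and `describe` and the message wording are hypothetical.

```python
NOMATCH = -1  # PCRE2_ERROR_NOMATCH; values below it are errors


def ban_test_satisfied(rv, negated):
    """Decide whether a regex ban test is satisfied.

    rv:      return value of the regex match (>= 0 match, -1 no match,
             < -1 error), following the pcre2_match() convention.
    negated: False for the `~` operator, True for `!~`.
    """
    if rv < NOMATCH:
        # Proposed fallback: an error counts as a match for `~` and as
        # no match for `!~`, so the test is satisfied either way.
        return True
    matched = rv >= 0
    return not matched if negated else matched


def describe(rv):
    """Message a more helpful panic could carry (hypothetical wording)."""
    return f"regex match returned {rv}, expected >= {NOMATCH}"
```

With this rule an object hit by a regex error is always dropped rather than kept, which trades some cache content for never serving an object that a ban might have been meant to remove.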

Aug 25 '25 13:08 nigoroll

Hi!

@nigoroll that sounds great.

@walid-git we've been thinking about how to help you reproduce it without giving away too much internal information, but filling 4GB of cache is a bit of a challenge. If we make any progress I'll let you know.

Is there an example of a cache load script that we could use as a template in this case?

Aug 25 '25 14:08 beltrachi

> Is there an example of a cache load script that we could use as a template in this case?

I use this for burn-in testing.

Aug 25 '25 17:08 nigoroll

Hi,

We did not manage to get a testing script that reproduces it, so we decided to focus on migrating to Varnish 7.

Right now we are no longer using bans to expire cache objects, so from our side this is not an issue any more. I'll close it.

Thank you all for your support ❤️

Jan 07 '26 08:01 beltrachi
