Varnish 7.1.1 panic - cache/cache_wrk.c:629
Today during high load Varnish dropped all connections and panic.show displays:
Panic at: Tue, 22 Nov 2022 19:58:15 GMT
Wrong turn at cache/cache_wrk.c:629:
Worker Pool Queue does not move
version = varnish-7.1.1 revision 7cee1c581bead20e88d101ab3d72afb29f14d87a, vrt api = 15.0
ident = Linux,5.4.172-90.336.amzn2.x86_64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 1849122.075794 (mono), 1669147095.298170 (real)
Backtrace:
ip=0x55da1c2b8d30 sp=0x7f0cb4433870 <PAN__DumpStruct+0x110b>
ip=0x55da1c314ab7 sp=0x7f0cb44339b0 <VAS_Fail+0x4d>
ip=0x55da1c2d3da2 sp=0x7f0cb44339f0 <pool_herder+0x66>
ip=0x7f0cba16b1f5 sp=0x7f0cb4433ac0 <pthread_exit+0x280>
errno = 110 (Operation timed out)
argv = {
[0] = \"varnishd\",
[1] = \"-F\",
[2] = \"-f\",
[3] = \"/data/varnish.vcl\",
[4] = \"-s\",
[5] = \"malloc,4096m\",
[6] = \"-p\",
[7] = \"default_ttl=3600\",
[8] = \"-p\",
[9] = \"default_grace=3600\",
[10] = \"-p\",
[11] = \"feature=+esi_ignore_https\",
[12] = \"-p\",
[13] = \"feature=+esi_disable_xml_check\",
[14] = \"-p\",
[15] = \"http_resp_hdr_len=42000\",
}
pthread.self = 0x7f0cb4433b38
pthread.name = (pool_herder)
pthread.attr = {
guard = 8192,
stack_bottom = 0x7f0cb4413000,
stack_top = 0x7f0cb4433b08,
stack_size = 133896,
}
thr.req = NULL
thr.busyobj = NULL
thr.worker = NULL
vmods = {
std = {0x7f0cb4e59e20, Varnish 7.1.1 7cee1c581bead20e88d101ab3d72afb29f14d87a, 0.0},
},
pools = {
pool = 0x7f0cb9f9b560 {
nidle = 1,
nthr = 5000,
lqueue = 2
},
pool = 0x7f0cb9f9b690 {
nidle = 0,
nthr = 5000,
lqueue = 2
},
},
I appreciate if someone can help me understand what can be potential source of this issue.
Firstly, the intention behind this panic is a good one - to improve your system's uptime by restarting the varnish worker process if something goes terribly wrong. The respective parameter is thread_pool_watchdog. The potential reasons for the watchdog barking are numerous: Your varnish worker did not run for some time, system resources are exhausted (in particular, memory), your vm got migrated. We really can't know.
Other than that, this could be a bug in the watchdog code, but we have not seen anything like that for long, so at this point I would be tempted to point to "other reasons".
On a related issue, the stack trace points to a potential issue in our panic handler, but that is secondary and non-causal.
@Vormillion can you please let us know which binaries you are using? If you built varnishd yourself, please make the binary available for download somewhere.
FWIW, the PAN__DumpStruct+0x110b offset looks like it was not actually happening in PAN__DumpStruct, because that function is just 0x2a0 bytes in length on my system.
On a related issue, the stack trace points to a potential issue in our panic handler, but that is secondary and non-causal.
pan_pool() is last in pan_ic() and looks like it has completed, so the top stack frame could just be the abort() in pan_ic() .