varnish-cache icon indicating copy to clipboard operation
varnish-cache copied to clipboard

Varnish 7.1.1 panic - cache/cache_wrk.c:629

Open Vormillion opened this issue 3 years ago • 6 comments

Today during high load Varnish dropped all connections and panic.show displays:

Panic at: Tue, 22 Nov 2022 19:58:15 GMT
Wrong turn at cache/cache_wrk.c:629:
Worker Pool Queue does not move
version = varnish-7.1.1 revision 7cee1c581bead20e88d101ab3d72afb29f14d87a, vrt api = 15.0
ident = Linux,5.4.172-90.336.amzn2.x86_64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 1849122.075794 (mono), 1669147095.298170 (real)
Backtrace:
  ip=0x55da1c2b8d30 sp=0x7f0cb4433870 <PAN__DumpStruct+0x110b>
  ip=0x55da1c314ab7 sp=0x7f0cb44339b0 <VAS_Fail+0x4d>
  ip=0x55da1c2d3da2 sp=0x7f0cb44339f0 <pool_herder+0x66>
  ip=0x7f0cba16b1f5 sp=0x7f0cb4433ac0 <pthread_exit+0x280>
errno = 110 (Operation timed out)
argv = {
  [0] = \"varnishd\",
  [1] = \"-F\",
  [2] = \"-f\",
  [3] = \"/data/varnish.vcl\",
  [4] = \"-s\",
  [5] = \"malloc,4096m\",
  [6] = \"-p\",
  [7] = \"default_ttl=3600\",
  [8] = \"-p\",
  [9] = \"default_grace=3600\",
  [10] = \"-p\",
  [11] = \"feature=+esi_ignore_https\",
  [12] = \"-p\",
  [13] = \"feature=+esi_disable_xml_check\",
  [14] = \"-p\",
  [15] = \"http_resp_hdr_len=42000\",
}
pthread.self = 0x7f0cb4433b38
pthread.name = (pool_herder)
pthread.attr = {
  guard = 8192,
  stack_bottom = 0x7f0cb4413000,
  stack_top = 0x7f0cb4433b08,
  stack_size = 133896,
}
thr.req = NULL
thr.busyobj = NULL
thr.worker = NULL
vmods = {
  std = {0x7f0cb4e59e20, Varnish 7.1.1 7cee1c581bead20e88d101ab3d72afb29f14d87a, 0.0},
},
pools = {
  pool = 0x7f0cb9f9b560 {
    nidle = 1,
    nthr = 5000,
    lqueue = 2
  },
  pool = 0x7f0cb9f9b690 {
    nidle = 0,
    nthr = 5000,
    lqueue = 2
  },
},

I appreciate if someone can help me understand what can be potential source of this issue.

Vormillion avatar Nov 25 '22 15:11 Vormillion

Firstly, the intention behind this panic is a good one - to improve your system's uptime by restarting the varnish worker process if something goes terribly wrong. The respective parameter is thread_pool_watchdog. The potential reasons for the watchdog barking are numerous: Your varnish worker did not run for some time, system resources are exhausted (in particular, memory), your vm got migrated. We really can't know.

Other than that, this could be a bug in the watchdog code, but we have not seen anything like that for long, so at this point I would be tempted to point to "other reasons".

On a related issue, the stack trace points to a potential issue in our panic handler, but that is secondary and non-causal.

nigoroll avatar Nov 28 '22 14:11 nigoroll

@Vormillion can you please let us know which binaries you are using? If you built varnishd yourself, please make the binary available for download somewhere. FWIW, the PAN__DumpStruct+0x110b offset looks like it was not actually happening in PAN__DumpStruct, because that function is just 0x2a0 bytes in length on my system.

nigoroll avatar Nov 28 '22 15:11 nigoroll

On a related issue, the stack trace points to a potential issue in our panic handler, but that is secondary and non-causal.

pan_pool() is last in pan_ic() and looks like it has completed, so the top stack frame could just be the abort() in pan_ic() .

nigoroll avatar Nov 28 '22 15:11 nigoroll