
VSL client panics on AN(keep) in vslq_next()

Open slimhazard opened this issue 4 years ago • 33 comments

An application that uses the VSL API has intermittently experienced panics in production due to this assertion in vslq_next(), called by VSLQ_Dispatch().

The site is actually running Varnish 6.6, but the relevant part of vslq_next() appears to be unchanged in master:

        vtx = vtx_lookup(vslq, vxid);
        keep = tag != SLT_Begin || vslq_candidate(vslq, ptr);
        if (vtx == NULL && tag == SLT_Begin && keep) {
                vtx = vtx_add(vslq, vxid);
                AN(vtx);
        }
        if (vtx != NULL) {
                AN(keep);
                r = vtx_append(vslq, vtx, &c->rec, len);
                if (r == vsl_more)
                        vtx_scan(vslq, vtx);
        }

It seems possible to me that keep may be zero in the vtx != NULL clause if tag is SLT_Begin and vslq_candidate() returned false. The VSL client uses request grouping (VSL_g_request), which narrows down the possible reasons for vslq_candidate() to return false, none of which seem to be out of the question, as far as I can tell.

The panic is rare. The client reads millions of transactions per day, and the problem has been observed a handful of times in recent months. Unfortunately I don't know a way to reproduce it.

I don't see an obvious way for the VSL client to avoid this situation. The motivations for the assertion are not quite clear to me, but if keep == 0 indicates an intractable problem, would it be possible for vslq_next() (and hence VSLQ_Dispatch()) to return an error status when it happens, rather than an assertion? A panic leads to a fairly significant loss of data, whereas an error for the transaction would only lead to loss of data for that transaction.

slimhazard avatar Jan 17 '22 11:01 slimhazard

A possible clue is that in the most recent incident, the logs show the panic happening immediately after the client got two "duplicate link" errors from VSL.

slimhazard avatar Jan 17 '22 11:01 slimhazard

Edited because I stupidly was considering the sense of AN() backward (the problem is that keep is 0, not that it's non-zero).

slimhazard avatar Jan 17 '22 13:01 slimhazard

Happy new year @slimhazard!

I think I understand what's happening here: the assertion rests on the wrong assumption that transactions are only ever added where vslq_candidate() is called. Normally we see the Begin tag before the Link one, so that's probably why it is so rare.

A possible clue is that in the most recent incident, the logs show the panic happening immediately after the client got two "duplicate link" errors from VSL.

This on the other hand is a bit more worrying.

Edited because I stupidly was considering the sense of AN() backward (the problem is that keep is 0, not that it's non-zero).

I assumed you confused AN() with AZ().

The fix is very simple: remove AN(keep);. I'll try to find a sequence of events for a reproducer.
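
That is, the second clause of the excerpt above would just become (a sketch of the change, not a reviewed patch):

        if (vtx != NULL) {
                /* AN(keep) dropped: a vtx created earlier via SLT_Link may
                 * see an SLT_Begin that vslq_candidate() would not keep */
                r = vtx_append(vslq, vtx, &c->rec, len);
                if (r == vsl_more)
                        vtx_scan(vslq, vtx);
        }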

dridi avatar Jan 17 '22 13:01 dridi

Happy new year @Dridi!

Normally we see the Begin tag before the Link one so that's probably why it is so rare.

The site is using the pesi VDP for parallel ESI, which might bring that about (although it's still unusual).

This on the other hand is a bit more worrying. I assumed you confused AN() with AZ().

Very worrying indeed, and no better explanation than brain fog. ("assert not null", so it asserted because it's not null ... um, duh.)

slimhazard avatar Jan 17 '22 13:01 slimhazard

I may have misunderstood again. I couldn't say for certain that the duplicate links are related to the assertion failure, they might have just coincidentally happened right before the crash. Of course that may be cause for concern about how VSL is working. But the effect for the VSL client is much less serious. It can cope with errors returned by VSLQ_Dispatch() and move on, maybe losing a bit of data, but that's negligible and tolerable compared to all of the log transactions it handles correctly.

slimhazard avatar Jan 17 '22 13:01 slimhazard

No offense meant, but I find this specific point less worrying if it stems from your vmod_pesi, there's a chance that the duplicate link comes from there.

dridi avatar Jan 17 '22 13:01 dridi

I forgot to ask the obvious: what does your VSL API usage look like? Which options do you use? A varnishlog command equivalent with the proper dispatch options is probably enough:

varnishlog -g foo -q bar -[bcE] -[...]

dridi avatar Jan 17 '22 13:01 dridi

I have not fully devoured this ticket, but FTR, I do not think there is anything wrong in pesi, but if anyone sees any issue, please let me know. Thank you. https://code.uplex.de/uplex-varnish/libvdp-pesi/blob/master/src/vdp_pesi.c#L346

nigoroll avatar Jan 17 '22 14:01 nigoroll

@Dridi, the equivalent varnishlog flags are:

-c -I 'VCL_log:^foo ' -I Timestamp:^Resp: -g request -q 'VCL_log ~ "^foo "' -i VSL

The VCL_log contents are set in VCL, and start with a string that says "this is for the log reader" (the string isn't really "foo"), hence the query. The include filters limit the contents of the transactions to those log lines, the timestamp, and VSL diagnostics.
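
Roughly, those options map onto the VSL API like this (a minimal sketch, not the app's actual code; "foo" again stands in for the real prefix, handle_transactions stands in for the real callback, and error handling is omitted):

#include <unistd.h>

#include <vapi/vsm.h>
#include <vapi/vsl.h>

static int
handle_transactions(struct VSL_data *vsl,
    struct VSL_transaction * const trans[], void *priv)
{
        (void)vsl; (void)trans; (void)priv;
        return (0);     /* 0: keep dispatching */
}

int
main(void)
{
        struct vsm *vsm = VSM_New();
        struct VSL_data *vsl = VSL_New();
        struct VSL_cursor *c;
        struct VSLQ *vslq;
        int status;

        VSM_Attach(vsm, -1);                            /* default instance */

        VSL_Arg(vsl, 'c', NULL);                        /* -c */
        VSL_Arg(vsl, 'I', "VCL_log:^foo ");             /* -I VCL_log:^foo  */
        VSL_Arg(vsl, 'I', "Timestamp:^Resp:");          /* -I Timestamp:^Resp: */
        VSL_Arg(vsl, 'i', "VSL");                       /* -i VSL */

        c = VSL_CursorVSM(vsl, vsm, VSL_COPT_BATCH | VSL_COPT_TAIL);
        vslq = VSLQ_New(vsl, &c, VSL_g_request,         /* -g request */
            "VCL_log ~ \"^foo \"");                     /* -q ... */

        for (;;) {
                status = VSLQ_Dispatch(vslq, handle_transactions, NULL);
                if (status == 0)
                        usleep(10000);  /* nothing new in the log */
                else if (status < 0)
                        break;          /* overrun/abandon/io handled here */
        }
        return (0);
}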

slimhazard avatar Jan 17 '22 14:01 slimhazard

I do not think there is anything wrong in pesi

@nigoroll I don't think so either, and I don't think the conversation so far points to that. pesi can make it more common for Begin and Link to appear in the opposite of the order that's typical with serial ESI. But as I understand it, VSL is intended to deal with that, and does, since it can happen in other contexts as well, depending on factors such as when log contents are flushed.

slimhazard avatar Jan 17 '22 14:01 slimhazard

FYI, the "duplicate link" error is seen by the VSL client without panics. I've seen it before, I believe before we ever started with pesi.

OTOH I just noticed to my surprise that there are two panics reported in the logs. The second one is also caused by the same assertion, and "duplicate link" is reported immediately before that (just one this time).

So we know that "duplicate link" does not necessarily lead to the panic. But two panics with "duplicate link" just before them suggest that the problems are related, maybe a common underlying cause.

About the SLT_VSL logs -- the VSL client reports their contents in its own log, increments a counter, and continues the loop that calls VSLQ_Dispatch(). There is no other error compensation.
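
In VSL API terms that handling is about as minimal as it sounds; a sketch (not the app's actual code) that fleshes out the handle_transactions stub from the sketch above:

#include <stdio.h>

#include <vapi/vsl.h>

static unsigned long vsl_diag_seen;

static int
handle_transactions(struct VSL_data *vsl,
    struct VSL_transaction * const trans[], void *priv)
{
        struct VSL_transaction *t;
        int i;

        (void)priv;
        for (i = 0; (t = trans[i]) != NULL; i++) {
                while (VSL_Next(t->c) == 1) {
                        if (!VSL_Match(vsl, t->c))
                                continue;
                        if (VSL_TAG(t->c->rec.ptr) == SLT_VSL) {
                                /* report the diagnostic, bump a counter,
                                 * and carry on */
                                fprintf(stderr, "VSL: %.*s\n",
                                    (int)VSL_LEN(t->c->rec.ptr),
                                    VSL_CDATA(t->c->rec.ptr));
                                vsl_diag_seen++;
                                continue;
                        }
                        /* ... normal record processing ... */
                }
        }
        return (0);     /* keep dispatching */
}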

Originally I was hoping to eliminate all causes of SLT_VSL errors, but we eventually decided that it's futile. The site has large ESI trees, which with request grouping keep a lot of log transactions open until all of them complete. If one of them includes a long backend fetch, eventually the transaction limits are reached. That gets you "store overflow" and "link too late" in SLT_VSL. We have increased the -L and -T parameters a number of times, but the maxima still get hit.

Looking at stats recorded in the app's logs just now, I saw: 1.25 billion transactions read since startup, and SLT_VSL seen 91,000 times - 0.007%. Most of the 91,000 came in a single spike, almost all of them "store overflow". The low percentage is typical for the app, so we've taken it as an acceptable loss.

slimhazard avatar Jan 17 '22 15:01 slimhazard

Could this scenario, based on the vxid wrap hypothesis, happen?

  • L_opt or T_opt limit is reached (store overflow)
    • An ESI parent vtx is forced complete
  • A sub-request's Begin record is found and link too late is issued because the parent is complete (previous step)
  • Because the link is never established, the sub-request's vtx stays alive after the parent's vtx is retired
  • time passes, vxid wraps
  • a request with the orphaned vtx's vxid comes in and, because of request grouping, vslq_candidate() decides to keep the transaction when SLT_Begin is seen.
  • https://github.com/varnishcache/varnish-cache/blob/5598410656eaa2149d31af8fd421faea5a6f45f8/lib/libvarnishapi/vsl_dispatch.c#L1372 is hit

nigoroll avatar Jan 24 '22 12:01 nigoroll

@nigoroll here are the SLT_VSL messages in the application log in which two panics are recorded (both due to the same assertion):

There is a spike of "store overflow" and "link too late" at an estimated 119 hours before the first panic. "Estimated" because the log doesn't have timestamps, but it does have monitoring output that I know is logged every 30 seconds, so we can make inferences from the line numbers.

After that there are two "illegal link vxid" messages, at about 115 and 35 hours before the panic.

The next VSL errors are the two "duplicate links" immediately before the first panic, as mentioned above.

The second panic happened about 36 hours after the first one. Just after the first panic and app restart, "duplicate link" and "duplicate begin" appear a few times, but I suspect that has something to do with the app restarting in the midst of the log, so to speak.

Between the restart and the second panic, the only VSL error messages are a group of 15 "duplicate links" and "duplicate begins" about an hour and 15 minutes before the panic, and, as mentioned above, another "duplicate link" immediately before the panic.

After the second panic, there are a couple of "illegal link vxid" messages before the log ends, I believe due to regular log rotation.

@nigoroll could the scenario you have in mind be signaled by "duplicate link" and "duplicate begin"? If "store overflow" and "link too late" have anything to do with the problem, they were not observed as SLT_VSL messages at any time close to either of the panics, none at all between the first and second panic. In contrast, "duplicate link" and "duplicate begin" appear shortly before both of the panics.

slimhazard avatar Jan 24 '22 12:01 slimhazard

Edited because I believe I was doing the math from the line numbers wrong, dividing when I should have multiplied. With the corrected computation, the timings of these events were much further apart. This reconstruction of time from log lines is trickier than I thought and I might still have it wrong, but we can be sure of the order of events.

slimhazard avatar Jan 24 '22 13:01 slimhazard

I think the other duplicate messages could also be symptoms of the orphaned vtx, yes.

nigoroll avatar Jan 24 '22 13:01 nigoroll

I have been pondering possible ways to detect and delete the postulated orphan vtxes. I think one way would be to count the total number of Begin records seen (as a 64-bit counter) and associate that number with each vtx. This way we can set a threshold in terms of vxids seen; see the sketch below. We might also want to switch from the current 30-bit to 62-bit vxids, in which case we would only hit duplicate vxids after a varnishd restart (which we already detect differently).
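
Very roughly, something like this (made-up names, not vsl_dispatch.c code, just to illustrate the idea):

#include <stdint.h>

static uint64_t begin_seen;             /* total SLT_Begin records seen */

struct vtx_stamp {
        uint64_t        begin_seq;      /* begin_seen at vtx_add() time */
};

/* illustrative threshold: roughly the whole 30-bit vxid space, so the
 * vxid numbering must have wrapped since this vtx was added */
#define VTX_ORPHAN_AGE  (1ULL << 30)

static int
vtx_looks_orphaned(const struct vtx_stamp *vs)
{
        return (begin_seen - vs->begin_seq > VTX_ORPHAN_AGE);
}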

nigoroll avatar Jan 24 '22 15:01 nigoroll

It turns out that the panic appears to be happening more frequently than we realized. There has been an average of approximately one coredump per day during the last two weeks (14 from January 12th to 24th). The occurrences are irregular -- one day had three coredumps on three different systems, today there were two on one system.

Relatively speaking, that's still a small number compared to the total number of transactions seen by the VSL client (millions per day). But considerably less rare than I thought at first.

slimhazard avatar Jan 24 '22 16:01 slimhazard

I believe the sequence of events @nigoroll describes in https://github.com/varnishcache/varnish-cache/issues/3775#issuecomment-1020029452 is a plausible explanation for the behaviour described here.

It should be noted that the parent <-> child vtx relationships in the log can be created from either end. The parent creates children through SLT_Link, and the children establish parents through SLT_Begin. This is so that it works regardless of the order in which the child or parent log records are observed. (The entire dispatch system is written to be fault tolerant as a debugging aid, maybe to a fault.) For the events to give this problem though, I believe it is necessary that the parent ends up skipping the SLT_Link records due to being force evicted, so that the relationship is established once the child's SLT_Begin record is processed.

I'm not sure how to resolve this, though. One thing is that the assert itself is wrong, making assumptions about the order in which things were observed that it cannot make. But merely "ignoring" it would end up causing the child vtx to have to time out before being released, which holds buffers longer than needed.

Another approach would be to remember force-evicted ids to help resolve it faster. But that raises the question of how long they should be remembered.

mbgrydeland avatar Jan 31 '22 14:01 mbgrydeland

@mbgrydeland thank you for your feedback. I think I reached pretty much the same conclusion as you. Meanwhile, I have written some code for vmod_vtc to generate arbitrary vsl (see 0c96fc6597c3ba6da057bb43acb7f677e6b776f8), but had to stop and switch tasks while under the impre^Willusion of having got close to reproducing the issue. Regarding a solution, I currently think it might be possible to handle transactions with missing children similarly to incomplete ones.

nigoroll avatar Jan 31 '22 14:01 nigoroll

We have new logs of the panic with the same cause, now with timestamps. Again, a series of SLT_VSL diagnostics appears within the ten seconds before the panic:

  • 10 duplicate link messages
  • 12 duplicate begin
  • 35 link too late, 11 of them in the same second as the panic

@nigoroll that fits very well with your hypothesis above; this time the SLT_VSL messages match up with what you expected.

slimhazard avatar Feb 02 '22 10:02 slimhazard

FYI: I am back at this issue and have spent most of the day today just to realize that I was over-confident: I had expected a vtc reproducer to be easy with the additions I have committed, but I have not yet succeeded. So obviously I must be doing something wrong and/or have an incomplete understanding of the problem.

nigoroll avatar Feb 07 '22 19:02 nigoroll

I am (almost) out of ideas. The short version: I can show that the assertion failure can be caused by a transaction being vtx_force()d and, consequently, not being completed for lack of an SLT_Begin. But I cannot explain why such an incomplete transaction would not be subject to a timeout.

Yet I do have a new hypothesis.

The long version:

This vtc shows the assertion we are looking for, and a similar scenario (not given here) can be used to trigger the link too late which @slimhazard is reporting:

varnishtest "Test VSL esi subrequests vs. vtx limit"

varnish v1 -vcl {
	backend proforma none;
} -start

logexpect l1 -v v1 -g request -m -L 4 {
# wait forever
	expect *  999 End
} -start

varnish v1 -vcl {
	import vtc;

	backend proforma none;

	sub vcl_init {
		vtc.vsl_replay(s="""
1001 Begin           c req 1000 rxreq

# vtx_force() out 1001
1 Begin           c req 1000 rxreq
2 Begin           c req 1000 rxreq
3 Begin           c req 1000 rxreq
4 Begin           c req 1000 rxreq

# 1001 logging continues as normal, but is already pushed out, so it
# gets vtx_add()ed again as VSL_t_unknown, waiting for an SLT_Begin
1001 Link            c req 1002 esi
1001 End             c

1002 Begin           c req 1001 esi
1002 ReqURL          c /
1002 End             c

3 End c

# at this point:
# n_outstanding: 4
# incomplete list: 4, 1001
# tree: 3, 4, 1001, 1002
#
# -> 1002 is complete on tree

4 End c

# time passes, vxid wraps
1002 Begin c sess 0
""");
	}
}

The failure is (the line number differs because of debugging I have added locally):

*    diag  0.0 Assert error in vslq_next(), vsl_dispatch.c line 1400:
*    diag  0.0   Condition((keep) != 0) not true. (errno=0 Success)

how it works:

  • Most importantly, I hope that everyone will confirm that transactions 1001 and 1002 are valid in terms of the generated logs. 1001 has Begin, Link and End, 1002 has Begin, ReqURL and End. @slimhazard has confirmed that the issue happens with no overruns, so it is important that a reproducer shows the issue without lost records.
  • Transaction 1001 is begun and immediately forced out by four other transactions.
  • The Link record recreates it with SLT_Begin missing, so when we hit 1001 End, it will not be completed
  • 1002 gets completed, but that will not make 1001 ready, so both transactions stay on the tree.
  • When 1002 gets reused for a different record, we hit our assertion failure.

Now this test case lacks one important detail: In the real world, the time passing before the vxid clash should be considerably longer than the transaction timeout (vut -T argument) such that the incomplete transaction 1001 should time out and be vtx_force()d complete. This, in turn, should make 1001+1002 ready before a new 1002 arrives.

I stared at the code a lot and fail to see how anything like this could be happening, unless @slimhazard's VSL client had a bug that applied a timeout longer than the vxid wraparound time.

As I have (almost) run out of ideas, I would like to share some hypotheses which I looked into, in the hope that these might help spark a eureka moment with someone else:

  • Could it happen that all linked transactions in a tree become complete without the root transaction becoming ready? I cannot see how this could happen; I think vtx_mark_complete() is correct.
  • Could we somehow create a vtx loop? Yes, I think so, but only if the emitted logs were wrong to begin with or, again, a transaction was not subject to the timeout.

This brings us to my only remaining hypothesis: Could vxids be reused within a reasonable VSL timeout of, say, 5 minutes?

This seems possible if workers stayed alive for a long time with prolonged periods of inactivity: When a worker needs a vxid for the first time, it acquires a pool of size vxid_chunk = 32768 in VXID_Get(). If the worker does not get destroyed, it holds the pool indefinitely, so if it gets reused after a vxid wrap, a clash is possible.
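
A hypothetical simplification of that mechanism (not the actual VXID_Get() code, and ignoring locking), just to illustrate why a long-idle worker can still hold pre-wrap vxids:

#include <stdint.h>

#define VXID_CHUNK      32768
#define VXID_MASK       ((1U << 30) - 1)

static uint32_t global_vxid;            /* advances by whole chunks */

struct vxid_pool_sketch {
        uint32_t        next;           /* next vxid to hand out */
        uint32_t        count;          /* vxids left in this worker's chunk */
};

static uint32_t
vxid_get_sketch(struct vxid_pool_sketch *p)
{
        if (p->count == 0) {
                /* refill: grab a fresh chunk from the global counter;
                 * a worker that rarely needs vxids may keep its old
                 * chunk across a full wrap of global_vxid */
                p->next = global_vxid;
                p->count = VXID_CHUNK;
                global_vxid = (global_vxid + VXID_CHUNK) & VXID_MASK;
        }
        p->count--;
        return (p->next++ & VXID_MASK);
}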

This brings us back to the point that we probably should go to 64 bit vxids. Or maybe we could time out the vxid pools?

If this new hypothesis was correct, the issue should occur less frequently if thread_pool_min was chosen such that threads actually got destroyed before a vxid wrap could happen. This is contrary to the recommendation to always keep the worker reserve very high in order to minimize overhead at load spikes.

But this hypothesis could also be wrong; it is perfectly possible that, even after staring at the code and trying out hypotheses for almost 20 hours last week alone, I still did not see a problem in the vsl code.

nigoroll avatar Feb 13 '22 17:02 nigoroll

In the real world, the time passing before the vxid clash should be considerably longer than the transaction timeout (vut -T argument) such that the incomplete transaction 1001 should time out and be vtx_force()d complete. [...] I stared at the code a lot and fail to see how anything like this could be happening, unless @slimhazard s VSL client had a bug to apply a timeout longer than the vxid wraparound time.

Hmm, I've taken another look and will keep looking. If you can say more about the kind of buggy behavior you have in mind, it would certainly help.

The VSL client doesn't have timeouts in the main loop that calls VSLQ_Dispatch() and operates on its results. On a number of error conditions, it discards data, in most cases calling VSLQ_Flush() to flush pending transactions. But all of that is logged, and we would have seen it in the logs.

The app flushes VSL transactions if:

  • VSM_Status() reports that the Varnish management process changed or was restarted.
  • the app process is terminated.
  • the VSL API reports any of the errors vsl_e_overrun, vsl_e_io, or vsl_e_abandon. (vsl_e_write doesn't happen, because the app doesn't use any of the output formatting functions.)
  • a flush was explicitly requested via a signal.

All of which is logged, and I haven't noticed it.

There are two kinds of errors that cause the app to discard data that it reads from the log: exhausted internal memory pools, or log contents that can't be parsed. That just means the log line in question is not added to the complete record that is forwarded over the network. It shouldn't impact VSL, and at any rate we haven't seen evidence of errors like that in the log (at least not at any time close to the panics).

If the app sees an SLT_VSL record, it logs its contents and does nothing further. Those are the cases described above.

@nigoroll let me try to re-state your hypothesis in my own words, to see if I've understood:

  • A worker gets a pool of 32768 VXIDs.
  • That worker stays idle while the VXIDs wrap all the way around, all 2^30 of them (approximately a billion).
  • The worker gets used again, re-using a VXID that was recently forced out due to timeout or the -L limit, triggering the panic.

Is that right?

If so, it sounds pretty far-fetched, but I couldn't rule it out. Unfortunately I don't have a sense of how well workers are utilized (if there's a VSC metric that helps to understand that, I don't know it). Some time ago I estimated, from the requests per day per Varnish at the site, that the vxids might wrap around a billion in about three or four days. It may have gone up since then.
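
Back of the envelope, to put a number on it (the rate below is only an assumption derived from that estimate):

#include <stdio.h>

int
main(void)
{
        const double vxid_space = 1 << 30;      /* ~1.07 billion vxids */
        const double vxids_per_sec = 3500.0;    /* assumed average rate */

        /* consistent with the three-to-four-day estimate above */
        printf("vxids wrap after ~%.1f days\n",
            vxid_space / vxids_per_sec / 86400.0);
        return (0);
}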

We've had reports that the panics have been happening recently about once a day (very roughly). On the hypothesis, that would suggest that VXIDs wrap around approximately within a day, meaning that traffic is now 3 to 4 times as much as it was. Which I doubt, but we can ask. And it would suggest further that a worker got the VXID chunk, and then was idle most of the day. None of that is out of the question, but it would make this bug a potential all time winner in the weirdness category.

slimhazard avatar Feb 13 '22 18:02 slimhazard

re @slimhazard

If you can say more about the kind of buggy behavior you have in mind, it would certainly help.

Anything which caused (struct vsl).T_opt to be set to an astronomical value outside the region of, say, some hours. The default is 120 seconds, so the option would have to be set explicitly to a higher value.

To be clear, I am not saying that your code was buggy at all. But as you said, we should be sure about this aspect.

  • A worker gets a pool of 32768 VXIDs.

  • That worker stays idle while the VXIDs wrap all the way around, all 2^30 of them (approximately a billion).

  • The worker gets used again, re-using a VXID that was recently forced out due to timeout or the -L limit, triggering the panic.

On the second point: The worker would not need to stay fully idle all the time. All that matters is that, over time, it uses so few vxids that it still has an "old" vxid cache when vxids wrap and another thread gets the same vxid cache assigned.

On the third point: For this hypothesis (the same vxid space used by more than one worker in parallel), we do not even need to postulate vtxen being forced out, because basically all assumptions about the referential integrity of the log will break. And actually I had a bit of a hard time explaining all the different errors we are seeing under the initial hypothesis (in particular the duplicate begin), but with this hypothesis they would all occur with high probability.

Note that this scenario is not even limited to two workers, over time, more workers could accumulate with conflicting vxid caches.

We've had reports that the panics have been happening recently about once a day (very roughly). On the hypothesis, that would suggest that VXIDs wrap around approximately within a day, meaning that traffic is now 3 to 4 times as much as it was. Which I doubt, but we can ask. And it would suggest further that a worker got the VXID chunk, and then was idle most of the day. None of that is out of the question, but it would make this bug a potential all time winner in the weirdness category.

IIUC, the site has relevant load spikes (user traffic, deployments), plus they are using pesi, which causes more worker spikes by design.

Also the new hypothesis only postulates that the issue will not occur before vxids have wrapped at least once. Other than that, there does not need to be any correlation with the vxid wrap frequency.

nigoroll avatar Feb 14 '22 09:02 nigoroll

Anything which caused (struct vsl).T_opt to be set to an astronomical value outside the region of, say, some hours. The default is 120 seconds, so the option would have to be set explicitly to a higher value.

T_opt is configurable, and is set to 300 (5 minutes). The configuration is the only way to affect T_opt. FWIW L_opt is set to 15000.

Hypothesis: got it. I'm wavering between "yeah, but that would be wild" and the thought that there never really was any guarantee that an existing VXID pool wouldn't still be around by the time VXIDs wrap all the way around. With enough traffic and a large enough thread pool, what prevents it? But if so, it makes me wonder why we're seeing this with increasing frequency in recent weeks.

Yes, the site does get occasional load spikes. I don't have the impression that they're out of the ordinary for a typical high-traffic Varnish deployment. But they might be enough.

slimhazard avatar Feb 14 '22 09:02 slimhazard

FTR 2^62 is about 4.5 quintillion, give or take a hundred quadrillion. Of course that many VXIDs would eventually wrap around as well, if you leave Varnish running until the Sun goes red giant.

slimhazard avatar Feb 14 '22 10:02 slimhazard

New and very surprising information: the admins at the site thought that the increased number of coredumps (very approximately once per day) might have begun around the time when the VCL workaround for VSV8 was implemented -- set Connection:close in the client response if the request was HTTP/1 and had a request body. They were still running with the workaround until last Wednesday (today is Monday), when they upgraded to 6.6.2.

In the 5 days since then, the VSL client app has not panicked. They have their highest traffic on weekends, especially Sunday evenings, so this has been the toughest test in terms of production load.

@nigoroll and I were certain that an upgrade to 6.6.2 would not resolve the problem. I cannot imagine how the VCL workaround for VSV8 could have brought it about. I would still like to find out if it stays that way for a few more days, to be sure that we have a significant finding (and because it's so hard to believe). But I wanted to add this info before today's bugwash.

If the upgrade to 6.6.2 has really eliminated the problem, then the production issue is resolved. I assume that we'll still want to fix the issue, since nothing about what the workaround did should cause a VSL client to crash. Maybe we'll have a new way to explain and test it, in light of this new and bizarre information. The candidate bugfix might be necessary in any case.

slimhazard avatar Mar 07 '22 09:03 slimhazard

The site admins have reported that after another weekend, when the request load at the site peaks, there have still been no coredumps by the VSL client. By now we have significant evidence that the problem was eliminated with the upgrade to 6.6.2, and the rollback of the VCL workaround for VSV8. As I understand it, the VSL client hasn't had any panics at all since then.

With that, the production issue is resolved. The question remains as to how to handle this issue for the Varnish project. In my view, it still ought to be addressed, because evidently there's a lurking problem here that hasn't really gone away, and might show itself again in the future.

Although it's completely bewildering, it appears that the VCL workaround ultimately led to the problem. I think that @nigoroll's hypothesis of VXID wraparound and conflicts may be correct, because VXIDs are used for sessions as well, and the VCL workaround led to more sessions. Since it sets Connection:close in responses to requests with a request body, there was less session re-use for further requests, which leads to more new sessions and more VXID consumption.

I don't have exact figures, but it's my understanding that POST requests are relatively significant at the site. It's online retail, so requests with a body are used for sending orders, adding items to wish lists, updating customer info, and so forth. POSTs are surely not the majority of requests, but I don't believe that they are a negligibly small percentage. So I think that the increased use of sessions and hence VXIDs was quite likely. Coredumps didn't happen immediately after new VCL deployment; significant time passed until that happened, which is also consistent with the hypothesis.

The move to 64-bit VXIDs might be enough to resolve the problem. Knowing that the VSV8 workaround in VCL was related to it, astonishing as it seems, suggests ways to test hypotheses and find out for sure what the real culprit was.

slimhazard avatar Mar 16 '22 11:03 slimhazard

This sounds like the assertion is not only correct, it is also warning us about vxid churn.

dridi avatar Mar 16 '22 16:03 dridi

After over three weeks without any core dumps, the site admins reported that two of them happened again today, both caused by the same assertion. This time it was during a load test (fortunately not in production).

They said that this happened while one of the backends was performing poorly, leading to many open backend connections and slow response times.

Obviously, "consistent with a hypothesis" is not the same thing as proving the cause of a problem. That said, IMO this is consistent with @nigoroll's hypothesis that VXID wraparound and conflict led to the panics. The story would go like this:

  • The VSV8 workaround in VCL led to higher use of session VXIDs, and hence relatively frequent VXID duplicates that led to the assertion.
  • After a VSV8 fix version was installed and the VCL workaround was rolled back, the panics stopped happening for several weeks. Then during a load test, many connections were opened for a stressed backend, which increased VXID churn and again caused the panic.

slimhazard avatar Mar 31 '22 18:03 slimhazard