varnish-cache Setting backend timeouts to zero does not wait forever

According to the docs, setting certain timeouts like first_byte_timeout and between_byte_timeout to 0 causes them to disable and wait forever, but this is not the case.

https://github.com/varnishcache/varnish-cache/blob/f4dcf8f02a2fb059dacba4c7bedefccbe9897cd2/include/tbl/params.h#L248-L249

These timeouts zero out, but then get unzeroed here:

https://github.com/varnishcache/varnish-cache/blob/f4dcf8f02a2fb059dacba4c7bedefccbe9897cd2/bin/varnishd/cache/cache_session.c#L333-L334

VTC:

varnishtest "Check first_byte_timeout=0 works"

server s1 {
	rxreq
	delay 2
	txresp
} -start

varnish v1 -arg "-p first_byte_timeout=0" -vcl+backend {
} -start

client c1 {
	txreq
	rxresp
	expect resp.status == 200
} -run

Aug 14 '19 20:08 rezan

Lmk how you want to fix this, im working on some timeout stuff at the moment and can submit a PR. The problem is that a value of zero is special and forces Varnish to read the value from 1 level up in the config. So even if this is fixed, this is unusable from VCL. It might be best to not allow a value of zero at all and force the user to set it very high if they want to actually wait forever.

Aug 14 '19 20:08 rezan

Forever is a long time, should that really be the behaviour ?

Aug 19 '19 07:08 bsdphk

Bugwash consensus: 1msec is what we want for 0 timeouts.

DocFix wanted.

Aug 19 '19 11:08 bsdphk

This ticket was brought to my attention by @walid-git after he unsuccessfully tried to set a zero timeout on a backend definition, to realize that this zero value would simply not act as an override of the related parameter. I don't remember which of connect_timeout, first_byte_timeout and between_bytes_timeout, and it does not matter.

The point is that zero was used for both "no timeout as in non-blocking poll" and "timeout not set".

With #4041 we discussed a similar problem with the perception of unset req.ttl and req.grace and my suggestion was to use NAN to mean unset.

In #4043 I fixed what I considered a bug of a zero pipe_timeout meaning non-blocking instead of disabled. I also introduced a new pipe_task_deadline parameter for which zero must mean that the timeout is disabled to match the lack of timeout before the patch series.

I merged #4043 after a couple bugwash iterations, during which we discussed several aspects of the change, but never questioned the meaning of a zero value disabling the timeout. It went even further as #4043 formalizes in the documentation that zero disables all timeouts (except parameters with a non-zero minimum value).

Nobody remembered the consensus from this thread. I sincerely don't remember my personal opinion back then. I know that today I'd rather allow disabling timeouts with a zero value, or, have a timeout tweak that accepts a special value none or disabled to be explicit. This could map to NAN under the hood on the parameter side, and the VCL counterpart would be the unset action.

I will be the first to warn about disabling timeouts, but one of our principles states:

Provide mechanism, rather than policy.

So I don't see a reason to prevent timeouts from being disabled, in particular task deadlines and future timeouts that don't exist today.

Of course, depending on the outcome, I'm planning to address all the problems highlighted above.

Feb 09 '24 10:02 dridi

(edit: set never from vcl in an example)

Based on the discussion during last bugwash, here's a proposal to clarify the questions from the last comment:

For timeouts, we will always establish a hierarchy, as for example today with connect_timeout and the hierarchy vcl > backend > parameter.

So by this example, unset only applies to vcl and backend, parameters can not be unset. They need to provide some value to ultimately use, even if it is a default (which we already cater for with param.reset).

Besides "unset", we have two other values to clarify: the "zero" timeout and the "never" timeout. In this ticket, we agreed that "zero" would mean "a very small value", and it seems we do not want to change that. On the other end, it seems we want to have a proper definition of "never".

To summarize:

Timeouts receive three special values:

"0" with the internal value of a small epsilon like 1ms
"never" with the internal value inf() or some other huge number and
"unset" with the internal value nan(). Unset would not be valid for parameters.

Strawman A):

varnishadm param.set connect_timeout never
varnishadm param.set first_byte_timeout 0

The respective backend and vcl timeouts would be, by default, unset, so unless VCL contained overrides, we would wait for backend connects forever and expect the first byte with a timeout of 1ms.

Strawman B)

VCL:

backend foo {
  .connect_timeout = 10s;
}

sub vcl_backend_fetch {
  set bereq.backend = foo;
  set bereq.connect_timeout = never;
  # .... complex VCL here, then later
  unset bereq.connect_timeout;
}

The actual connect_timeout is 10s.

Feb 12 '24 15:02 nigoroll

Do we agree that we can't set never from VCL?

Feb 12 '24 16:02 dridi

The side effect of using 0 for "never time out" and NAN for unset was that we could disable a timeout from VCL. I suppose you can always set <some-timeout> = 1y; anyway.

Feb 12 '24 16:02 dridi

Do we agree that we can't set never from VCL?

My proposal was that you could. Sure, this would, for all practical circumstances, be identical to set ... = 1y, the idea here was not to invent anything new, but to hopefully reduce the confusion around the meaning of 0.

Feb 12 '24 16:02 nigoroll

I found one gotcha with the use of INF instead of 0 to disable a timeout: we can't represent infinity in JSON, for param.{show,set} -j commands. We could use null in that case, or the actual string "never".

I think my preference goes to "never".

Feb 26 '24 16:02 dridi

null is better in my opinion, it's less surprising to users and deserializers will thank you for not mixing types

Feb 26 '24 18:02 gquintard

But at the same time "never" would be a valid argument for param.set once the timeout tweak is specialized.

Feb 26 '24 19:02 dridi