trafficserver

Stop write_fail actions from going to origin if parentage with go_direct=false

Open ezelkow1 opened this issue 4 years ago • 3 comments

As currently written, I believe the write_fail action code allows child caches to bypass their parents when they cannot obtain a write lock on an object. This usually happens during a thundering herd event when an upstream origin has issues: one request is already waiting on a response and no headers have been received, so there is no object in cache, but the write lock is held.

This prevents RWR (read-while-writer) from coalescing requests, since there are no headers. There are already options to return a failure or stale content in these instances; however, the default option appears to allow child caches to bypass their parents completely and go directly to the origin, even if their parent line specifies go_direct=false. The write lock failure handling should take parentage/go_direct into account, since right now this is the only method I know of that will completely bypass parents.
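To make the request concrete, here is a minimal sketch of the decision being asked for once write lock retries are exhausted with the default open_write_fail_action. It is not the ATS implementation: `NextHop`, `RouteContext` and `next_hop_after_write_lock_failure` are hypothetical names used only for illustration, and the sketch assumes go_direct keeps its usual parent.config meaning (going direct is allowed only when no parent is usable).

```cpp
#include <cassert>

// Hypothetical types for illustration; none of these are ATS identifiers.
enum class NextHop { Parent, Origin, ErrorResponse };

struct RouteContext {
  bool has_parent_rule;  // a parent line matched this request
  bool parent_available; // at least one matching parent is not marked down
  bool go_direct;        // go_direct value on the matched parent line
};

// Proposed routing after all write lock attempts have failed:
// parent selection is still honoured instead of being skipped.
inline NextHop next_hop_after_write_lock_failure(const RouteContext &ctx)
{
  if (!ctx.has_parent_rule) {
    return NextHop::Origin; // no parentage configured, direct is expected
  }
  if (ctx.parent_available) {
    return NextHop::Parent; // keep the request inside the hierarchy
  }
  // No usable parent: only go direct if the parent line allows it.
  return ctx.go_direct ? NextHop::Origin : NextHop::ErrorResponse;
}

// The case from this issue: a parent exists, is up, and go_direct=false.
inline void example_issue_case()
{
  RouteContext ctx{true, true, false};
  assert(next_hop_after_write_lock_failure(ctx) == NextHop::Parent);
}
```

The point of the sketch is only that the write lock failure path should consult the same parent/go_direct inputs as a normal cache miss, rather than falling straight through to the origin.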

ezelkow1, Dec 14 '21 23:12

If possible, attach the ATS configs and HTTP message captures with last hop and next hop for when this occurs, or an Au test.

ywkaras, Dec 15 '21 18:12

@traeak may have some logs, later, since he can reproduce this in a lab. I believe it comes down to this section, https://github.com/apache/trafficserver/blob/master/proxy/http/HttpTransact.cc#L3269

In the else case there, we have exhausted all write lock attempts, at which point I believe how_to_open_connection will just send the state machine (sm) directly to the origin, bypassing parents. It's just an educated guess at this point, though.

This is with proxy.config.http.cache.open_write_fail_action INT 0. To reproduce, you need to generate contention on the write lock so that it fails, by having an origin stall upstream of a parent while still accepting connections, so that parentage will not mark down the origin. This creates a scenario where parents cannot mark the origin down even though no data is received: requests stack up, run through read_retries, then write_retries, and eventually go to the origin.
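The sequence described above can be modelled with a small trace function. This is a toy model of the reported behaviour, not ATS code: `RetryLimits`, `trace_contended_request` and the step strings are hypothetical, and the model assumes the write lock is never released and no object ever becomes readable for the whole lifetime of the request.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model; the field and step names are illustrative only.
struct RetryLimits {
  int read_retries;  // cache read retry attempts before giving up
  int write_retries; // write lock retry attempts before giving up
};

// Returns the steps one request takes while another request holds the
// write lock and the stalled origin never sends headers.
inline std::vector<std::string> trace_contended_request(const RetryLimits &limits)
{
  std::vector<std::string> steps;
  for (int i = 0; i < limits.read_retries; ++i) {
    steps.push_back("read_retry"); // nothing readable, headers never arrived
  }
  for (int i = 0; i < limits.write_retries; ++i) {
    steps.push_back("write_retry"); // lock still held by the first request
  }
  // Reported behaviour with open_write_fail_action 0: fall through to origin.
  steps.push_back("go_to_origin");
  return steps;
}

// Example: two read retries and one write retry end at the origin.
inline void example_trace()
{
  const std::vector<std::string> steps = trace_contended_request({2, 1});
  assert(steps.back() == "go_to_origin");
}
```

Every request that arrives during the stall follows the same trace, which is why the load eventually lands on the origin instead of being held at the parent tier.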

ezelkow1, Dec 15 '21 18:12

This issue has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.

github-actions[bot], Apr 03 '24 15:04