
Improve client change counter management in HTTP Server View

capistrant opened this issue 3 years ago · 0 comments

Fixes #12999.

Description

The first change is to avoid executing resolveWaitingFuturesRunnable if ChangeRequestHistory#addChangeRequests is called with an empty list of changes. It is not obvious that this regularly happens anywhere in the code, but it makes sense to avoid resolving all of the waiting futures when there were no actual changes.
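A rough illustration of that guard (the field names and structure here are simplified, not the actual ChangeRequestHistory internals):

```java
// Simplified sketch of the first change: skip resolving waiting futures when the
// incoming change list is empty. Names mirror the general shape of
// ChangeRequestHistory but are illustrative only.
import java.util.List;
import java.util.concurrent.ExecutorService;

class ChangeHistorySketch<T>
{
  private final ExecutorService singleThreadedExecutor;
  private final Runnable resolveWaitingFuturesRunnable;

  ChangeHistorySketch(ExecutorService executor, Runnable resolveWaitingFuturesRunnable)
  {
    this.singleThreadedExecutor = executor;
    this.resolveWaitingFuturesRunnable = resolveWaitingFuturesRunnable;
  }

  public synchronized void addChangeRequests(List<T> requests)
  {
    if (requests.isEmpty()) {
      // No changes were appended, so the counter did not advance; there is nothing
      // for waiting futures to see, and resolving them would just produce empty deltas.
      return;
    }
    // ... append requests to the history and bump the counter (elided) ...
    singleThreadedExecutor.execute(resolveWaitingFuturesRunnable);
  }
}
```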

The second change is to no longer request a counter reset when the client counter matches the latest counter in ChangeRequestHistory#getRequestsSinceWithoutWait. Resetting the counter in this case seems like an undesirable choice. I'd rather have the client receive the same counter and an empty list of changes via a success response, and then re-submit a request for changes, than receive a resetCounter directive and force the server to replay all of the served segments even though the client already has an up-to-date set locally.

One note about this change: I wasn't sure how to test it, since there is no obvious way via the public API of ChangeRequestHistory to get it to execute ChangeRequestHistory#getRequestsSinceWithoutWait with a counter that matches the latest counter. The natural response to that is, why make the change at all? The answer is that the logs listed in #12999 indicate we get into this state somehow in a live cluster. There is a chance that my first change removes the only opportunity to clear the futures without any difference between the latest and client counters, but I'm not 100% confident that is the case. I'm open to suggestions on a test for this. One option would be to make resolveWaitingFutures package private and annotate it as visible for testing only, but I don't really like that pattern.
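A minimal sketch of the intended getRequestsSinceWithoutWait behavior, using simplified stand-in types rather than the real Druid classes:

```java
// Hedged sketch of the second change: when the client's counter equals the latest
// counter, hand back the same counter with an empty change list instead of a
// "reset counter" response. Counter and Snapshot are simplified stand-ins.
import java.util.Collections;
import java.util.List;

class SnapshotSketch<T>
{
  static final class Counter
  {
    final long counter;
    Counter(long counter) { this.counter = counter; }
  }

  static final class Snapshot<T>
  {
    final boolean resetCounter;
    final Counter counter;
    final List<T> requests;

    Snapshot(boolean resetCounter, Counter counter, List<T> requests)
    {
      this.resetCounter = resetCounter;
      this.counter = counter;
      this.requests = requests;
    }
  }

  private Counter lastCounter = new Counter(0);

  synchronized Snapshot<T> getRequestsSinceWithoutWait(Counter clientCounter)
  {
    if (clientCounter.counter == lastCounter.counter) {
      // Previously this fell into the reset path; returning an empty success lets the
      // client keep its local segment state and simply poll again for future changes.
      return new Snapshot<>(false, lastCounter, Collections.<T>emptyList());
    }
    // ... otherwise compute the delta since clientCounter, or reset if it is too old ...
    return new Snapshot<>(true, lastCounter, Collections.<T>emptyList());
  }
}
```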

Current Root Cause Analysis (guess)

As of now this is still a guess, but I think the root cause of this state, where we reset a valid, up-to-date counter, is a race-like condition that occurs when a server processes multiple changes in close proximity.

1. Client A asks Server 1 for changes with Counter.counter = 1, Counter.hash = A, and becomes a waitingFuture on Server 1.
2. Server 1: addChangeRequest(Change X). Counter.counter = 2, Counter.hash = B.
3. Server 1: executes resolveWaitingFutures for Change X (single-threaded executor).
4. Server 1: addChangeRequest(Change Y). Counter.counter = 3, Counter.hash = C.
5. Server 1: returns Counter.counter = 3, Counter.hash = C to Client A.
6. Client A asks Server 1 for changes with Counter.counter = 3, Counter.hash = C, and becomes a waitingFuture on Server 1.
7. Server 1: executes resolveWaitingFutures for Change Y.
8. Server 1: resets the counter for Client A because Counter.counter = 3 from the client is >= Counter.counter = 3 on the server.

Alternatives

Put the requests right back into waitingFutures instead of returning an empty response

In ChangeRequestHistory#resolveWaitingFutures, we could check the counters on waitingFuturesCopy before processing them, and if the counters match, add the item back onto waitingFutures.
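A rough sketch of what that re-queueing check might look like, with illustrative names rather than the actual fields of ChangeRequestHistory:

```java
// Sketch of the alternative: inside resolveWaitingFutures, re-queue any waiting
// future whose counter already matches the latest counter, instead of resolving it
// with an empty delta. Field and type names are hypothetical.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

class ResolveSketch<T>
{
  static final class Counter
  {
    final long counter;
    Counter(long counter) { this.counter = counter; }
  }

  private final Map<CompletableFuture<T>, Counter> waitingFutures = new LinkedHashMap<>();
  private Counter lastCounter = new Counter(0);

  synchronized void resolveWaitingFutures()
  {
    Map<CompletableFuture<T>, Counter> waitingFuturesCopy = new LinkedHashMap<>(waitingFutures);
    waitingFutures.clear();

    for (Map.Entry<CompletableFuture<T>, Counter> entry : waitingFuturesCopy.entrySet()) {
      if (entry.getValue().counter == lastCounter.counter) {
        // This client is already up to date; park it again rather than resolving it
        // with an empty (or reset) response.
        waitingFutures.put(entry.getKey(), entry.getValue());
      } else {
        // ... resolve the future with the delta since entry.getValue() (elided) ...
      }
    }
  }
}
```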

Pros

No wasted return with an empty list of changes for the client to process.

Cons

Adding some complexity and responsibility to resolveWaitingFutures

resolveWaitingFutures with a point-in-time counter for all waitingFutures

Refactor ChangeRequestHistory#resolveWaitingFutures and ChangeRequestHistory#getRequestsSinceWithoutWait to accept a ChangeRequestHistory.Counter object that is the counter we are using as the "latest" change that we will return to the client. This would prevent waitingFutures from "seeing" changes that came after resolveWaitingFutures gets called, but before they are actually resolved.
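A hedged sketch of that refactor, with hypothetical signatures rather than the current ChangeRequestHistory API:

```java
// Illustrative sketch of the second alternative: capture the counter at the moment a
// change is added and resolve every waiting future against that snapshot, so changes
// appended afterwards are not folded into this round's deltas. Signatures are
// hypothetical, not the real API.
class SnapshotResolveSketch
{
  static final class Counter
  {
    final long counter;
    Counter(long counter) { this.counter = counter; }
  }

  private Counter lastCounter = new Counter(0);

  synchronized void addChangeRequest(Object request)
  {
    // ... append the request to the history (elided) ...
    lastCounter = new Counter(lastCounter.counter + 1);
    final Counter snapshot = lastCounter;
    // Resolve against the snapshot, not whatever lastCounter happens to be by the
    // time the resolution actually runs.
    resolveWaitingFutures(snapshot);
  }

  private void resolveWaitingFutures(Counter upTo)
  {
    // ... for each waiting future, call getRequestsSinceWithoutWait(clientCounter, upTo)
    // and return only the changes in the range (clientCounter, upTo] ...
  }
}
```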

Pros

This prevents a funky race-like condition where changes that happen after resolveWaitingFutures is called appear to sneak into the change set returned to the client. I think this condition is the underlying cause of the state where waitingFutures get processed despite having an up-to-date counter.

Cons

A client parked as a waitingFuture will only ever process one change at a time, even when multiple changes arrive in close proximity, which I suspect is often the case in many environments.

Some thoughts

I feel that it is very common for a single segment-serving instance to process multiple changes in rapid succession. As of now, we trigger processing of waitingFutures for every change. This can result in rapid client/server interactions between all of the clients maintaining a server view and the server experiencing the change. From what I've observed in my own test environment, with some extra debug logs added to ChangeRequestHistory, we end up going through multiple cycles of change request deltas within a single second. I wonder whether this is the best pattern, or whether we could bake in some batching of changes that adds a little latency before a server view sees a change but cuts down on the rapid traffic (a rough sketch of one approach follows). At the end of the day, perhaps I'm overthinking this, but I wanted to record my thoughts while they are fresh.
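A loose sketch of one possible batching approach; the delay value and scheduling mechanism are purely illustrative, not a concrete proposal:

```java
// Debounce resolution so several changes arriving in close succession are served as
// one delta, trading a little latency for fewer client round trips. The 100 ms window
// is an arbitrary placeholder.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class DebouncedResolver
{
  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final AtomicBoolean resolvePending = new AtomicBoolean(false);
  private final Runnable resolveWaitingFutures;

  DebouncedResolver(Runnable resolveWaitingFutures)
  {
    this.resolveWaitingFutures = resolveWaitingFutures;
  }

  void onChangeAdded()
  {
    // Only one resolution is scheduled at a time; later changes within the window
    // piggyback on it instead of triggering their own round of resolution.
    if (resolvePending.compareAndSet(false, true)) {
      scheduler.schedule(() -> {
        resolvePending.set(false);
        resolveWaitingFutures.run();
      }, 100, TimeUnit.MILLISECONDS);
    }
  }
}
```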


Key changed/added classes in this PR
  • ChangeRequestHistory

This PR has:

  • [x] been self-reviewed.
    • [x] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [x] been tested in a test Druid cluster.

capistrant · Aug 31 '22 18:08