stellar-core icon indicating copy to clipboard operation
stellar-core copied to clipboard

add an option to run overlay operations in the background

Open marta-lokhova opened this issue 1 year ago • 1 comments

First stab at running some of overlay processing in the background. At a high level, the change splits processing of raw inbound/outbound data and higher level business logic of the application. The rationale for the split is that this way we actually get a more-or-less clear cut of responsibilities between overlay and the rest of the application (SCP, tx queue, etc).

In this change the following is moved to the background: socket reads/writes, XDR decoding, HMAC verification, signature verification of SCP messages, tx hashing, flow control outbound broadcasting. Now that we have BL snapshotting available, we can go further and do transaction validation in the background, and possibly exit early to avoid additional burden on the scheduler and the tx queue. For the sake of doing things iteratively, I'd rather do that in a subsequent PR as this change is already quite big.

There are a couple of design choices in this PR that attempt to de-risk it as much as possible. For simplicity, there's a single "overlay lock" and a single "overlay thread". Single lock simplifies synchronization and removes a whole class of deadlock bugs. A single thread seems good enough for now, as our traffic patterns are fairly heavily throttled, so most of the time we're actually idle, but periodically we receive bursts of data that main thread needs to process as fast as possible. So the value of many threads is not clear at the moment given that we're still bottle-necked by the main thread, which runs SCP that keeps the node in sync.

The code is structured such that everything before we get to a valid StellarMessage is managed by the background thread, and the business logic is done on the main thread. When main thread is done processing, some of the outbound processing, like flow control + asio queuing, is also done in the background.

I wanted to get the initial version out to get review going and kick-off discussions on de-risking these changes further, as well as minimizing the impact when the experimental flag is off.

marta-lokhova avatar Mar 21 '24 18:03 marta-lokhova

Thanks all for the feedback! I pushed a refreshed version of the PR. Working through a few remaining items, like new unit tests and a small remaining todo to make peer's string address thread-safe, but here's an overview of the updates:

  • Updates to peer API: explicitly require most public functions to be main-thread only, audit the rest for thread-safety.
  • Synchronize peer state with an explicit fine-grained lock, remove global locking to make it easier to reason about synchronization. As part of this, add getters and setters for Peer’s state, that all require a lock guard, hinting to the caller that the lock is needed. Hide any other access to state outside of peer. If callers need to access it, they can pass a lambda and let Peer handle the synchronization.
  • Flow control: remove dependency on Peer completely by removing the send callback. The goal is to prevent flow control from ever calling any method in Peer. This is needed since we have two locks now: Peer’s state lock and flow control lock, they have to be ordered, so the order of locking should always be peer lock followed by flow control. This way we can guarantee the two new locks never deadlock.
  • Get rid of any locking at the peer lists layer: the problem with synchronizing peer lists is that it bleeds into too many subsystems past overlay, so it's hard to reason about and review. The updated design avoids the need to synchronize peer lists completely: only the main thread maintains peer lists and officially "drops" peers. The background thread asynchronously watches for any state changes, and closes the socket. Note that this pattern already exists in master today, where we delay the actual socket shutdown on the main thread, while the Peer is dropped from the OverlayManager point of view, so I'm just re-using it but in the background.

marta-lokhova avatar May 09 '24 19:05 marta-lokhova

r+ 894ef3b33f997e4c3214079461d3096f3a6d84d2

graydon avatar May 29 '24 23:05 graydon

@latobarita: retry

graydon avatar May 29 '24 23:05 graydon

@latobarita: retry

graydon avatar May 30 '24 03:05 graydon