Segment loading: Allow cancellation and prioritization of load queue items

Open kfaraz opened this issue 3 years ago • 0 comments

Fixes multiple items in #12881

Description

This PR lays the ground work to allow load queue to safely have an unlimited number of items, and thus eventually phase out maxSegmentsInNodeLoadingQueue and replicationThrottleLimit.

Load queue is already allowed to have unlimited items (by setting maxSegmentsInNodeLoadingQueue = 0) but this leads to

each coordinator run taking a very long time
poor assignments which take several more runs to be rectified

Changes

Classes to review

SegmentLoader
LoadRule, BroadcastDistributionRule
SegmentStateManager
LoadRuleTest
LoadQueuePeon: http and curator

Behavioral changes

Change	Motivation
Both loaded and loading items count towards replication.	Allow coordinator to take corrective action of removing superfluous replicas without waiting for them to be fully loaded.
Load, drop or move operations can be cancelled.	- Allow move of loading items from queue of one server to another. - Allow coordinator to take corrective actions quickly.
During tier shift, always maintain the currently configured level of replication, no matter which tier it happens to be on.	- Queue drop of unneeded segments as soon as possible, thus allowing faster decommission of servers and freeing up disk space to load new segments. - Always maintain target level of replication, thus ensuring that segment read concurrency does not suffer
LoadQueuePeon can distinguish between load of a primary, a replica or a load required for balancing.	- Allow prioritization of items, which becomes important if load queue size is unlimited. - Avoid considering balancing items in load queue as over-replicated.
`replicationThrottleLimit` does not act on first replica in any tier	Throttling first replica on a tier undermines the purpose of tiering. Tiering is not meant for fault tolerance, rather serving different query needs. Thus segments should be available on target tiers as soon as possible.
⚠️ `maxNonPrimaryReplicantsToLoad` does not act on first replica in any tier	This was done keeping in line with the changes to `replicationThrottleLimit` but it should probably be reverted to prevent unexpected behaviour for clusters using this config. cc: @capistrant
`maxSegmentsInNodeLoadingQueue` acts on the number of items assigned to the load queue in the current run rather than the number of items present in the queue at a given time.	Currently, if the configured load queue size is large enough to allow load of some segments while a coordinator run is in progress, the load queue limit is violated as there is always some room in the queue. This causes coordinator runs to get stuck cycling through all the segments in spite of a limited load queue.

Structural changes

Change	Motivation
Add `SegmentLoader` which handles all the load, move and drop operations. The lifecycle of the loader is tied to a single coordinator run. In the changed code, it is instantiated once in every run of `RunRules` and `BalanceSegments`.	- Allow reuse of logic for loading, balancing and broadcasting. - Single place to maintain state of a single run thus allowing better metrics and logging.
Load rules just specify their desired state and leave the actual decision making to the `SegmentLoader`.	Simpler logic for load rules.
Add `SegmentStateManager` that maintains state across coordinator runs and interacts with the load queues.	- Single place to interact with load queue and maintain state of all in-flight segments - Allow reporting of metrics from queue callbacks. - Prevent callbacks from holding references to items from the previous coordinator run.

New metrics

segment/cancelLoad/count
segment/cancelDrop/count
segment/broadcastLoad/count
segment/broadcastDrop/count
some more being added

Pending changes in this PR

Remove old dead code from LoadRule, BroadcastDistributionRule
Fix failing tests
Add more UTs and simulation tests for SegmentLoader
Double-check logs and connect up new metrics
Update metric and config documentation
Complete self-review

Further work

Add prioritization strategy inside LoadQueuePeons. Current prioritization is based only on time. Strategy could be something like:
- action: drop > load (already implemented)
- segment interval: new > old (already implemented)
- load type: primary > broadcast > replica > balancing move
Adjust timeout behaviour and the default value for load queue timeout because an unlimited queue means that segments can be in the queue for much longer than before
Fix balancing strategy to ensure that unlimited load queues do not cause coordinator runs to take forever.
Allow balancing strategy to pick moving segments (this has been allowed in this PR but not enabled in any of the existing strategies)
Add metrics to identify time spent by items in load queue

This PR has:

[ ] been self-reviewed.
- [ ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
[ ] added documentation for new or modified features or behaviors.
[x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
[ ] added or updated version, license, or notice information in licenses.yaml
[ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
[x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
[ ] added integration tests.
[ ] been tested in a test Druid cluster.

Oct 10 '22 02:10 kfaraz