Add new balancer implementation called EnumeratedDistribution Balancer

Open capistrant opened this issue 1 year ago • 5 comments

Description

EnumeratedDistribution Balancer

I have implemented a new BalancerStrategy called EnumeratedDistribution. This balancer uses org.apache.commons.math3.distribution.EnumeratedDistribution to decide which servers to load segments onto and which server to move a segment to. For choosing which server(s) to drop a segment from, it does not use this probabilistic sampling; it simply orders the servers by % utilized so that segments are dropped from the most "full" candidate server first.

The overarching goal of the balancer is to drive a cluster's storage distribution towards uniform disk utilization % across historical servers.

It differs from the diskNormalized balancer in that this balancer is completely disconnected from the cost balancer. It assigns segment destinations strictly through the probabilistic sampling that EnumeratedDistribution provides.

Replicating a segment

Assign probabilities to every candidate server. The probability input for server s is a float between 0 and 1 that correlates to its utilization %. The most utilized server in the cluster is assigned the lowest probability and the least utilized server is assigned the highest.

This is where the impl gets a little funky compared to the other balancers. The other balancers return an Iterator over all of the candidate servers after they have been sorted according to the balancer algorithm. For this new balancer, it is less straightforward. We sample from the set of candidates with the aforementioned probabilities (the EnumeratedDistribution library handles this for us) a number of times equal to the size of the provided candidate list. We then de-dup the list of sampled servers, since it will likely - by design - have picked the higher-probability servers multiple times. This means the list returned to the caller is not going to contain every candidate. The feeling is that this shouldn't be a problem when you have a meaningful number of historicals, since the number of replicas is likely fairly low. However, the method signature could be changed to define a minimum number of servers returned if we needed to guarantee that N servers were in the iterable that the caller receives.
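To make the weighting and sampling concrete, here is a rough sketch of how it could look with commons-math3. The CandidateServer type, its fields, the utilization math, and the class/method names are illustrative stand-ins for Druid's actual server representation, not the code in this PR.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import org.apache.commons.math3.distribution.EnumeratedDistribution;
import org.apache.commons.math3.util.Pair;

public class EnumeratedDistributionSketch
{
  // Illustrative stand-in for a historical server's storage state.
  public record CandidateServer(String name, long sizeUsedBytes, long maxSizeBytes)
  {
    public double utilization()
    {
      return maxSizeBytes <= 0 ? 1.0 : (double) sizeUsedBytes / maxSizeBytes;
    }
  }

  // Weight each server by its free-space fraction, so emptier servers are drawn more often.
  // EnumeratedDistribution normalizes the weights internally, so they need not sum to 1
  // (a real implementation would guard against all weights being zero).
  static EnumeratedDistribution<CandidateServer> buildDistribution(List<CandidateServer> candidates)
  {
    final List<Pair<CandidateServer, Double>> weights = new ArrayList<>();
    for (CandidateServer server : candidates) {
      weights.add(new Pair<>(server, 1.0 - server.utilization()));
    }
    return new EnumeratedDistribution<>(weights);
  }

  // Sample once per candidate, then de-dup while preserving the order in which
  // servers were first drawn. The result may contain fewer servers than the
  // candidate list, by design.
  static Iterator<CandidateServer> sampleForReplication(List<CandidateServer> candidates)
  {
    final EnumeratedDistribution<CandidateServer> distribution = buildDistribution(candidates);
    final LinkedHashSet<CandidateServer> sampled = new LinkedHashSet<>();
    for (int i = 0; i < candidates.size(); i++) {
      sampled.add(distribution.sample());
    }
    return sampled.iterator();
  }
}
```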

Moving a segment

Assign probabilities to every candidate server. The probability input for an individual server is a float between 0 and 1 that correlates to its disk utilization %. The most utilized server in the cluster is assigned the lowest probability and the least utilized server is assigned the highest.

Sample once from the set of candidate servers using those probabilities (the EnumeratedDistribution library handles this for us) and return the sampled server.
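Continuing the same hypothetical sketch (reusing the CandidateServer type and buildDistribution helper from above), the move decision reduces to a single weighted draw:

```java
// Illustrative only: a single weighted sample picks the destination for the segment being moved.
static CandidateServer pickMoveDestination(List<CandidateServer> candidates)
{
  return buildDistribution(candidates).sample();
}
```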

Dropping a segment

Order the candidate servers in descending order by segment cache space utilization % and return an iterable over them. This pushes the cluster to drop from the "most full" servers first, working towards the ultimate goal of converging on a uniform utilization % across historical servers.
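Again in terms of the hypothetical CandidateServer above, the drop path is just a descending sort by utilization rather than a sample:

```java
// Illustrative only: most-utilized servers come first, so drops hit the "fullest" servers.
// Uses java.util.Comparator in addition to the imports from the first sketch.
static Iterator<CandidateServer> orderForDrop(List<CandidateServer> candidates)
{
  return candidates.stream()
                   .sorted(Comparator.comparingDouble(CandidateServer::utilization).reversed())
                   .iterator();
}
```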

Release note

Add a new BalancerStrategy called EnumeratedDistributionBalancerStrategy. This balancer uses org.apache.commons.math3.distribution.EnumeratedDistribution to make segment assignment, move, and drop decisions using probabilities that correlate to how much available segment storage a server has remaining. This balancer will converge historical servers to a uniform level of storage consumption (expressed as a % of available storage utilized on each server). To enable this balancer, set druid.coordinator.balancer.strategy to enumeratedDistribution.


Key changed/added classes in this PR
  • EnumeratedDistributionBalancerStrategy

This PR has:

  • [x] been self-reviewed.
  • [X] added documentation for new or modified features or behaviors.
  • [X] a release note entry in the PR description.
  • [X] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [X] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [X] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [ ] added integration tests.
  • [X] been tested in a test Druid cluster.

capistrant avatar Apr 25 '24 19:04 capistrant

Thanks for adding the new balancer strategy, @capistrant !

The overarching goal of the balancer is to drive a cluster's storage distribution towards uniform disk utilization % across historical servers.

Do you think the current strategies fail to achieve this in certain cases? It would be nice to know some real world scenarios where the new strategy does a better job of balancing servers as compared to the current strategies.

It would also be nice to have some tests based on the CoordinatorSimulationBaseTest that demonstrate how this strategy performs in a typical setup.

kfaraz avatar Apr 26 '24 07:04 kfaraz

Thanks for adding the new balancer strategy, @capistrant !

The overarching goal of the balancer is to drive a cluster's storage distribution towards uniform disk utilization % across historical servers.

Do you think the current strategies fail to achieve this in certain cases? It would be nice to know some real world scenarios where the new strategy does a better job of balancing servers as compared to the current strategies.

From my experience running a large cluster with a high segment count (over 6 million), high datasource count (over 12k), and great variance in segment size (KB to GB), the cost balancer creates a highly skewed environment when it comes to disk allocation. Maybe if CPU constraints were not a problem and we could balance many segments in short order, the algorithm would work its way towards less storage skew. But in our environment, even with 90 cores on the coordinator dedicated to balancing, we cannot balance a meaningful number of segments while also keeping coordination time within reason. We had servers with nearly 90% utilization and others under 50%, with little to no progress towards balance over months and months of uptime.

Perhaps an alternative for us could have been cachingCost, if that meaningfully drops CPU usage. But we had experienced odd behavior that we couldn't pinpoint when testing it in lower envs in the past, and thus abandoned the idea of using it in prod until exhaustive testing could be done.

That left us with the random balancer, which by nature leads to skew in an environment like ours where segment sizes vary so much. We did really like its performance in terms of how many segments we could balance, though, and this was the trigger for us to write this balancer: it uses simple weighting to avoid the pitfalls of random for us, while still allowing us to move many segments in a short amount of time.

It would also be nice to have some tests based on the CoordinatorSimulationBaseTest that demonstrate how this strategy performs in a typical setup.

I will look into this shortly, thanks for bringing it to my attention!

capistrant avatar Apr 26 '24 16:04 capistrant

Now that I start to dig in more after looking at CoordinatorSimulationBaseTest, I'm wondering if changes in coordination have made this new balancer less necessary for the case I currently see. We are admittedly on an old major version of Druid (24), and when deciding whether to contribute this back, I only checked the docs of newer versions to see if any new balancers had been added. For instance, SegmentToMoveCalculator is something I'm just now starting to learn about. I do wonder, though, whether we would still be compute constrained in an env like mine. The cost calculations are simply too expensive for us to move enough segments, so I don't know that any improvements in the logic controlling the volume being moved make a difference to us without significantly cheapening the balancer decision making.

capistrant avatar Apr 26 '24 20:04 capistrant

Now that I start to dig in more after looking at CoordinatorSimulationBaseTest, I'm wondering if changes in coordination have made this new balancer less necessary for the case I currently see.

Yes, I was thinking the same. There have been several major improvements to the Coordinator starting with Druid 27. I would recommend trying out a newer Druid version and seeing if it solves the problems with your cluster. It would also be a nice validation of the latest changes made to the Coordinator. 🙂

I do wonder though, if we would still be compute constrained in an env like mine.

I shouldn't think so, because initial assignments are very fast thanks to RoundRobinServerSelector.

For balancing, the coordinator dynamically decides what it can take up without becoming too slow and balances only that.

Let me know what you think.

kfaraz avatar Apr 27 '24 02:04 kfaraz

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Jun 27 '24 00:06 github-actions[bot]

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Jul 26 '24 00:07 github-actions[bot]

Now that I start to dig in more after looking at CoordinatorSimulationBaseTest, I'm wondering if changes in coordination have made this new balancer less necessary for the case I currently see.

Yes, I was thinking the same. There have been several major improvements to the Coordinator starting with Druid 27. I would recommend trying out a newer Druid version and seeing if it solves the problems with your cluster. It would also be a nice validation of the latest changes made to the Coordinator. 🙂

@capistrant , just checking if you got a chance to try out the newer versions of Druid which have the above discussed coordinator improvements.

kfaraz avatar Aug 10 '24 09:08 kfaraz

@kfaraz unfortunately no. We have been in a holding pattern with our supported Druid version at my company due to resource constraints. It's undetermined when we will actually get upgraded. Probably not until 2025 Q1 at this point if I had to guess.

capistrant avatar Aug 13 '24 21:08 capistrant

@kfaraz unfortunately no. We have been in a holding pattern with our supported Druid version at my company due to resource constraints. It's undetermined when we will actually get upgraded. Probably not until 2025 Q1 at this point if I had to guess.

Ah, I see. 🙂 Thanks for the update.

kfaraz avatar Aug 14 '24 04:08 kfaraz