
KAFKA-12495: Exponential backoff retry to prevent rebalance storms when worker joins after revoking rebalance

Open vamossagar12 opened this issue 3 years ago • 13 comments

Currently, the incremental rebalance protocol does not allow a subsequent revoking rebalance when a worker joins right after one. This can lead to an imbalance in assignments. See KAFKA-12495 for more details.

This PR aims to fix the above. Note that there is already another PR, https://github.com/apache/kafka/pull/10367, that addresses this. The main difference between the two approaches is that this one introduces an exponential backoff delay between two successive revoking rebalances, which prevents rebalance storms without waiting for the entire scheduled rebalance delay.

Notable changes in this PR:

  1. Allow successive revoking rebalances from the same leader.
  2. If there are successive revoking rebalances, introduce a delay between them based on an exponential backoff policy (see the sketch after this list). The delay is applied using the existing delay/scheduledRebalance mechanism.
  3. The exponential backoff clock is reset when the cluster reaches a balanced load, i.e., when there are no further successive revoking rebalances.
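A minimal sketch of the backoff policy described in points 2 and 3, assuming a standalone helper; the class and member names here (RevocationBackoff, initialDelayMs, maxDelayMs) are hypothetical and do not come from the actual Connect assignor code:

```java
// Hypothetical helper, not actual Kafka code: tracks how many revoking
// rebalances have happened in a row and derives an exponentially growing delay.
public class RevocationBackoff {
    private final long initialDelayMs;
    private final long maxDelayMs;
    private int consecutiveRevocations = 0;

    public RevocationBackoff(long initialDelayMs, long maxDelayMs) {
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    // Delay before the next revoking rebalance: doubles with every consecutive
    // revoking round, capped at maxDelayMs.
    public long nextDelayMs() {
        int exponent = Math.min(consecutiveRevocations, 20); // cap the exponent to avoid overflow
        long delay = Math.min(maxDelayMs, initialDelayMs * (1L << exponent));
        consecutiveRevocations++;
        return delay;
    }

    // Reset once the cluster reaches a balanced assignment, i.e. a rebalance
    // round completes without any further load-balancing revocations.
    public void reset() {
        consecutiveRevocations = 0;
    }
}
```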

vamossagar12 avatar Aug 26 '22 10:08 vamossagar12

@C0urante, @showuon, @yashmayya I have created this PR as per the discussions on the ticket. Please review whenever you get the chance. Thanks!

vamossagar12 avatar Aug 26 '22 10:08 vamossagar12

Thanks @vamossagar12. I haven't looked too closely at the code yet, but I have a couple of high-level thoughts:

  1. We should not introduce a backoff delay the first time a new worker joins the cluster. We should only introduce the delay between consecutive rounds with load-balancing revocation.
  2. We should reset the backoff delay as soon as we're able to complete a round of rebalancing that does not require any load-balancing revocations. In other words, as soon as everyone rejoins the cluster and we don't see any imbalances that need to be rectified by revoking connectors/tasks from one set of workers in one round and then re-allocating them to a different set of workers in a follow-up round, we can reset. If we've just performed a round of revocation, everyone has now joined the cluster, and we're going to allocate those revoked connectors/tasks to new workers, then as long as that allocation leads to a balanced assignment, we can reset the clock. (A rough sketch of this apply/reset logic follows the list.)
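As a rough illustration of these two points, the snippet below continues the hypothetical RevocationBackoff sketch from the PR description above; it is an assumption about how the apply/reset logic could look, not the actual assignor implementation:

```java
// Hypothetical leader-side policy built on the RevocationBackoff sketch above;
// the constructor arguments (1 s initial, 60 s max) are illustrative defaults.
class RevocationPolicy {
    private final RevocationBackoff backoff = new RevocationBackoff(1_000L, 60_000L);

    // Point 1: no delay the first time a new worker joins a balanced cluster;
    // the backoff only applies between consecutive load-balancing revocations.
    long delayBeforeRevocationMs(boolean previousRoundRevoked) {
        return previousRoundRevoked ? backoff.nextDelayMs() : 0L;
    }

    // Point 2: reset as soon as a round completes without load-balancing
    // revocations, i.e. the cluster has reached a balanced assignment again.
    void onRoundCompleted(boolean revocationsPerformed) {
        if (!revocationsPerformed) {
            backoff.reset();
        }
    }
}
```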

C0urante avatar Aug 26 '22 13:08 C0urante

In this PR, similar to Luke's PR, I have added a condition that allows a revoking rebalance to happen consecutively. The only difference is that whether the revocation happens is decided by the current time relative to the next scheduled revoking rebalance per the exponential backoff. I noticed the flip side of that, as I needed to add a delay in some of the tests to get a revoking rebalance when the workers joined before the next scheduled rebalance.

I think what you are saying makes sense. Let me give it a try. Thanks!

vamossagar12 avatar Aug 26 '22 15:08 vamossagar12

@C0urante, I made the changes based on my understanding of your suggestions. Please review whenever you get the chance. Thanks!

vamossagar12 avatar Aug 28 '22 07:08 vamossagar12

I had a look at the non-test code and left some comments. I agree with Chris that we should not introduce the exponential backoff for the first revocation, and that we should reset it as soon as we don't need it. Thank you for working on this.

Thanks @showuon. I removed the change that revoked the first time around. The delay is now introduced only when there are successive revoking rebalances.

vamossagar12 avatar Aug 29 '22 15:08 vamossagar12

Also, I have made the changes to use the existing delay/scheduledRebalance mechanism to delay the revocation. cc @showuon, @C0urante
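A hedged sketch of what reusing a scheduled-rebalance style delay for the revocation could look like: record a deadline for the next revoking round and hand the remaining delay back with the assignment. The names (ScheduledRevocation, remainingDelayMs) are hypothetical, not the real delay/scheduledRebalance internals:

```java
// Hypothetical sketch: the leader defers a revoking round until a scheduled
// deadline has passed, distributing the remaining delay instead of revoking now.
class ScheduledRevocation {
    private long scheduledRevocationMs = 0L; // 0 means no revocation is currently scheduled

    // Remaining delay (ms) the leader would hand out with the assignment;
    // 0 means the revoking rebalance may proceed immediately.
    long remainingDelayMs(long nowMs, long backoffMs) {
        if (scheduledRevocationMs == 0L) {
            scheduledRevocationMs = nowMs + backoffMs; // schedule the next revoking round
        }
        return Math.max(0L, scheduledRevocationMs - nowMs);
    }

    // Clear once the delayed revoking rebalance has actually run.
    void clear() {
        scheduledRevocationMs = 0L;
    }
}
```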

vamossagar12 avatar Aug 31 '22 12:08 vamossagar12

Thanks @C0urante. I had one question on the proposed approach regarding delays. Since the other changes are fairly small, I'll wait for your response on that one before making the rest of the changes.

vamossagar12 avatar Sep 02 '22 11:09 vamossagar12

The changes look good; I left some comments. Also, could you re-enable the test RebalanceSourceConnectorsIntegrationTest#testMultipleWorkersRejoining? This fix should resolve that flaky test.

Thanks @showuon. I enabled it and it passed locally. I've pushed the change along with the other review comments (and a couple of clarifying questions).

vamossagar12 avatar Sep 06 '22 14:09 vamossagar12

LGTM! Thanks for the improvement!

Thank you!

vamossagar12 avatar Sep 07 '22 12:09 vamossagar12

@C0urante I made the switch back as suggested. Please review whenever you get the chance.

vamossagar12 avatar Sep 09 '22 16:09 vamossagar12

@vamossagar12 there are failing unit tests, can you check those out and ping us after you've taken a look and fixed any issues?

C0urante avatar Sep 13 '22 13:09 C0urante

> @vamossagar12 there are failing unit tests, can you check those out and ping us after you've taken a look and fixed any issues?

Oops, sorry about that. I didn't check the test results after the latest changes. I'll check and fix those and ask for a review after that.

vamossagar12 avatar Sep 13 '22 13:09 vamossagar12


I checked in the changes fixing the two unit tests that were failing. In this run there didn't seem to be any Connect-specific failures. My bad on the oversight last time around.

vamossagar12 avatar Sep 17 '22 10:09 vamossagar12

@showuon The follow-up PR here is ready for review; this should be the last step necessary before we can merge this PR and fix this long-standing rebalancing bug.

C0urante avatar Nov 02 '22 15:11 C0urante

@showuon Whenever you get the chance, could you also please review Chris's PR? We should be able to get this one through as well.

vamossagar12 avatar Nov 07 '22 04:11 vamossagar12

@C0urante, do you want to take another pass on this one whenever you get the chance?

vamossagar12 avatar Nov 15 '22 16:11 vamossagar12

@vamossagar12 the build is failing with a Checkstyle error; can you take a look?

C0urante avatar Nov 15 '22 16:11 C0urante

@C0urante, I did that, but it looks like my force push wiped out the PR's merge commit since I didn't pull the latest changes first :( I see you have deleted the branch. Would it be possible to create another PR for this? Sorry about this :(

vamossagar12 avatar Nov 15 '22 17:11 vamossagar12

@vamossagar12 It looks like the commit is still present in GitHub: https://github.com/vamossagar12/kafka/tree/422c81fc0a2dd19fee0ed1eb033ef35fc9e27ba1

Can you fetch that commit locally, then add a commit on top of it that removes the @Ignore annotation (and the import for it), and push that to this branch?

C0urante avatar Nov 15 '22 17:11 C0urante

Thanks @C0urante. I have added back the merge commit and pushed it here, along with the checkstyle fix. I ran the tests locally and a few MM-related ones failed.

vamossagar12 avatar Nov 15 '22 18:11 vamossagar12

Thanks @vamossagar12. I've run the tests locally and they've succeeded; will wait to see what happens with CI but I think this should be okay as-is.

C0urante avatar Nov 15 '22 18:11 C0urante

The single test failure is unrelated. Merging...

C0urante avatar Nov 15 '22 21:11 C0urante

Yeah!!! Thank you @vamossagar12 @C0urante! Nice teamwork!

showuon avatar Nov 16 '22 01:11 showuon

One question for @C0urante: do you think we should backport this to the 3.3 branch?

showuon avatar Nov 16 '22 01:11 showuon

@showuon I was thinking about it. IMO the additional complexity introduced by this change is a little too risky to backport, but I'm almost on the fence here. If you'd like to see it on 3.3, I wouldn't be opposed.

C0urante avatar Nov 16 '22 03:11 C0urante

I think the 3.4 release is also around the corner (a few weeks away, maybe?). Would it be better to have it in 3.4?

vamossagar12 avatar Nov 16 '22 04:11 vamossagar12

Yes, I have the same thought as Chris. Besides, this bug has existed for a long time, so it's not that urgent to get it into 3.3. Let's keep it in 3.4 (the trunk branch), which is what we've currently done. :)

showuon avatar Nov 16 '22 06:11 showuon

Hey @C0urante, I was wondering whether the exponential backoff we introduced as part of this PR should go somewhere in the docs? I'm asking since this is a deviation from how things worked before. WDYT?

vamossagar12 avatar Jan 22 '23 08:01 vamossagar12

@vamossagar12 I don't think it's necessary to call this out anywhere unless it's caused unexpected issues with our users. The intention behind the exponential backoff is to avoid rebalance storms but I'm still not sure when those would ever realistically happen, so until we see otherwise, I'd prefer to leave it as an internal implementation detail.

C0urante avatar Jan 23 '23 14:01 C0urante

@C0urante, thanks for your response. Makes sense.

vamossagar12 avatar Jan 24 '23 14:01 vamossagar12