KAFKA-17002: Integrated partition leader epoch for Persister APIs (KIP-932)
This PR integrates the partition leader epoch into the Persister API calls. The write RPC is retried once on a leader epoch failure.
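Below is a minimal, illustrative sketch of the retry-once behaviour described above. It is not the actual Kafka code; `PersisterClient`, `LeaderEpochSupplier`, `WriteResult`, and the error enum are all hypothetical names used only to show the idea of issuing the write with the current leader epoch and retrying exactly once with a refreshed epoch if the first attempt is fenced.

```java
import java.util.concurrent.CompletableFuture;

public final class WriteRetryExample {

    enum ErrorCode { NONE, FENCED_LEADER_EPOCH }

    record WriteResult(ErrorCode error) { }

    // Hypothetical stand-in for the persister write RPC.
    interface PersisterClient {
        CompletableFuture<WriteResult> writeState(String groupId, String topicPartition, int leaderEpoch);
    }

    // Hypothetical source of the latest known leader epoch for a partition.
    interface LeaderEpochSupplier {
        int latestLeaderEpoch(String topicPartition);
    }

    /**
     * Issues the persister write with the current leader epoch and retries exactly once
     * with a freshly fetched epoch if the first attempt is fenced. A second fencing
     * error is returned to the caller as-is.
     */
    static CompletableFuture<WriteResult> writeWithSingleRetry(
            PersisterClient persister,
            LeaderEpochSupplier epochSupplier,
            String groupId,
            String topicPartition) {
        int epoch = epochSupplier.latestLeaderEpoch(topicPartition);
        return persister.writeState(groupId, topicPartition, epoch).thenCompose(result -> {
            if (result.error() == ErrorCode.FENCED_LEADER_EPOCH) {
                // Re-read the leader epoch once and retry the write.
                int refreshedEpoch = epochSupplier.latestLeaderEpoch(topicPartition);
                return persister.writeState(groupId, topicPartition, refreshedEpoch);
            }
            return CompletableFuture.completedFuture(result);
        });
    }
}
```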
Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
I'm not quite sure about this one. Integrating the leader epoch into the persister write API is good, but I am not sure whether the handling of the fenced case makes sense. Essentially, if the Write RPC returns a fenced error, it means that a later leader epoch is being used for this share-partition state, and that means another share-partition leader for this partition is using a later epoch, which presumably means this SPL is philosophically dead.
Hmm, the only scenario I was thinking about is: if the partition epoch is bumped for some other reason and the same broker remains the partition leader, then a retry would be needed to fetch the latest partition epoch. If that's not the case, then I agree the retry doesn't make sense.
I was also thinking about whether we should have an identifier/boolean in the SPL that is toggled when the SPL is no longer the leader. Once set, this identifier would fail any incoming request with a not-leader exception, so the client can switch to the new leader.
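To make the idea concrete, here is a rough sketch of such a fence flag, assuming a hypothetical `SharePartitionState` class and `NotLeaderException`; none of these names come from the actual SharePartition code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public final class FencedSharePartitionExample {

    static class NotLeaderException extends RuntimeException {
        NotLeaderException(String message) { super(message); }
    }

    static final class SharePartitionState {
        private final AtomicBoolean fenced = new AtomicBoolean(false);

        /** Called when the broker learns it is no longer the share-partition leader. */
        void markFenced() {
            fenced.set(true);
        }

        /** Every request path checks the flag first and fails fast once fenced. */
        void ensureNotFenced() {
            if (fenced.get()) {
                throw new NotLeaderException("Share-partition leader has been fenced; "
                        + "client should refresh metadata and retry against the new leader");
            }
        }
    }

    public static void main(String[] args) {
        SharePartitionState state = new SharePartitionState();
        state.ensureNotFenced();   // passes while this broker is still the leader
        state.markFenced();
        try {
            state.ensureNotFenced();
        } catch (NotLeaderException e) {
            System.out.println("Request rejected: " + e.getMessage());
        }
    }
}
```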
@mumrah Can you please review as well and provide input on the partition epoch handling? @AndrewJSchofield might be right that we should just fail the request when such a leader error occurs. Just confirming before making the change.
@junrao @mumrah Could you please advise on the partition epoch handling? It would be great if you could take a look.
@junrao Sorry, I was in the middle of my changes when you last reviewed, but thanks for the feedback. I have completed the changes now; could you please re-review? Sorry for the trouble.
@junrao I have made a lot of changes and the PR is getting bigger and bigger. I am reverting some of my local changes and will write up a plan with JIRAs so we can proceed.
So here is the summary. Sorry, the PR keeps growing, as it is now more about error handling and less about just leader epoch propagation.
The current PR includes the following:
- Fetches the leader epoch and passes it to SharePartition. If a leader epoch (fenced/state epoch), unknown topic partition, or unknown group error arrives, the SharePartition is removed from the cache and, at the same time, marked fenced.
- For this, a Fenced state has been added in the PR.
- There are two checks for the fenced state in SharePartition: a) while acquiring records and b) while acquiring the fetch lock. Hence any in-flight request that tries to acquire records will fail, and even before that, requests will start failing while acquiring the lock. This prevents us from sending any new records to consumers for fenced share partitions. However, I didn't add the check to the acknowledge and release APIs, as they will eventually fail while persisting, if they should.
- Added leader epoch error handling at the SharePartition level, which means that if 3 out of 5 topic partitions in a request are fenced, the request proceeds for the remaining 2 share partitions rather than failing completely (see the sketch after this list).
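A minimal sketch of that per-partition handling, assuming hypothetical names (the real share-fetch path is not shown here): fenced partitions are split out and answered with an error, while the remaining partitions continue through the fetch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public final class PartialFetchExample {

    /**
     * Splits the requested partitions into the ones that can be served and the ones that must
     * be answered with a fenced error, so one fenced partition does not fail the whole request.
     */
    static Map<String, List<String>> partitionByFenced(List<String> requested, Predicate<String> isFenced) {
        List<String> fetchable = new ArrayList<>();
        List<String> fenced = new ArrayList<>();
        for (String tp : requested) {
            if (isFenced.test(tp)) {
                fenced.add(tp);
            } else {
                fetchable.add(tp);
            }
        }
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("fetchable", fetchable);
        result.put("fenced", fenced);
        return result;
    }

    public static void main(String[] args) {
        List<String> requested = List.of("t0-0", "t0-1", "t0-2", "t0-3", "t0-4");
        // Suppose three of the five requested partitions have been fenced.
        var split = partitionByFenced(requested, tp -> tp.endsWith("0") || tp.endsWith("1") || tp.endsWith("2"));
        System.out.println("Proceed with fetch for: " + split.get("fetchable")); // [t0-3, t0-4]
        System.out.println("Return fenced errors for: " + split.get("fenced"));  // [t0-0, t0-1, t0-2]
    }
}
```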
Some follow-ups to do:
- https://issues.apache.org/jira/browse/KAFKA-17510 - Additional error handling, and triggering a fetch once the initialization of the share partition is completed and the request is in purgatory.
- https://issues.apache.org/jira/browse/KAFKA-17887 - Error handling for the response log result in delayed share fetch.
- Better state machine transitions in SharePartition.
- Consider using a listener to fence and remove the SharePartition.
- Add a leader epoch check prior to removing the share partition instance.
- Send a response for error partitions as well. Currently, if all partitions fail or succeed, a failure/success response is sent for all of them; otherwise a response is sent only for the partitions for which data is fetched.
- Additionally, I am thinking of renaming ShareFetchData to ShareFetch and providing better handling of future completion, as the code is a bit fragmented today.
I am planning to have a couple of follow-up PRs, if @junrao @AndrewJSchofield @mumrah you think that's fine, since with tests and changes this PR is getting bigger and bigger.
Hi @junrao, thanks for the feedback. Addressed and replied. There is one test that was previously marked as flaky, but I can see the failure in this PR as well. I have updated the ticket for the flaky test too:
https://issues.apache.org/jira/browse/KAFKA-17463?jql=text%20~%20%22testShareGroups%22