kafka KAFKA-14830: Illegal state error in transactional producer

When the producer's transaction manager receives a notification that an error has occurred during a transaction, it takes steps to abort the transaction and reset its internal state.

Users have reported the following case where a producer experiences timeouts while in a transaction:

The TransactionManager (TM) starts with state READY and epoch set to 0
A transaction (T1) begins and TM sets its internal state to IN_TRANSACTION
Batches are created and sent off to their respective brokers
A timeout threshold is hit
T1 starts the abort process
1. TM state is set to ABORTING_TRANSACTION
2. The batches involved with T1 are marked as expired
3. TM is reinitialized, bumping the epoch from 0 to 1 and setting its state to READY
A moment later, in the Sender thread, one of the failed batches calls handleFailedBatch()
handleFailedBatch() sets the TM state to ABORTABLE_ERROR which is an invalid state transition from READY, hence the exception

This change compares the transaction manager's current epoch (1) with the batch's epoch (0). If they're different, the batch is considered "stale" and can be ignored (though a DEBUG message is logged).

Aug 27 '24 22:08 kirktrue

I think the overall approach makes sense. But I would like to see some tests to see if the issue is improved. If so the logging could also give us some more insight.

Aug 28 '24 19:08 jolshan

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

Jan 16 '25 03:01 github-actions[bot]

Still needed, just lower priority 😞

Jan 16 '25 18:01 kirktrue

cc @k-raina

Mar 26 '25 19:03 kirktrue

I think the overall approach makes sense. But I would like to see some tests to see if the issue is improved. If so the logging could also give us some more insight.

@jolshan—The unit tests mimic the use cases that were seen in the wild. What other test cases should we consider? Thanks!

Mar 31 '25 20:03 kirktrue

@jolshan @mjsax—Anything else we need to check before we can merge? Thanks!

Jun 11 '25 19:06 kirktrue

@jolshan @mjsax—Another ping to check if we can merge... Thanks!

Jun 16 '25 21:06 kirktrue

Hey thanks. I took a step back and I'm wondering if this is the right approach. Let me try to get back to you on this -- I need to get a better idea of the whole flow. The discussions on the check made me realize I was missing something.

Jun 17 '25 16:06 jolshan

Ok -- I'm back. I think the thing that wasn't sitting right for me, and what I realized from our discussion with the producer ID overflow is whether this is the right place to make the change and the right mental model.

Specifically, we don't have a great way to distinguish benign stale requests from those that that could indicate a divergence of state or other real problem.

I'm wondering if we can get to the heart of this:

A timeout threshold is hit

T1 starts the abort process 1. TM state is set to ABORTING_TRANSACTION 2. The batches involved with T1 are marked as expired 3. TM is reinitialized, bumping the epoch from 0 to 1 and setting its state to READY

A moment later, in the Sender thread, one of the failed batches calls handleFailedBatch()

Do we know if there is any way we can ensure the exact inflight requests during the timeout are marked as stale vs just assuming this from the epoch in the batch? I think for some errors, we close the connection -- we may not want to do that here, but thinking about addressing the problem at the root and not adding small changes in ways that we may not realize the side effects for yet. (As we saw in the test failures from making the change, the current part of code may cause other issues)

Jun 17 '25 20:06 jolshan