KIP-360: Improve reliability of idempotent/transactional producer
Cover letter
rfc: link
Problem (from link)
A fatal error occurs when the producer cannot assign sequence numbers to the records that it produces. To maintain the correct order, each record produced by an idempotent or transactional producer has a sequence number, and the broker only accepts writes in sequential order, with no gaps or out-of-order records allowed. To assign the correct sequence number, the producer needs to know which requests have been successfully written to the log, which requires a successful response from the broker.
If a produce request fails with a retriable error, the producer will retry it until it either succeeds or hits the configured delivery timeout (delivery.timeout.ms), at which point the records are expired by the producer. If the records expire, the producer can’t be sure whether the records were written or not. When this happens, the producer can’t continue because it doesn’t know what sequence number to assign to the next record.
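The sequencing rule described above can be illustrated with a small model. This is only a sketch: `PartitionState` and `append` are hypothetical names, not Redpanda or Kafka code, and the model ignores details such as duplicate detection.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    """Hypothetical per-producer state tracked by a broker for one partition."""
    next_seq: int = 0


def append(state: PartitionState, first_seq: int, count: int) -> bool:
    """Accept a batch only if it starts exactly at the expected sequence number."""
    if first_seq != state.next_seq:
        return False  # gap or out-of-order batch: rejected
    state.next_seq += count
    return True


state = PartitionState()
assert append(state, 0, 3)  # sequences 0..2 are written

# A batch with sequences 3..4 expires on the producer after delivery.timeout.ms.
# The producer does not know whether the broker wrote it. If it assumes the
# batch was written and continues at 5 while the broker still expects 3,
# the next write leaves a gap and is rejected.
assert not append(state, 5, 1)
```

In this model the producer has no safe choice for the next sequence number, which is the situation the PR addresses.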
How to solve
KIP-360 solves these problems by providing a way to re-initialize a producer ID without having to create a new producer. In addition to the producer’s transactional ID, InitProducerId now optionally takes a producer ID and producer epoch as well. When these are present in the request, the broker compares them to the existing producer ID and epoch for that transactional ID. If they match, then no other producer has been initialized for that transactional ID, and it is safe for the producer to continue processing. The broker will increment the producer epoch and return it to the producer. When the epoch is bumped, the sequence number is also reset to zero, allowing the producer to continue through both unknown producer and out of sequence errors.
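The broker-side check described above might be modeled roughly as follows. This is a simplified sketch under stated assumptions: `Coordinator`, `ProducerEntry`, and `FencedError` are hypothetical names, and real brokers handle additional cases (for example epoch exhaustion) that are not shown.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class ProducerEntry:
    producer_id: int
    epoch: int


class FencedError(Exception):
    """Hypothetical error: the caller's producer ID/epoch is no longer current."""


class Coordinator:
    """Toy model of the state kept per transactional ID."""

    def __init__(self) -> None:
        self._entries: Dict[str, ProducerEntry] = {}
        self._next_pid = 1

    def init_producer_id(
        self,
        tx_id: str,
        producer_id: Optional[int] = None,
        epoch: Optional[int] = None,
    ) -> Tuple[int, int]:
        entry = self._entries.get(tx_id)
        if entry is None:
            # First initialization for this transactional ID.
            entry = ProducerEntry(self._next_pid, 0)
            self._next_pid += 1
            self._entries[tx_id] = entry
            return entry.producer_id, entry.epoch
        if producer_id is not None and (producer_id, epoch) != (
            entry.producer_id,
            entry.epoch,
        ):
            # Another producer has been initialized since the caller's last init.
            raise FencedError(tx_id)
        # Either a fresh producer taking over, or a matching producer asking
        # for a bump: increment the epoch (the caller resets sequences to zero).
        entry.epoch += 1
        return entry.producer_id, entry.epoch


coordinator = Coordinator()
pid, ep = coordinator.init_producer_id("tx-1")
pid, ep = coordinator.init_producer_id("tx-1", pid, ep)  # safe epoch bump
```

A request carrying a stale epoch raises `FencedError` in this model, which corresponds to the case where the IDs do not match and continuing would be unsafe.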
With a safe way to bump its epoch, the producer can now recover from a number of previously fatal errors. When the producer encounters one of these errors, and the broker supports the new InitProducerId version, it will transition to an abortable error state, rather than a fatal one. When the application aborts the transaction, the producer will internally call InitProducerId after aborting, which bumps the epoch and allows it to continue. This epoch bump happens transparently as part of the call to KafkaProducer#abortTransaction.
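The producer-side recovery flow can be sketched as a small state machine. The names here (`ProducerModel`, `State`, `bump_epoch`) are hypothetical and only illustrate the transitions described above; they do not reflect the actual Java client or Redpanda internals.

```python
from enum import Enum, auto
from typing import Callable, Tuple


class State(Enum):
    READY = auto()
    ABORTABLE_ERROR = auto()
    FATAL_ERROR = auto()


class ProducerModel:
    """Toy model of how a producer reacts to a sequencing error."""

    def __init__(
        self,
        producer_id: int,
        epoch: int,
        bump_epoch: Callable[[int, int], Tuple[int, int]],
        broker_supports_bump: bool,
    ) -> None:
        self.producer_id = producer_id
        self.epoch = epoch
        self.next_seq = 0
        self.state = State.READY
        self._bump_epoch = bump_epoch  # stands in for the InitProducerId call
        self._broker_supports_bump = broker_supports_bump

    def on_sequence_error(self) -> None:
        # Abortable if the broker supports the new InitProducerId version,
        # fatal otherwise.
        if self._broker_supports_bump:
            self.state = State.ABORTABLE_ERROR
        else:
            self.state = State.FATAL_ERROR

    def abort_transaction(self) -> None:
        if self.state is State.FATAL_ERROR:
            raise RuntimeError("producer must be recreated")
        needs_bump = self.state is State.ABORTABLE_ERROR
        # (Aborting the transaction itself is omitted from this model.)
        if needs_bump:
            self.producer_id, self.epoch = self._bump_epoch(
                self.producer_id, self.epoch
            )
            self.next_seq = 0  # sequences restart with the new epoch
        self.state = State.READY


producer = ProducerModel(7, 0, lambda pid, ep: (pid, ep + 1), True)
producer.next_seq = 5
producer.on_sequence_error()
producer.abort_transaction()  # bumps the epoch and resets sequences
```

The application only calls `abort_transaction`; the epoch bump is hidden inside it, which mirrors the point that no extra call is needed.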
Fixes #3278
Testing
The Python client has problems with this logic (for example, it tries to complete a send after the timeout), so I tested it by hand.
Release notes
- Support safe epoch incrementing for the idempotent/transactional producer in retry cases
Trying to wrap my head around the KIP-360 proposal. We don't need to implement the section on "Prolonged producer state retention" because we already keep that state for 7d since the last write with that pid, right? (Unlike in the Kafka proposal, where they try to keep it up to date with log offsets and hence had to decouple.)
Right, they used to keep the state in sync with the log, but with KIP-360 they let the state outlive log eviction. We already do this, so there is no need to change this part.
Something is failing
Would it be possible to use a different client to write a test explicitly testing KIP-360 support? Maybe we can leverage kcl?
Test for this KIP: https://github.com/redpanda-data/chaos/pull/17 (it uses the Java Kafka client).
tests are failing
Failed to import rptest.tests.transactions_test, which may indicate a broken test that cannot be loaded: NameError: name 'RESTART_LOG_ALLOW_LIST' is not defined
I think this needs unit tests for the changes to tm_transaction (the new version of the structure and its compatibility with other versions), as well as tests for the changes to the serde version of the init_producer_id RPC.
For the new version of the structure, I think the upgrade test covers it. For the RPC, it can be hard...
Failures
- https://github.com/redpanda-data/redpanda/issues/6333
- https://github.com/redpanda-data/redpanda/issues/6328