KIP-360: Improve reliability of idempotent/transactional producer

Open VadimPlh opened this issue 3 years ago • 2 comments

Cover letter

rfc: link

Problem (from link)

A fatal error occurs when the producer cannot assign sequence numbers to the records that it produces. To maintain the correct order, each record produced by an idempotent or transactional producer has a sequence number, and the broker only accepts writes in sequential order, with no gaps or out-of-order records allowed. To assign the correct sequence number, the producer needs to know which requests have been successfully written to the log, which requires a successful response from the broker.
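The sequencing rule above can be sketched with a toy model. This is a minimal illustration, not Redpanda or Kafka code: `ToyPartition`, its `append` method, and the status strings are all hypothetical names, and real brokers track more state (for example, a window of recent batches for duplicate detection).

```python
from dataclasses import dataclass, field


@dataclass
class ToyPartition:
    """Hypothetical stand-in for the broker-side state of one partition."""

    # Next expected sequence number per (producer_id, epoch) pair.
    next_seq: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def append(self, producer_id, epoch, seq, record):
        expected = self.next_seq.get((producer_id, epoch), 0)
        if seq < expected:
            return "DUPLICATE"  # already written, nothing is appended
        if seq > expected:
            return "OUT_OF_ORDER_SEQUENCE"  # a gap, the write is rejected
        self.log.append(record)
        self.next_seq[(producer_id, epoch)] = expected + 1
        return "OK"


partition = ToyPartition()
results = [
    partition.append(1, 0, 0, "a"),  # first record
    partition.append(1, 0, 1, "b"),  # next in sequence
    partition.append(1, 0, 1, "b"),  # retry of a record that was written
    partition.append(1, 0, 3, "d"),  # skips sequence 2
]
print(results)
```

Because the expected sequence is keyed by producer ID and epoch in this sketch, a producer that comes back with a bumped epoch starts again from sequence 0.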

If a produce request fails with a retriable error, the producer will retry it until it either succeeds or hits the configured delivery timeout (delivery.timeout.ms), at which point the records are expired by the producer. If the records expire, the producer can’t be sure whether the records were written or not. When this happens, the producer can’t continue because it doesn’t know what sequence number to assign to the next record.
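A small worked example of that ambiguity, assuming a single in-flight batch holding one record; the function name and the numbers are made up for illustration:

```python
def next_expected_sequence(last_acked_seq, expired_record_was_written):
    """Sequence the broker expects next, depending on the expired record's fate."""
    expired_seq = last_acked_seq + 1
    if expired_record_was_written:
        return expired_seq + 1  # the broker has it, so move past it
    return expired_seq  # the broker never saw it, so reuse the number


last_acked = 4
# The producer cannot tell which branch applies, so both remain possible.
candidates = sorted(
    next_expected_sequence(last_acked, written) for written in (True, False)
)
print(candidates)
```

Picking the lower candidate risks a duplicate and picking the higher one risks a gap, which is why the producer cannot safely continue with the same producer ID and epoch.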

How to solve

KIP-360 solves these problems by providing a way to re-initialize a producer ID without having to create a new producer. In addition to the producer’s transactional ID, InitProducerId now optionally takes a producer ID and producer epoch as well. When these are present in the request, the broker compares them to the existing producer ID and epoch for that transactional ID. If they match, then no other producer has been initialized for that transactional ID, and it is safe for the producer to continue processing. The broker will increment the producer epoch and return it to the producer. When the epoch is bumped, the sequence number is also reset to zero, allowing the producer to continue through both unknown producer and out of sequence errors.
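The comparison the broker performs can be sketched as follows. This is a simplified model under stated assumptions: `ToyCoordinator`, `ProducerState`, and the `"OK"`/`"FENCED"` status strings are hypothetical, and details such as request retries and epoch exhaustion are left out.

```python
from dataclasses import dataclass


@dataclass
class ProducerState:
    producer_id: int
    epoch: int


class ToyCoordinator:
    """Hypothetical transaction coordinator keyed by transactional ID."""

    def __init__(self):
        self.states = {}
        self.next_pid = 1

    def init_producer_id(self, transactional_id, producer_id=None, epoch=None):
        current = self.states.get(transactional_id)
        if current is None:
            # First initialization for this transactional ID.
            current = ProducerState(self.next_pid, 0)
            self.next_pid += 1
            self.states[transactional_id] = current
            return "OK", current.producer_id, current.epoch
        if producer_id is not None:
            # Optional fields are present: they must match the stored state.
            if (producer_id, epoch) != (current.producer_id, current.epoch):
                return "FENCED", None, None
        # Safe to continue: bump the epoch (sequence numbers restart at 0).
        current.epoch += 1
        return "OK", current.producer_id, current.epoch


coordinator = ToyCoordinator()
_, pid, epoch = coordinator.init_producer_id("tx-1")
bumped = coordinator.init_producer_id("tx-1", pid, epoch)
stale = coordinator.init_producer_id("tx-1", pid, epoch)  # epoch is now stale
print(bumped, stale)
```

The second call succeeds because the supplied ID and epoch match the stored state; repeating it with the old epoch is rejected, which is the same check that protects against another producer having taken over the transactional ID.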

With a safe way to bump its epoch, the producer can now recover from a number of previously fatal errors. When the producer encounters one of these errors, and the broker supports the new InitProducerId version, it will transition to an abortable error state, rather than a fatal one. When the application aborts the transaction, the producer will internally call InitProducerId after aborting, which bumps the epoch and allows it to continue. This epoch bump happens transparently as part of the call to KafkaProducer#abortTransaction.
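The client-side recovery flow can be sketched as a small state machine. This is not the real KafkaProducer: `ToyProducer`, its methods, and the state names are hypothetical, and the epoch bump is simulated locally instead of going through an InitProducerId request.

```python
class ToyProducer:
    """Hypothetical client-side state machine for the recovery flow."""

    def __init__(self, broker_supports_epoch_bump=True):
        self.broker_supports_epoch_bump = broker_supports_epoch_bump
        self.state = "READY"
        self.epoch = 0
        self.next_seq = 0

    def send(self):
        if self.state != "READY":
            raise RuntimeError(f"cannot send in state {self.state}")
        self.next_seq += 1

    def on_sequence_error(self):
        # Abortable only if the broker supports the new InitProducerId version.
        if self.broker_supports_epoch_bump:
            self.state = "ABORTABLE_ERROR"
        else:
            self.state = "FATAL_ERROR"

    def abort_transaction(self):
        if self.state == "FATAL_ERROR":
            raise RuntimeError("fatal error: a new producer is required")
        if self.state == "ABORTABLE_ERROR":
            # Stands in for the internal InitProducerId call after the abort.
            self.epoch += 1
            self.next_seq = 0
        self.state = "READY"


producer = ToyProducer()
producer.send()
producer.send()
producer.on_sequence_error()
producer.abort_transaction()
print(producer.state, producer.epoch, producer.next_seq)

legacy = ToyProducer(broker_supports_epoch_bump=False)
legacy.on_sequence_error()
print(legacy.state)
```

In this sketch the application only calls `abort_transaction`; the epoch bump and sequence reset happen inside that call, mirroring the description above.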

Fixes #3278

Testing

The Python client has problems with this logic (it tries to complete the send after the timeout, etc.), so I tested it by hand.

Release notes

  • Support safe epoch incrementing for idempotent/transactional producers in retry cases

VadimPlh avatar Jul 06 '22 14:07 VadimPlh

Trying to wrap my head around the KIP-360 proposal. We don't need to implement the section on Prolonged producer state retention because we already keep that state for 7d since the last write with that pid, right? (Unlike the Kafka proposal, where they tried to keep it up to date with log offsets and hence had to decouple.)

bharathv avatar Jul 07 '22 21:07 bharathv

> Trying to wrap my head around KIP-360 proposal. We don't need to implement the section on Prolonged producer state retention because we already keep that state for 7d since last write with that pid, right? (Unlike in kafka proposal where they try to keep it up to date with log offsets and hence had to decouple).

Right, they used to keep the state in sync with the log, but with KIP-360 they let the state outlive log eviction. We already do that, so there is no need to change this part.

rystsov avatar Jul 12 '22 16:07 rystsov

Something is failing

rystsov avatar Aug 23 '22 06:08 rystsov

Would it be possible to use a different client to write a test that explicitly exercises KIP-360 support? Maybe we can leverage kcl?

mmaslankaprv avatar Aug 24 '22 18:08 mmaslankaprv

> would it be possible to use different client to write a test explicitly testing KiP-360 support ? Maybe we can leverage kcl ?

Test for this KIP: https://github.com/redpanda-data/chaos/pull/17 (it uses the Java Kafka client).

VadimPlh avatar Aug 30 '22 10:08 VadimPlh

tests are failing

Failed to import rptest.tests.transactions_test, which may indicate a broken test that cannot be loaded: NameError: name 'RESTART_LOG_ALLOW_LIST' is not defined

rystsov avatar Aug 30 '22 14:08 rystsov

I think this needs unit tests for the changes to tm_transaction (the new version of the structure and its compatibility with other versions), as well as tests for the changes to the serde version of the init_producer_id rpc.

For the new version of the structure, I think the upgrade test covers it. For the rpc, it can be hard...

VadimPlh avatar Sep 07 '22 13:09 VadimPlh

Failures

  • https://github.com/redpanda-data/redpanda/issues/6333
  • https://github.com/redpanda-data/redpanda/issues/6328

VadimPlh avatar Sep 09 '22 10:09 VadimPlh