Review each unknown_server_error in the transactional area (tx_gateway_frontend, rm_stm)

Open rystsov opened this issue 3 years ago • 2 comments

unknown_server_error is a fatal error (an app is required to recreate a producer) and we should return it only if it's the only way to handle the situation.

rystsov avatar Oct 27 '22 17:10 rystsov

We may make an operation idempotent and retry it until it succeeds or times out (see begin_tx, commit_tx). When the tx coordinator fails, we may look for a new coordinator and redirect the request to it, passing along the internal id info (tx_seq) so it can dedupe the request.
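A minimal sketch of the retry-plus-dedupe idea, not the actual Redpanda code. `ToyCoordinator`, `TransientError`, and `retry_until_timeout` are hypothetical names made up for illustration; only `commit_tx` and `tx_seq` come from the discussion above, and the real signatures may differ.

```python
import time


class TransientError(Exception):
    """Hypothetical retryable failure, e.g. a lost reply or a leadership change."""


class ToyCoordinator:
    """Toy stand-in for a tx coordinator that dedupes requests by tx_seq."""

    def __init__(self, lose_first_reply=False):
        self.applied = {}       # tx_seq -> result of the already-applied request
        self.apply_count = 0    # how many times a commit was actually applied
        self._lose_reply = lose_first_reply

    def commit_tx(self, tx_seq):
        # A repeated tx_seq means a retry: return the stored result
        # instead of applying the commit a second time.
        if tx_seq in self.applied:
            return self.applied[tx_seq]
        self.apply_count += 1
        self.applied[tx_seq] = f"committed:{tx_seq}"
        if self._lose_reply:
            # Simulate the commit being applied but the reply getting lost.
            self._lose_reply = False
            raise TransientError("reply lost")
        return self.applied[tx_seq]


def retry_until_timeout(op, timeout_s=1.0, backoff_s=0.01):
    """Retry op on TransientError until it succeeds or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return op()
        except TransientError:
            if time.monotonic() + backoff_s >= deadline:
                raise TimeoutError("operation did not complete in time")
            time.sleep(backoff_s)


coordinator = ToyCoordinator(lose_first_reply=True)
result = retry_until_timeout(lambda: coordinator.commit_tx(tx_seq=7))
```

Because the retry carries the same `tx_seq`, the second attempt is recognized as a duplicate, so the caller gets a normal result instead of a fatal error and the commit is applied only once.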

rystsov avatar Oct 27 '22 17:10 rystsov

Good news: I added chaos tests (so far in a private branch - https://github.com/rystsov/chaos/tree/unknown_server_error) that fail on transient unknown server errors, so we have an easy way to reproduce the issue.

rystsov avatar Nov 04 '22 17:11 rystsov

/backport v22.3.x

rystsov avatar Dec 11 '22 00:12 rystsov