TransactionClient `gc` function often enters error loop during CleanupLocks
I have a program that runs periodically and makes a single call to the TransactionClient `gc` function. About half of the time it enters a loop that prints thousands of the logs shown here and eventually runs out of memory (OOM). Within one loop it checks many ranges/regions for the same key; the key is different each time the loop occurs. These are INFO logs originating from client-rust-da362376b56921db/1fa846b/src/request/plan.rs:686.
CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE A START>, <REDACTED RANGE A END>) for region 116477
CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE B START>, <REDACTED RANGE B END>) for region 116288
CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE C START>, <REDACTED RANGE C END>) for region 970120
CleanupLocks::execute, inner region error:key <REDACTED KEY> is not in region key range [<REDACTED RANGE D START>, <REDACTED RANGE D END>) for region 969086
...
Could you share your program, or a snippet that reproduces this issue, along with the setup of your TiKV cluster?
I'm afraid GC is not stable yet (#180). After a quick look at the code (I haven't reviewed this part for years), one possible problem to investigate: the scan_lock request might be passed directly to cleanup_locks without proper region setup (e.g. via something like retry_multi_region).
https://github.com/tikv/client-rust/blob/59f13b57005df508ad7a0d81126e088003d7fce8/src/transaction/client.rs#L264-L271
This change was introduced in PR #378, which added async commit lock resolution support. As a workaround, you might want to temporarily disable async commit and use the pre-PR GC implementation until this issue is resolved.
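To illustrate what "proper region setup" could mean here, the following is a minimal, self-contained sketch of grouping keys by the region whose range contains them before any per-region request is sent. It is not the client-rust implementation: the `Region` struct and `group_keys_by_region` function are hypothetical names made up for this example, and real region lookup goes through PD rather than an in-memory list.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified region descriptor (not the client-rust type).
#[derive(Debug, Clone)]
struct Region {
    id: u64,
    start: Vec<u8>, // inclusive start key
    end: Vec<u8>,   // exclusive end key; empty means "unbounded"
}

impl Region {
    /// Returns true if `key` falls inside [start, end).
    fn contains(&self, key: &[u8]) -> bool {
        key >= self.start.as_slice() && (self.end.is_empty() || key < self.end.as_slice())
    }
}

/// Groups keys by the id of the region that contains them, so each
/// per-region request only carries keys inside that region's range.
/// Keys that match no known region are skipped in this sketch.
fn group_keys_by_region(regions: &[Region], keys: &[Vec<u8>]) -> BTreeMap<u64, Vec<Vec<u8>>> {
    let mut groups: BTreeMap<u64, Vec<Vec<u8>>> = BTreeMap::new();
    for key in keys {
        if let Some(region) = regions.iter().find(|r| r.contains(key)) {
            groups.entry(region.id).or_default().push(key.clone());
        }
    }
    groups
}

fn main() {
    // Two made-up regions splitting the keyspace at "m".
    let regions = vec![
        Region { id: 1, start: b"".to_vec(), end: b"m".to_vec() },
        Region { id: 2, start: b"m".to_vec(), end: b"".to_vec() },
    ];
    let keys = vec![b"apple".to_vec(), b"zebra".to_vec(), b"mango".to_vec()];

    for (region_id, keys) in group_keys_by_region(&regions, &keys) {
        println!("region {region_id}: {} key(s)", keys.len());
    }
}
```

If a key were sent to a region outside its range, the store would be expected to reject it with an error like the "key is not in region key range" message in the logs, which is why the grouping step matters.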