[improve] decommissionBookie always waits too long after ledger re-replication has completed
improve
When decommissioning a bookie, we always switch it to read-only first and then wait for most of its data to expire. Some ledgers (roughly 100 to 300) are always left behind, each holding only a little data. When we then run bin/bookkeeper shell decommissionbookie -bookieid to decommission the bookie, the command hangs for about 10 minutes without printing any log, even though the znode /ledgers/underreplication/ledgers is emptied within a few seconds, which indicates that the ledgers have already been re-replicated.
To Reproduce
Steps to reproduce the behavior:
- Change the bookie status to read-only.
- Wait for most of the ledgers to expire.
- Stop the bookie and run bin/bookkeeper shell decommissionbookie -bookieid
- Observe that the command waits about 10 minutes even though the remaining ledgers, which hold little data, have already been re-replicated. After the log line "Count of Ledgers which need to be rereplicated:" is printed, nothing else is printed during those 10 minutes.
Expected behavior
The wait should not be this long, and the command should report what it is doing while it waits.
I think the wait is related to https://github.com/apache/bookkeeper/blob/eadbdd4b6bfeef9924a3ff2c59fc3718cf3dc06b/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/BookKeeperAdmin.java#L1623
You can make this time configurable via the decommission command flag / add logging.
@dlg99
Yes, the wait time is determined by maxSleepTimeInBetweenChecks. If we add a command parameter to change maxSleepTimeInBetweenChecks, users may not be able to predict how long re-replication will take to complete. Setting it too small is risky: if the auditor is busy with a time-consuming operation such as checkAllLedger and cannot audit the bookie immediately, areEntriesOfLedgerStoredInTheBookie will be checked through ZooKeeper too frequently, which can hurt ZooKeeper server performance.
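To illustrate why a few hundred leftover ledgers would always hit the full wait, here is a minimal sketch of a per-check sleep that grows with the number of remaining ledgers and is capped by a maximum. The concrete values (10 seconds per ledger, 10-minute cap) and the class and method names are assumptions made for illustration; they are not copied from BookKeeperAdmin.

```java
public final class DecommissionSleepSketch {
    // Assumed values for illustration only: 10 s per remaining ledger,
    // capped at 10 min between checks.
    static final long SLEEP_PER_LEDGER_MS = 10_000L;
    static final long MAX_SLEEP_BETWEEN_CHECKS_MS = 10 * 60 * 1000L;

    // Sleep time before the next check: proportional to the number of
    // ledgers still waiting, but never more than maxSleepMs.
    static long sleepForThisCheckMs(int remainingLedgers, long maxSleepMs) {
        return Math.min((long) remainingLedgers * SLEEP_PER_LEDGER_MS, maxSleepMs);
    }

    public static void main(String[] args) {
        for (int n : new int[] {5, 60, 100, 300}) {
            long sleepMs = sleepForThisCheckMs(n, MAX_SLEEP_BETWEEN_CHECKS_MS);
            System.out.println(n + " ledgers -> " + (sleepMs / 1000) + " s");
        }
    }
}
```

Under these assumed values, any count of 60 or more ledgers sleeps for the full 10 minutes before the first re-check, regardless of how little data the ledgers hold.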
PR #3339 checks whether /ledgers/underreplication/ledgers and /ledgers/underreplication/locks are empty to determine whether re-replication has completed, and backs off when the auditor is running CheckAllLedgers or another time-consuming operation.
Could you help review the PR? Thanks.
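The general idea can be sketched as a polling loop that returns as soon as the under-replication state is empty, and doubles its delay while the auditor is busy. This is not the code from the PR: the two suppliers are hypothetical stand-ins for the ZooKeeper checks, and every name and parameter below is an assumption.

```java
import java.util.function.BooleanSupplier;

public final class ReplicationWaitSketch {
    // Polls until underReplicationEmpty reports true.
    // Returns the number of polls made, or -1 if maxPolls was exhausted
    // or the thread was interrupted.
    static int waitUntilDone(BooleanSupplier underReplicationEmpty,
                             BooleanSupplier auditorBusy,
                             long initialDelayMs, long maxDelayMs, int maxPolls) {
        long delay = initialDelayMs;
        for (int poll = 1; poll <= maxPolls; poll++) {
            if (underReplicationEmpty.getAsBoolean()) {
                return poll; // nothing left to re-replicate
            }
            if (auditorBusy.getAsBoolean()) {
                // Back off so a busy auditor does not cause frequent checks.
                delay = Math.min(delay * 2, maxDelayMs);
            } else {
                delay = initialDelayMs;
            }
            // Log on every poll so the operator can see what is happening.
            System.out.println("poll " + poll + ": re-replication pending, sleeping " + delay + " ms");
            try {
                Thread.sleep(delay);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
        return -1;
    }
}
```

Logging on every poll addresses the "no log printed" part of the report, while the capped doubling keeps the check frequency bounded when the auditor cannot respond quickly.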