bookkeeper
Auditor exits silently on ZK timeout
BUG REPORT
Describe the bug
Autorecovery running standalone:
2022-06-14T05:14:25,461 [main] INFO org.apache.bookkeeper.client.BookKeeperAdmin - Resetting LostBookieRecoveryDelay value: 0, to kickstart audit task
2022-06-14T05:14:25,461 [main] DEBUG org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - setLostBookieRecoveryDelay()
2022-06-14T05:14:25,612 [main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x3000418b60b0047 closed
2022-06-14T05:14:25,612 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3000418b60b0047
2022-06-14T05:14:25,612 [main] INFO org.apache.bookkeeper.meta.ZkLedgerAuditorManager - Shutting down AuditorElector
This happens after loss of ZK connectivity.
Eventually this can lead to a situation where there is no Auditor left in the cluster.
Expected behavior
An Auditor shutdown should either restart the thread and attempt to reconnect to ZK if needed, or shut down the AR service (or fail its health check) so that k8s has a chance to restart the service.
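
The health-check option could be sketched roughly as follows. This is only an illustration: `AutoRecoveryLivenessSketch` and `statusCode` are hypothetical names, not BookKeeper APIs, and the two suppliers stand in for whatever "is it still running" checks the Auditor and the replication worker expose.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical liveness check: healthy only while both watched components report running. */
final class AutoRecoveryLivenessSketch {
    private final BooleanSupplier auditorRunning;
    private final BooleanSupplier workerRunning;

    AutoRecoveryLivenessSketch(BooleanSupplier auditorRunning, BooleanSupplier workerRunning) {
        this.auditorRunning = auditorRunning;
        this.workerRunning = workerRunning;
    }

    /** HTTP-style status for a probe endpoint: 200 while both run, 503 otherwise. */
    int statusCode() {
        return (auditorRunning.getAsBoolean() && workerRunning.getAsBoolean()) ? 200 : 503;
    }

    public static void main(String[] args) {
        // Assumption: these flags are flipped by the real components when they stop.
        AtomicBoolean auditorUp = new AtomicBoolean(true);
        AtomicBoolean workerUp = new AtomicBoolean(true);
        AutoRecoveryLivenessSketch check =
                new AutoRecoveryLivenessSketch(auditorUp::get, workerUp::get);

        System.out.println("both running: " + check.statusCode());
        auditorUp.set(false); // simulate the Auditor exiting after a ZK timeout
        System.out.println("auditor stopped: " + check.statusCode());
    }
}
```

With a check like this wired to a k8s liveness probe, a silently stopped Auditor would surface as a failing probe instead of a pod that looks healthy but does no auditing.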
@dlg99 according to this code snippet, the DeathWatcher should be able to catch this case and shut down the whole autorecovery, right? https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/AutoRecoveryMain.java#L213-L240
            setUncaughtExceptionHandler((thread, cause) -> {
                LOG.info("AutoRecoveryDeathWatcher exited loop due to uncaught exception from thread {}",
                    thread.getName(), cause);
                shutdown();
            });
        }

        @Override
        public void run() {
            while (true) {
                try {
                    Thread.sleep(watchInterval);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
                // If any one service not running, then shutdown peer.
                if (!autoRecoveryMain.auditorElector.isRunning() || !autoRecoveryMain.replicationWorker.isRunning()) {
                    LOG.info(
                            "AutoRecoveryDeathWatcher noticed the AutoRecovery is not running any more,"
                            + "exiting the watch loop!");
                    /*
                     * death watcher has noticed that AutoRecovery is not
                     * running any more throw an exception to fail the death
                     * watcher thread and it will trigger the uncaught exception
                     * handler to handle this "AutoRecovery not running"
                     * situation.
                     */
                    throw new RuntimeException("AutoRecovery is not running any more");
                }
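
For reference, the pattern the quoted snippet relies on can be reproduced in isolation. The sketch below is a standalone illustration, not the BookKeeper code: `DeathWatcherSketch`, `simulateAuditorStop`, and the `servicesRunning` supplier are hypothetical names. It shows that throwing from `run()` does invoke the thread's uncaught exception handler, so the shutdown path only fires if the polled check actually turns false. Whether `auditorElector.isRunning()` turns false in the ZK timeout case reported here is not something this sketch can establish.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Standalone illustration of the death-watcher pattern (hypothetical names throughout). */
final class DeathWatcherSketch extends Thread {
    private final long watchIntervalMs;
    private final BooleanSupplier servicesRunning;

    DeathWatcherSketch(long watchIntervalMs, BooleanSupplier servicesRunning, Runnable onDeath) {
        super("DeathWatcherSketch");
        this.watchIntervalMs = watchIntervalMs;
        this.servicesRunning = servicesRunning;
        setDaemon(true);
        // Invoked on this thread once run() terminates with an uncaught exception.
        setUncaughtExceptionHandler((thread, cause) -> onDeath.run());
    }

    @Override
    public void run() {
        while (true) {
            try {
                Thread.sleep(watchIntervalMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return; // this sketch simply stops watching when interrupted
            }
            if (!servicesRunning.getAsBoolean()) {
                // Failing the thread is what hands control to the handler above.
                throw new RuntimeException("watched services are not running any more");
            }
        }
    }

    /** Simulates a watched service stopping; returns true if the shutdown callback ran. */
    static boolean simulateAuditorStop() {
        AtomicBoolean running = new AtomicBoolean(true);
        CountDownLatch shutdownCalled = new CountDownLatch(1);
        new DeathWatcherSketch(10, running::get, shutdownCalled::countDown).start();
        running.set(false); // stand-in for the Auditor exiting
        try {
            return shutdownCalled.await(2, TimeUnit.SECONDS);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("shutdown callback fired: " + simulateAuditorStop());
    }
}
```

If the callback fires here but the real service still exits silently, the likely place to look is the running-state check rather than the watcher loop itself.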