
Auditor exits silently on ZK timeout

Open dlg99 opened this issue 3 years ago • 1 comment

BUG REPORT

Describe the bug

With Autorecovery running standalone, the Auditor shuts down after losing ZK connectivity, and the only trace in the logs is:

2022-06-14T05:14:25,461 [main] INFO  org.apache.bookkeeper.client.BookKeeperAdmin - Resetting LostBookieRecoveryDelay value: 0, to kickstart audit task
2022-06-14T05:14:25,461 [main] DEBUG org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - setLostBookieRecoveryDelay()
2022-06-14T05:14:25,612 [main] INFO  org.apache.zookeeper.ZooKeeper - Session: 0x3000418b60b0047 closed
2022-06-14T05:14:25,612 [main-EventThread] INFO  org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x3000418b60b0047
2022-06-14T05:14:25,612 [main] INFO  org.apache.bookkeeper.meta.ZkLedgerAuditorManager - Shutting down AuditorElector

Eventually this can leave the cluster with no Auditor at all.

Expected behavior

Auditor shutdown should either restart the thread and, if needed, attempt to reconnect to ZK, or trigger the AR service's shutdown (or fail its health check) so that k8s has a chance to restart the service.
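
To make the health-check option concrete, here is a minimal sketch, not BookKeeper's actual code: a watchdog polls each component and flips a flag that a liveness endpoint could report. `Component`, `HealthWatchdog`, `checkOnce()` and `isHealthy()` are hypothetical names introduced only for illustration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for anything exposing isRunning(), e.g. an auditor elector.
interface Component {
    boolean isRunning();
}

// Hypothetical watchdog: marks the service unhealthy once any component stops.
class HealthWatchdog {
    private final List<Component> components;
    private final AtomicBoolean healthy = new AtomicBoolean(true);

    HealthWatchdog(List<Component> components) {
        this.components = components;
    }

    // Runs one check; meant to be called periodically, e.g. from a scheduled executor.
    void checkOnce() {
        for (Component c : components) {
            if (!c.isRunning()) {
                healthy.set(false); // stays false until the process is restarted
                return;
            }
        }
    }

    // A liveness probe handler would report failure while this returns false.
    boolean isHealthy() {
        return healthy.get();
    }
}
```

The flag is deliberately one-way in this sketch: once a component has stopped, the service keeps reporting unhealthy so the orchestrator restarts it instead of leaving it running without an Auditor.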

dlg99, Jun 21 '22 21:06

@dlg99 according to this code snippet, the DeathWatcher should be able to catch this case and shut down the whole autorecovery, right? https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/replication/AutoRecoveryMain.java#L213-L240

            setUncaughtExceptionHandler((thread, cause) -> {
                LOG.info("AutoRecoveryDeathWatcher exited loop due to uncaught exception from thread {}",
                    thread.getName(), cause);
                shutdown();
            });
        }

        @Override
        public void run() {
            while (true) {
                try {
                    Thread.sleep(watchInterval);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
                // If any one service not running, then shutdown peer.
                if (!autoRecoveryMain.auditorElector.isRunning() || !autoRecoveryMain.replicationWorker.isRunning()) {
                    LOG.info(
                            "AutoRecoveryDeathWatcher noticed the AutoRecovery is not running any more,"
                            + "exiting the watch loop!");
                    /*
                     * death watcher has noticed that AutoRecovery is not
                     * running any more throw an exception to fail the death
                     * watcher thread and it will trigger the uncaught exception
                     * handler to handle this "AutoRecovery not running"
                     * situation.
                     */
                    throw new RuntimeException("AutoRecovery is not running any more");
                }
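
For anyone who wants to try the mechanism in isolation, the following is a simplified, self-contained sketch of the same pattern: the watcher thread throws when a watched service stops, and its uncaught exception handler runs the shutdown callback. `DeathWatcherSketch`, `allRunning` and `shutdown` are hypothetical names, not the BookKeeper class. Unlike the excerpt above, this version returns when interrupted, which is a simplification of mine.

```java
import java.util.function.BooleanSupplier;

// Hypothetical, simplified death watcher; not the BookKeeper implementation.
class DeathWatcherSketch extends Thread {
    private final BooleanSupplier allRunning;
    private final long watchIntervalMs;

    DeathWatcherSketch(BooleanSupplier allRunning, long watchIntervalMs, Runnable shutdown) {
        super("DeathWatcherSketch");
        this.allRunning = allRunning;
        this.watchIntervalMs = watchIntervalMs;
        setDaemon(true);
        // Invoked on the watcher thread once run() exits with an exception.
        setUncaughtExceptionHandler((thread, cause) -> shutdown.run());
    }

    @Override
    public void run() {
        while (true) {
            try {
                Thread.sleep(watchIntervalMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return; // simplification: stop watching when interrupted
            }
            if (!allRunning.getAsBoolean()) {
                // Escapes run() and reaches the uncaught exception handler above.
                throw new RuntimeException("a watched service is not running any more");
            }
        }
    }
}
```

The sketch only fires if `allRunning` actually turns false, so in the reported scenario the question becomes whether `auditorElector.isRunning()` reports false after the ZK session is closed.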

MarvinCai, Sep 10 '22 15:09