Exception while replaying journals, shutting down
We're running Apache Pulsar with bookkeeper on a Kubernetes cluster and we have bookies (2/3) failing to restart with the following exception: 09:49:25.511 [main] ERROR org.apache.bookkeeper.bookie.Bookie - Exception while replaying journals, shutting down java.io.IOException: org.apache.bookkeeper.bookie.BookieException$OperationRejectedException at org.apache.bookkeeper.bookie.Bookie$4.process(Bookie.java:895) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.bookie.Journal.scanJournal(Journal.java:821) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.bookie.Journal.replay(Journal.java:866) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.bookie.Bookie.readJournal(Bookie.java:901) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.bookie.Bookie.start(Bookie.java:922) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.proto.BookieServer.start(BookieServer.java:141) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.server.service.BookieService.doStart(BookieService.java:58) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:78) ~[org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2] at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$start$2(LifecycleComponentStack.java:113) ~[org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2] at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:407) [com.google.guava-guava-25.1-jre.jar:?] at org.apache.bookkeeper.common.component.LifecycleComponentStack.start(LifecycleComponentStack.java:113) [org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2] at org.apache.bookkeeper.common.component.ComponentStarter.startComponent(ComponentStarter.java:80) [org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2] at org.apache.bookkeeper.server.Main.doMain(Main.java:229) [org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.server.Main.main(Main.java:203) [org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.proto.BookieServer.main(BookieServer.java:314) [org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] Caused by: org.apache.bookkeeper.bookie.BookieException$OperationRejectedException
The exception is re-thrown so not sure of root cause but OperationRejectedException is thrown from only one place in version 4.9.
@andrekramer1 operations are rejected when ledger storage is not able to flush within 10 seconds.
Can you provide more information about your environment?
- Where do you run your kubernetes cluster? Is it on the cloud or on-premise?
- What kind of disks are you using? What is the write bandwidth?
This was on a cluster of 3 Ubuntu VMs on a virtualized server. The ledger storage was a local disk mount using a k8s storage claim and the disk appeared to be accessible but the bookie repeatedly failed to start. We were unable to recover from it and deleted the bookie node.
We are experiencing same issue where couple of nodes are not able to start any journal thread. The start up fails with an error "Exception while replaying journals, shutting down".
We are running the service on AWS instances with journal using gp2 volume and ledger using st1 volume. These are dedicated volumes for journal and ledger. OS is on it's own volume.
Any way to recover data from these nodes.
This is the exception that we see. The bookkeeper cluster is part of apache pulsar deployment.
21:24:25.285 [main] ERROR org.apache.bookkeeper.bookie.Bookie - Exception while replaying journals, shutting down java.io.IOException: org.apache.bookkeeper.bookie.BookieException$OperationRejectedException at org.apache.bookkeeper.bookie.Bookie$4.process(Bookie.java:895) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.bookie.Journal.scanJournal(Journal.java:821) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.bookie.Journal.replay(Journal.java:866) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.bookie.Bookie.readJournal(Bookie.java:901) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.bookie.Bookie.start(Bookie.java:922) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.proto.BookieServer.start(BookieServer.java:141) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.server.service.BookieService.doStart(BookieService.java:58) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:78) ~[org.apache.bookkeeper-bookkeeper-common-4.9.0.jar:4.9.0] at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$start$2(LifecycleComponentStack.java:113) ~[org.apache.bookkeeper-bookkeeper-common-4.9.0.jar:4.9.0] at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:408) [com.google.guava-guava-21.0.jar:?] at org.apache.bookkeeper.common.component.LifecycleComponentStack.start(LifecycleComponentStack.java:113) [org.apache.bookkeeper-bookkeeper-common-4.9.0.jar:4.9.0] at org.apache.bookkeeper.common.component.ComponentStarter.startComponent(ComponentStarter.java:80) [org.apache.bookkeeper-bookkeeper-common-4.9.0.jar:4.9.0] at org.apache.bookkeeper.server.Main.doMain(Main.java:229) [org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.server.Main.main(Main.java:203) [org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.proto.BookieServer.main(BookieServer.java:314) [org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] Caused by: org.apache.bookkeeper.bookie.BookieException$OperationRejectedException
Also seeing something similar in Bookkeeper 4.10 (part of Pulsar 2.6.1)
19:46:57.746 [main] ERROR org.apache.bookkeeper.bookie.Journal - Invalid record found with negative length: -91356163
19:46:57.746 [main] ERROR org.apache.bookkeeper.bookie.Bookie - Exception while replaying journals, shutting down
java.io.IOException: Invalid record found with negative length -91356163
@frankjkelly the error seems pretty different Can you please open a new ticket?
It's OK I figured out a way around my issue using bookkeeper shell bookieformat -deletecookie -f on the bookkeeper pod experiencing the problem and then restarted the node and the problem went away
Hi,
I had this issue today and found out the bookie journal is using all my EBS IOPS burst credit. My default journal size is 50GB and I have an above 50k/s publishing rate in the persistent topic.
The issue went away after the disk resized to 200GB with 600 IOPS base performance.
Thanks @tuapuikia - yeah when we increased our EBS disks to GP3 and properly sized our write caches we did not see this any more. We made a number of tweaks - hard to tell which caused the issue to be resolved.
@frankjkelly Doesn't that command wipe the bookie?
this will occur in when you shutdown bk exceptionally, maybe oom or kill -9 ,
10:15:55.026 [main] ERROR org.apache.bookkeeper.bookie.Bookie - Exception while replaying journals, shutting down
java.io.IOException: Invalid record found with negative length -448299468
at org.apache.bookkeeper.bookie.Journal.scanJournal(Journal.java:821) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.Bookie.replay(Bookie.java:945) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.Bookie.readJournal(Bookie.java:911) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.Bookie.start(Bookie.java:965) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.proto.BookieServer.start(BookieServer.java:156) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.server.service.BookieService.doStart(BookieService.java:68) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:83) ~[org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$start$4(LifecycleComponentStack.java:144) ~[org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406) [com.google.guava-guava-30.1-jre.jar:?]
at org.apache.bookkeeper.common.component.LifecycleComponentStack.start(LifecycleComponentStack.java:144) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.common.component.ComponentStarter.startComponent(ComponentStarter.java:85) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.server.Main.doMain(Main.java:234) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.server.Main.main(Main.java:208) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
10:15:55.134 [main] INFO org.apache.zookeeper.ZooKeeper - Session: 0x200064a14681be2 closed
force shutdown is not a better way we should try our best to recover data
but this scene will occur sometimes , maybe bookkeeper should provide a parameter to startup with skip this info.
Bump