bookkeeper icon indicating copy to clipboard operation
bookkeeper copied to clipboard

Exception while replaying journals, shutting down

Open andrekramer1 opened this issue 6 years ago • 11 comments

We're running Apache Pulsar with bookkeeper on a Kubernetes cluster and we have bookies (2/3) failing to restart with the following exception: 09:49:25.511 [main] ERROR org.apache.bookkeeper.bookie.Bookie - Exception while replaying journals, shutting down java.io.IOException: org.apache.bookkeeper.bookie.BookieException$OperationRejectedException at org.apache.bookkeeper.bookie.Bookie$4.process(Bookie.java:895) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.bookie.Journal.scanJournal(Journal.java:821) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.bookie.Journal.replay(Journal.java:866) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.bookie.Bookie.readJournal(Bookie.java:901) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.bookie.Bookie.start(Bookie.java:922) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.proto.BookieServer.start(BookieServer.java:141) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.server.service.BookieService.doStart(BookieService.java:58) ~[org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:78) ~[org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2] at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$start$2(LifecycleComponentStack.java:113) ~[org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2] at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:407) [com.google.guava-guava-25.1-jre.jar:?] at org.apache.bookkeeper.common.component.LifecycleComponentStack.start(LifecycleComponentStack.java:113) [org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2] at org.apache.bookkeeper.common.component.ComponentStarter.startComponent(ComponentStarter.java:80) [org.apache.bookkeeper-bookkeeper-common-4.9.2.jar:4.9.2] at org.apache.bookkeeper.server.Main.doMain(Main.java:229) [org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.server.Main.main(Main.java:203) [org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] at org.apache.bookkeeper.proto.BookieServer.main(BookieServer.java:314) [org.apache.bookkeeper-bookkeeper-server-4.9.2.jar:4.9.2] Caused by: org.apache.bookkeeper.bookie.BookieException$OperationRejectedException

The exception is re-thrown so not sure of root cause but OperationRejectedException is thrown from only one place in version 4.9.

andrekramer1 avatar Dec 18 '19 10:12 andrekramer1

@andrekramer1 operations are rejected when ledger storage is not able to flush within 10 seconds.

Can you provide more information about your environment?

  • Where do you run your kubernetes cluster? Is it on the cloud or on-premise?
  • What kind of disks are you using? What is the write bandwidth?

sijie avatar Jan 26 '20 19:01 sijie

This was on a cluster of 3 Ubuntu VMs on a virtualized server. The ledger storage was a local disk mount using a k8s storage claim and the disk appeared to be accessible but the bookie repeatedly failed to start. We were unable to recover from it and deleted the bookie node.

andrekramer1 avatar Jan 27 '20 09:01 andrekramer1

We are experiencing same issue where couple of nodes are not able to start any journal thread. The start up fails with an error "Exception while replaying journals, shutting down".

We are running the service on AWS instances with journal using gp2 volume and ledger using st1 volume. These are dedicated volumes for journal and ledger. OS is on it's own volume.

Any way to recover data from these nodes.

pushkar-engagio avatar Feb 15 '20 22:02 pushkar-engagio

This is the exception that we see. The bookkeeper cluster is part of apache pulsar deployment.

21:24:25.285 [main] ERROR org.apache.bookkeeper.bookie.Bookie - Exception while replaying journals, shutting down java.io.IOException: org.apache.bookkeeper.bookie.BookieException$OperationRejectedException at org.apache.bookkeeper.bookie.Bookie$4.process(Bookie.java:895) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.bookie.Journal.scanJournal(Journal.java:821) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.bookie.Journal.replay(Journal.java:866) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.bookie.Bookie.readJournal(Bookie.java:901) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.bookie.Bookie.start(Bookie.java:922) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.proto.BookieServer.start(BookieServer.java:141) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.server.service.BookieService.doStart(BookieService.java:58) ~[org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:78) ~[org.apache.bookkeeper-bookkeeper-common-4.9.0.jar:4.9.0] at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$start$2(LifecycleComponentStack.java:113) ~[org.apache.bookkeeper-bookkeeper-common-4.9.0.jar:4.9.0] at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:408) [com.google.guava-guava-21.0.jar:?] at org.apache.bookkeeper.common.component.LifecycleComponentStack.start(LifecycleComponentStack.java:113) [org.apache.bookkeeper-bookkeeper-common-4.9.0.jar:4.9.0] at org.apache.bookkeeper.common.component.ComponentStarter.startComponent(ComponentStarter.java:80) [org.apache.bookkeeper-bookkeeper-common-4.9.0.jar:4.9.0] at org.apache.bookkeeper.server.Main.doMain(Main.java:229) [org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.server.Main.main(Main.java:203) [org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] at org.apache.bookkeeper.proto.BookieServer.main(BookieServer.java:314) [org.apache.bookkeeper-bookkeeper-server-4.9.0.jar:4.9.0] Caused by: org.apache.bookkeeper.bookie.BookieException$OperationRejectedException

pushkar-engagio avatar Feb 15 '20 22:02 pushkar-engagio

Also seeing something similar in Bookkeeper 4.10 (part of Pulsar 2.6.1)

19:46:57.746 [main] ERROR org.apache.bookkeeper.bookie.Journal - Invalid record found with negative length: -91356163
19:46:57.746 [main] ERROR org.apache.bookkeeper.bookie.Bookie - Exception while replaying journals, shutting down
java.io.IOException: Invalid record found with negative length -91356163

frankjkelly avatar Dec 08 '20 21:12 frankjkelly

@frankjkelly the error seems pretty different Can you please open a new ticket?

eolivelli avatar Dec 08 '20 21:12 eolivelli

It's OK I figured out a way around my issue using bookkeeper shell bookieformat -deletecookie -f on the bookkeeper pod experiencing the problem and then restarted the node and the problem went away

frankjkelly avatar Dec 08 '20 21:12 frankjkelly

Hi,

I had this issue today and found out the bookie journal is using all my EBS IOPS burst credit. My default journal size is 50GB and I have an above 50k/s publishing rate in the persistent topic.

The issue went away after the disk resized to 200GB with 600 IOPS base performance.

tuapuikia avatar Jun 14 '21 13:06 tuapuikia

Thanks @tuapuikia - yeah when we increased our EBS disks to GP3 and properly sized our write caches we did not see this any more. We made a number of tweaks - hard to tell which caused the issue to be resolved.

frankjkelly avatar Jun 14 '21 13:06 frankjkelly

@frankjkelly Doesn't that command wipe the bookie?

devinbost avatar Jul 15 '21 17:07 devinbost

this will occur in when you shutdown bk exceptionally, maybe oom or kill -9 ,


10:15:55.026 [main] ERROR org.apache.bookkeeper.bookie.Bookie - Exception while replaying journals, shutting down
java.io.IOException: Invalid record found with negative length -448299468
        at org.apache.bookkeeper.bookie.Journal.scanJournal(Journal.java:821) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.bookie.Bookie.replay(Bookie.java:945) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.bookie.Bookie.readJournal(Bookie.java:911) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.bookie.Bookie.start(Bookie.java:965) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.proto.BookieServer.start(BookieServer.java:156) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.server.service.BookieService.doStart(BookieService.java:68) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:83) ~[org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.common.component.LifecycleComponentStack.lambda$start$4(LifecycleComponentStack.java:144) ~[org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
        at com.google.common.collect.ImmutableList.forEach(ImmutableList.java:406) [com.google.guava-guava-30.1-jre.jar:?]
        at org.apache.bookkeeper.common.component.LifecycleComponentStack.start(LifecycleComponentStack.java:144) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.common.component.ComponentStarter.startComponent(ComponentStarter.java:85) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.server.Main.doMain(Main.java:234) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.server.Main.main(Main.java:208) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
10:15:55.134 [main] INFO  org.apache.zookeeper.ZooKeeper - Session: 0x200064a14681be2 closed

image

force shutdown is not a better way we should try our best to recover data

but this scene will occur sometimes , maybe bookkeeper should provide a parameter to startup with skip this info.

leizhiyuan avatar Aug 01 '22 06:08 leizhiyuan

Bump

frankjkelly avatar May 02 '23 12:05 frankjkelly