pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[fix][Offload] fix indexEntries NullPointerException error

Open graysonzeng opened this issue 1 year ago • 4 comments

Motivation

Offloader exception occurs

21:54:32.360 [offloader-OrderedScheduler-14-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Unknown exception for ManagedLedgerException.
org.apache.bookkeeper.mledger.ManagedLedgerException$OffloadReadHandleClosedException: Offload read handle already closed
21:54:32.424 [offloader-OrderedScheduler-14-0] WARN  org.apache.bookkeeper.common.util.SingleThreadSafeScheduledExecutorService - Unexpected throwable from task class org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl$$Lambda$2013/0x00007fdd98b45d08: Cannot invoke "java.util.TreeMap.clear()" because "this.indexEntries" is null
java.lang.NullPointerException: Cannot invoke "java.util.TreeMap.clear()" because "this.indexEntries" is null
	at org.apache.bookkeeper.mledger.offload.jcloud.impl.OffloadIndexBlockImpl.recycle(OffloadIndexBlockImpl.java:97) ~[?:?]
	at org.apache.bookkeeper.mledger.offload.jcloud.impl.OffloadIndexBlockImpl.close(OffloadIndexBlockImpl.java:358) ~[?:?]
	at org.apache.bookkeeper.mledger.offload.jcloud.impl.BlobStoreBackedReadHandleImpl.lambda$closeAsync$0(BlobStoreBackedReadHandleImpl.java:102) ~[?:?]
	at org.apache.bookkeeper.common.util.SingleThreadSafeScheduledExecutorService$SafeRunnable.run(SingleThreadSafeScheduledExecutorService.java:46) ~[org.apache.bookkeeper-bookkeeper-common-4.16.3.jar:4.16.3]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.104.Final.jar:4.1.104.Final]
	at java.lang.Thread.run(Thread.java:833) ~[?:?]

Modifications

check indexEntries is not null

Verifying this change

  • [x] Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • [ ] Dependencies (add or upgrade a dependency)
  • [ ] The public API
  • [ ] The schema
  • [ ] The default values of configurations
  • [ ] The threading model
  • [ ] The binary protocol
  • [ ] The REST endpoints
  • [ ] The admin CLI options
  • [ ] The metrics
  • [ ] Anything that affects deployment

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

Matching PR in forked repository

PR in forked repository:

graysonzeng avatar Feb 06 '24 12:02 graysonzeng

The close method maybe be called more than once, so cause the issue. I think we can just let indexEntries=null, no need to clear it

@Technoboy- thanks. done

graysonzeng avatar Feb 07 '24 01:02 graysonzeng

The close method maybe be called more than once, so cause the issue. I think we can just let indexEntries=null, no need to clear it

Wouldn't that be a problem if the object instance gets recycled multiple times?

lhotari avatar Feb 12 '24 14:02 lhotari

The close method maybe be called more than once, so cause the issue. I think we can just let indexEntries=null, no need to clear it

Wouldn't that be a problem if the object instance gets recycled multiple times?

maybe

Technoboy- avatar Feb 13 '24 13:02 Technoboy-

The close method maybe be called more than once, so cause the issue. I think we can just let indexEntries=null, no need to clear it

Wouldn't that be a problem if the object instance gets recycled multiple times?

maybe

There have been bugs in the past with recycled objects that are caused by releasing the object multiple times.

lhotari avatar Feb 18 '24 08:02 lhotari

The close method maybe be called more than once, so cause the issue. I think we can just let indexEntries=null, no need to clear it

Wouldn't that be a problem if the object instance gets recycled multiple times?

maybe

There have been bugs in the past with recycled objects that are caused by releasing the object multiple times.

yes, but for this patch, it's ok to fix it like this, right?

Technoboy- avatar Feb 20 '24 17:02 Technoboy-

The close method maybe be called more than once, so cause the issue. I think we can just let indexEntries=null, no need to clear it

Wouldn't that be a problem if the object instance gets recycled multiple times?

maybe

There have been bugs in the past with recycled objects that are caused by releasing the object multiple times.

yes, but for this patch, it's ok to fix it like this, right?

I doubt that it's correct. The problem will become worse if it is "fixed" like this. I think that it is necessary to address the root cause.

lhotari avatar Feb 20 '24 20:02 lhotari

The close method maybe be called more than once, so cause the issue. I think we can just let indexEntries=null, no need to clear it

Wouldn't that be a problem if the object instance gets recycled multiple times?

maybe

There have been bugs in the past with recycled objects that are caused by releasing the object multiple times.

yes, but for this patch, it's ok to fix it like this, right?

@Technoboy- @graysonzeng I have shared more context in #22110 about the "double release" bug pattern.

lhotari avatar Feb 23 '24 20:02 lhotari

it looks like ML read entries from an already closed ledger, read entries from a closed ledger will lead to exception, and then, ML will try to close the ledger again.

I believe the key point is ML trying to read entries from a closed ledger.

dao-jun avatar Feb 24 '24 13:02 dao-jun

@graysonzeng could please provide the steps that I can reproduce the issue?

dao-jun avatar Feb 24 '24 13:02 dao-jun

@graysonzeng could please provide the steps that I can reproduce the issue?

This is an occasional error that occurs only once. I can't try to reproduce it @dao-jun

graysonzeng avatar Feb 28 '24 06:02 graysonzeng

@graysonzeng what's your pulsar version? could you please provide more log?

dao-jun avatar Feb 28 '24 10:02 dao-jun

Closing Handle multiple times does indeed occur frequently, because ledger#closeAsync is called by async thread when read entries failed, if there are some ongoing read operations, they will also try to read the ledger which is already closed.

and LedgerHandle has also taken measures to address it(https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerHandle.java#L537). But offloader didn't, I'll create another PR to fix it.

dao-jun avatar Feb 29 '24 04:02 dao-jun

related PR: https://github.com/apache/pulsar/pull/22162

dao-jun avatar Feb 29 '24 05:02 dao-jun

Thanks for the fix, I'll close it once the related PR is merged @dao-jun

graysonzeng avatar Feb 29 '24 09:02 graysonzeng

the pr closed automatically since https://github.com/apache/pulsar/pull/22162 merged.

dao-jun avatar Feb 29 '24 13:02 dao-jun