herddb
Inefficient recovery of BRIN indexes
Another flavour of #127
During recovery of a large table with BRIN indexes we can see BRIN index checkpoints, each with an fsync (force0):
"hdb-mucca-activator" #29 daemon prio=5 os_prio=0 cpu=731142,82ms elapsed=1803,92s tid=0x00007f5a5d51c800 nid=0x168ba runnable [0x00007f59d58d4000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcherImpl.force0([email protected]/Native Method)
at sun.nio.ch.FileDispatcherImpl.force([email protected]/FileDispatcherImpl.java:82)
at sun.nio.ch.FileChannelImpl.force([email protected]/FileChannelImpl.java:461)
at herddb.utils.ManagedFile.sync(ManagedFile.java:115)
at herddb.file.FileDataStorageManager.writeIndexPage(FileDataStorageManager.java:910)
at herddb.file.FileDataStorageManager.writeIndexPage(FileDataStorageManager.java:938)
at herddb.index.brin.BRINIndexManager$IndexDataStorageImpl.createDataPage(BRINIndexManager.java:470)
at herddb.index.brin.BlockRangeIndex$Block.checkpointNoLock(BlockRangeIndex.java:665)
at herddb.index.brin.BlockRangeIndex$Block.checkpoint(BlockRangeIndex.java:679)
at herddb.index.brin.BlockRangeIndex$Block.unloadNoLock(BlockRangeIndex.java:691)
at herddb.index.brin.BlockRangeIndex$Block.unload(BlockRangeIndex.java:702)
at herddb.index.brin.BlockRangeIndex$Block.unload(BlockRangeIndex.java:730)
at herddb.index.brin.BlockRangeIndex$Block.ensureBlockLoaded(BlockRangeIndex.java:499)
at herddb.index.brin.BlockRangeIndex$Block.delete(BlockRangeIndex.java:380)
at herddb.index.brin.BlockRangeIndex.delete(BlockRangeIndex.java:1055)
at herddb.index.brin.BRINIndexManager.recordDeleted(BRINIndexManager.java:421)
at herddb.core.TableManager.applyDelete(TableManager.java:1677)
at herddb.core.TableManager.onTransactionCommit(TableManager.java:1467)
at herddb.core.TableSpaceManager.apply(TableSpaceManager.java:368)
at herddb.core.TableSpaceManager$ApplyEntryOnRecovery.accept(TableSpaceManager.java:1858)
at herddb.core.TableSpaceManager$ApplyEntryOnRecovery.accept(TableSpaceManager.java:1847)
at herddb.cluster.BookkeeperCommitLog.recovery(BookkeeperCommitLog.java:477)
at herddb.core.TableSpaceManager.recover(TableSpaceManager.java:299)
In the log you can see pauses of about 21 seconds, like:
20-01-15-13-27-09 herddb.cluster.BookkeeperCommitLog Jan 15, 2020 1:27:09 PM herddb.cluster.BookkeeperCommitLog recovery
INFO: q2 (36f8b443df7e459682d26c97246d47dd) From entry 72,077,207, to entry 72,087,207 (95.85 %) read time 21085 ms
20-01-15-13-27-09 herddb.cluster.BookkeeperCommitLog Jan 15, 2020 1:27:09 PM herddb.cluster.BookkeeperCommitLog recovery
INFO: q2 (36f8b443df7e459682d26c97246d47dd) From entry 72,087,208, to entry 72,097,208 (95.863 %) read time 93 ms
20-01-15-13-27-24 herddb.cluster.BookkeeperCommitLog Jan 15, 2020 1:27:24 PM herddb.cluster.BookkeeperCommitLog recovery
INFO: q2 (36f8b443df7e459682d26c97246d47dd) From entry 72,097,209, to entry 72,107,209 (95.877 %) read time 14745 ms
20-01-15-13-27-24 herddb.cluster.BookkeeperCommitLog Jan 15, 2020 1:27:24 PM herddb.cluster.BookkeeperCommitLog recovery
INFO: q2 (36f8b443df7e459682d26c97246d47dd) From entry 72,107,210, to entry 72,117,210 (95.89 %) read time 72 ms
20-01-15-13-27-24 herddb.cluster.BookkeeperCommitLog Jan 15, 2020 1:27:24 PM herddb.cluster.BookkeeperCommitLog recovery
INFO: q2 (36f8b443df7e459682d26c97246d47dd) From entry 72,117,211, to entry 72,127,211 (95.903 %) read time 90 ms
20-01-15-13-27-45 herddb.cluster.BookkeeperCommitLog Jan 15, 2020 1:27:45 PM herddb.cluster.BookkeeperCommitLog recovery
INFO: q2 (36f8b443df7e459682d26c97246d47dd) From entry 72,127,212, to entry 72,137,212 (95.917 %) read time 21274 ms
20-01-15-13-27-45 herddb.cluster.BookkeeperCommitLog Jan 15, 2020 1:27:45 PM herddb.cluster.BookkeeperCommitLog recovery
INFO: q2 (36f8b443df7e459682d26c97246d47dd) From entry 72,137,213, to entry 72,147,213 (95.93 %) read time 90 ms
20-01-15-13-27-45 herddb.cluster.BookkeeperCommitLog Jan 15, 2020 1:27:45 PM herddb.cluster.BookkeeperCommitLog recovery
INFO: q2 (36f8b443df7e459682d26c97246d47dd) From entry 72,147,214, to entry 72,157,214 (95.943 %) read time 86 ms
In such cases the BRIN index needs to unload a page containing dirty records, so it has to flush the changed data to disk. The same procedure is invoked both on unload and on index checkpoint, and force has to be true to be sure the data has really been written.
Practically speaking, the force would only be needed during a real index checkpoint, but once a page has been flushed without force I don't know how to ensure that it has really been written at a later checkpoint.
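One possible direction, sketched below, is to defer the fsync: write the page without force when it is unloaded, remember which files still need syncing, and force all of them in a single pass at the next real checkpoint. This is only an illustration of the idea, not herddb's actual API; the class and method names (DeferredSyncPageWriter, writePage, checkpoint) are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: defer fsync from page-unload time to checkpoint time,
// so unloading a dirty page during recovery does not pay a synchronous fsync.
public class DeferredSyncPageWriter {

    // Files written since the last checkpoint that still need an fsync.
    private final Set<Path> pendingSync = ConcurrentHashMap.newKeySet();

    // Write a page without forcing it to disk; remember it for the next checkpoint.
    public void writePage(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));
            // No ch.force(...) here: the expensive fsync is deferred.
        }
        pendingSync.add(file);
    }

    // At checkpoint, fsync every page written since the last checkpoint,
    // then forget them: after this call the data is known to be durable.
    public void checkpoint() throws IOException {
        for (Path file : pendingSync) {
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
                ch.force(false); // sync file data only, metadata not required
            }
        }
        pendingSync.clear();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("brin-sketch");
        DeferredSyncPageWriter writer = new DeferredSyncPageWriter();
        Path page = dir.resolve("index-page-1");
        writer.writePage(page, "some index data".getBytes());
        writer.checkpoint(); // one fsync pass instead of one fsync per unload
        System.out.println("synced bytes: " + Files.size(page));
    }
}
```

The open question from above still applies: between writePage and checkpoint the page is not durable, so the checkpoint logic must treat every page in pendingSync as unsynced until its force completes.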