accumulo Exception during scan due to cached block eviction (Possible recurrence of issue described in #774)

Bug Description It appears that we've encountered the same behavior as observed in #774 while testing our rel/2.0.1 deployment in AWS. For a period of time, we observed that our table scans would sporadically fail with the error below (maybe once or twice per hour over a period of a couple days of continuous query testing)

[scan.LookupTask] WARN : exception while doing multi-scan
java.lang.IllegalStateException: Block was evicted
        at org.apache.accumulo.core.file.blockfile.cache.lru.CachedBlock.recordSize(CachedBlock.java:159)
        at org.apache.accumulo.core.file.blockfile.cache.lru.LruBlockCache.cacheBlock(LruBlockCache.java:238)
        at org.apache.accumulo.core.file.blockfile.cache.lru.LruBlockCache.cacheBlock(LruBlockCache.java:263)
        at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.getBlock(SynchronousLoadingBlockCache.java:133)
        at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:409)
        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader.getIndexBlock(MultiLevelIndex.java:808)
        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader.access$100(MultiLevelIndex.java:585)
        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader$Node.lookup(MultiLevelIndex.java:628)
        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader$Node.access$400(MultiLevelIndex.java:591)
        at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader.lookup(MultiLevelIndex.java:817)
        at org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader._seek(RFile.java:997)
        at org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader.seek(RFile.java:895)
        at org.apache.accumulo.core.iterators.system.LocalityGroupIterator.seek(LocalityGroupIterator.java:281)
        at org.apache.accumulo.core.file.rfile.RFile$Reader.seek(RFile.java:1437)
        at org.apache.accumulo.server.problems.ProblemReportingIterator.seek(ProblemReportingIterator.java:103)
        at org.apache.accumulo.core.iterators.system.MultiIterator.seek(MultiIterator.java:104)
        at org.apache.accumulo.core.iterators.system.StatsIterator.seek(StatsIterator.java:63)
        at org.apache.accumulo.core.iterators.system.DeletingIterator.seek(DeletingIterator.java:73)
        at org.apache.accumulo.core.iterators.ServerSkippingIterator.seek(ServerSkippingIterator.java:52)
        at org.apache.accumulo.core.iterators.system.ColumnFamilySkippingIterator.seek(ColumnFamilySkippingIterator.java:127)
        at org.apache.accumulo.core.iterators.SynchronizedServerFilter.seek(SynchronizedServerFilter.java:56)
        at org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:99)
        at org.apache.accumulo.core.iterators.user.VersioningIterator.seek(VersioningIterator.java:82)
        at aquery.iterators.Query.get_pageid(Query.java:453)
        at aquery.iterators.Query.next(Query.java:351)
        at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.readNext(SourceSwitchingIterator.java:174)
        at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.next(SourceSwitchingIterator.java:145)
        at org.apache.accumulo.tserver.tablet.Tablet.lookup(Tablet.java:641)
        at org.apache.accumulo.tserver.tablet.Tablet.lookup(Tablet.java:770)
        at org.apache.accumulo.tserver.scan.LookupTask.run(LookupTask.java:116)
        at org.apache.accumulo.tserver.session.ScanSession$ScanMeasurer.run(ScanSession.java:54)
        at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
        at java.lang.Thread.run(Thread.java:748)

Versions:

Accumulo: 2.0.1
OS: CentOS 7.7.1908
Java: OpenJDK 1.8.0_272
Hadoop: 3.0.0-cdh6.3.1
ZK: 3.4.14

Cluster Environment:

12-node m5.2xlarge ec2 cluster; 10 tservers; 3 ZKs; Hadoop HA namenodes
LRU cache config:

default    | tserver.cache.data.size ........................... | 10%
site       |    @override ...................................... | 2G
system     |    @override ...................................... | 4G
default    | tserver.cache.index.size .......................... | 25%
site       |    @override ...................................... | 2G
system     |    @override ...................................... | 1G
default    | tserver.cache.manager.class ....................... | org.apache.accumulo.core.file.blockfile.cache.lru.LruBlockCacheManager

Other relevant facts:

As best we can tell, the errors began to surface several hours after having switched tserver.cache.manager.class (via accumulo shell) from accumulo.ohc.OhcCacheManager back to LruBlockCacheManager
No intervening restart of Accumulo was performed after said config change in the shell
At the time of the errors, the cluster was under medium-heavy load from concurrent user queries/scans and live ingest
After a recent full restart of Accumulo, the errors seem to have ceased

There is no evidence in logs of this error having occurred prior to this instance (going back over a month), and several hours have lapsed since the full restart of Accumulo. No recurrence so far

Mar 18 '21 20:03 keith-ratcliffe

Update: Apparently I misstated our cache config changes above. Sorry about that (OHC wasn't involved, although we had been testing that recently)

LRU cache had already been re-enabled, so the sequence of events was actually...

We were testing with LRU cache manager configured, but with data and index cache both disabled
Index cache was enabled via the shell followed by a few hours of testing. No errors observed during this time
Data cache was then enabled via the shell, with several more hours of testing overnight. Errors became apparent the next day upon review and continued up until a restart of Accumulo was performed
No additional errors have been observed since the restart

Mar 19 '21 20:03 keith-ratcliffe

We'll try to replicate this and collect more info as well

Mar 19 '21 20:03 keith-ratcliffe

@keith-ratcliffe We could probably easily prevent the error if we disallowed cache manager reconfiguration without a restart of the process, like we do with some other properties. How detrimental would that naive solution be to your use case?

Mar 19 '21 22:03 ctubbsii

@keith-ratcliffe We could probably easily prevent the error if we disallowed cache manager reconfiguration without a restart of the process, like we do with some other properties. How detrimental would that naive solution be to your use case?

Makes sense. Not detrimental at all, really.

Mar 20 '21 00:03 keith-ratcliffe

I was looking into this, and I think I was wrong about the preventative measure. It looks like it is already the case that none of these properties will have an effect until the tserver is restarted anyway. The only one that could affect things is enabling or disabling the use of the cache, and that is a per-table setting that should be possible to turn on and off at runtime.

Looking into the stack trace a bit more, I started wondering if the cacheId wasn't unique. Unfortunately, the exceptions don't include the blockName, which they should.

Additionally, I saw that at least in one case, SummaryReader.load(), there is no cacheId being set at all, so it should be null, which seems prone to error, since that's clearly not unique. The CacheableBuilder object looks like a builder pattern, but doesn't enforce required fields, and doesn't have any validation function to ensure it was used correctly, so cacheId could easily be set to null. However, even if that is a problem for SummaryReader.load() (@keith-turner implemented summaries, so he might have some insight into that), it doesn't seem like it would cause the stack trace reported in this issue, which doesn't mention summaries at all.

If the problem is the cacheId is not unique (null or otherwise), then the block name looked up could have lots of collisions for the same offset (in CacheableBlockFile.Reader.getMetaBlock(), the block name is determined with String _lookup = this.cacheId + "R" + offset;).

There are a few things that can/should be done:

Improve the exception messages to include the blockName to make debugging this easier; @keith-ratcliffe is this issue reproducible? If so, what blockName is printed with the IllegalStateExceptions? (Edit: #2163)
Ensure CacheableBuilder enforces the requirement for a non-null cacheId by adding a build() function with validation, like a true builder pattern (might not fix the problem) (Edit: #2160)
Can document these properties as "fixed" properties in ZK (would help with documentation only... would not help with this bug, since the code that sets up the cache manager is only executed once at startup). (Edit: #2164)

While digging into this, I noticed there were a few classes in the stack trace that I don't have the source to to track down further.

Jun 11 '21 03:06 ctubbsii

Ensure CacheableBuilder enforces the requirement for a non-null cacheId by adding a build() function with validation, like a true builder pattern (might not fix the problem)

I will look into this.

Jun 11 '21 11:06 milleruntime

I added a check for null cacheId in https://github.com/apache/accumulo/pull/2160 but like you said, not sure that will fix the underlying problem.

Jun 11 '21 14:06 milleruntime

I added the blockName to the exception message in #2163 ; it doesn't fix anything, but if this issue can be reproduced, it should help debug it.

Jun 14 '21 14:06 ctubbsii

I also created #2164 to document the cache properties that require restart to take effect.

Jun 14 '21 15:06 ctubbsii