Exception during scan due to cached block eviction (Possible recurrence of issue described in #774)
Bug Description It appears that we've encountered the same behavior as observed in #774 while testing our rel/2.0.1 deployment in AWS. For a period of time, we observed that our table scans would sporadically fail with the error below (maybe once or twice per hour over a period of a couple days of continuous query testing)
[scan.LookupTask] WARN : exception while doing multi-scan
java.lang.IllegalStateException: Block was evicted
at org.apache.accumulo.core.file.blockfile.cache.lru.CachedBlock.recordSize(CachedBlock.java:159)
at org.apache.accumulo.core.file.blockfile.cache.lru.LruBlockCache.cacheBlock(LruBlockCache.java:238)
at org.apache.accumulo.core.file.blockfile.cache.lru.LruBlockCache.cacheBlock(LruBlockCache.java:263)
at org.apache.accumulo.core.file.blockfile.cache.lru.SynchronousLoadingBlockCache.getBlock(SynchronousLoadingBlockCache.java:133)
at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:409)
at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader.getIndexBlock(MultiLevelIndex.java:808)
at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader.access$100(MultiLevelIndex.java:585)
at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader$Node.lookup(MultiLevelIndex.java:628)
at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader$Node.access$400(MultiLevelIndex.java:591)
at org.apache.accumulo.core.file.rfile.MultiLevelIndex$Reader.lookup(MultiLevelIndex.java:817)
at org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader._seek(RFile.java:997)
at org.apache.accumulo.core.file.rfile.RFile$LocalityGroupReader.seek(RFile.java:895)
at org.apache.accumulo.core.iterators.system.LocalityGroupIterator.seek(LocalityGroupIterator.java:281)
at org.apache.accumulo.core.file.rfile.RFile$Reader.seek(RFile.java:1437)
at org.apache.accumulo.server.problems.ProblemReportingIterator.seek(ProblemReportingIterator.java:103)
at org.apache.accumulo.core.iterators.system.MultiIterator.seek(MultiIterator.java:104)
at org.apache.accumulo.core.iterators.system.StatsIterator.seek(StatsIterator.java:63)
at org.apache.accumulo.core.iterators.system.DeletingIterator.seek(DeletingIterator.java:73)
at org.apache.accumulo.core.iterators.ServerSkippingIterator.seek(ServerSkippingIterator.java:52)
at org.apache.accumulo.core.iterators.system.ColumnFamilySkippingIterator.seek(ColumnFamilySkippingIterator.java:127)
at org.apache.accumulo.core.iterators.SynchronizedServerFilter.seek(SynchronizedServerFilter.java:56)
at org.apache.accumulo.core.iterators.WrappingIterator.seek(WrappingIterator.java:99)
at org.apache.accumulo.core.iterators.user.VersioningIterator.seek(VersioningIterator.java:82)
at aquery.iterators.Query.get_pageid(Query.java:453)
at aquery.iterators.Query.next(Query.java:351)
at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.readNext(SourceSwitchingIterator.java:174)
at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.next(SourceSwitchingIterator.java:145)
at org.apache.accumulo.tserver.tablet.Tablet.lookup(Tablet.java:641)
at org.apache.accumulo.tserver.tablet.Tablet.lookup(Tablet.java:770)
at org.apache.accumulo.tserver.scan.LookupTask.run(LookupTask.java:116)
at org.apache.accumulo.tserver.session.ScanSession$ScanMeasurer.run(ScanSession.java:54)
at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:748)
Versions:
- Accumulo: 2.0.1
- OS: CentOS 7.7.1908
- Java: OpenJDK 1.8.0_272
- Hadoop: 3.0.0-cdh6.3.1
- ZK: 3.4.14
Cluster Environment:
- 12-node m5.2xlarge ec2 cluster; 10 tservers; 3 ZKs; Hadoop HA namenodes
- LRU cache config:
default | tserver.cache.data.size ........................... | 10%
site | @override ...................................... | 2G
system | @override ...................................... | 4G
default | tserver.cache.index.size .......................... | 25%
site | @override ...................................... | 2G
system | @override ...................................... | 1G
default | tserver.cache.manager.class ....................... | org.apache.accumulo.core.file.blockfile.cache.lru.LruBlockCacheManager
Other relevant facts:
- As best we can tell, the errors began to surface several hours after having switched
tserver.cache.manager.class(via accumulo shell) from accumulo.ohc.OhcCacheManager back to LruBlockCacheManager - No intervening restart of Accumulo was performed after said config change in the shell
- At the time of the errors, the cluster was under medium-heavy load from concurrent user queries/scans and live ingest
- After a recent full restart of Accumulo, the errors seem to have ceased
There is no evidence in logs of this error having occurred prior to this instance (going back over a month), and several hours have lapsed since the full restart of Accumulo. No recurrence so far
Update: Apparently I misstated our cache config changes above. Sorry about that (OHC wasn't involved, although we had been testing that recently)
LRU cache had already been re-enabled, so the sequence of events was actually...
- We were testing with LRU cache manager configured, but with data and index cache both disabled
- Index cache was enabled via the shell followed by a few hours of testing. No errors observed during this time
- Data cache was then enabled via the shell, with several more hours of testing overnight. Errors became apparent the next day upon review and continued up until a restart of Accumulo was performed
- No additional errors have been observed since the restart
We'll try to replicate this and collect more info as well
@keith-ratcliffe We could probably easily prevent the error if we disallowed cache manager reconfiguration without a restart of the process, like we do with some other properties. How detrimental would that naive solution be to your use case?
@keith-ratcliffe We could probably easily prevent the error if we disallowed cache manager reconfiguration without a restart of the process, like we do with some other properties. How detrimental would that naive solution be to your use case?
Makes sense. Not detrimental at all, really.
I was looking into this, and I think I was wrong about the preventative measure. It looks like it is already the case that none of these properties will have an effect until the tserver is restarted anyway. The only one that could affect things is enabling or disabling the use of the cache, and that is a per-table setting that should be possible to turn on and off at runtime.
Looking into the stack trace a bit more, I started wondering if the cacheId wasn't unique. Unfortunately, the exceptions don't include the blockName, which they should.
Additionally, I saw that at least in one case, SummaryReader.load(), there is no cacheId being set at all, so it should be null, which seems prone to error, since that's clearly not unique. The CacheableBuilder object looks like a builder pattern, but doesn't enforce required fields, and doesn't have any validation function to ensure it was used correctly, so cacheId could easily be set to null. However, even if that is a problem for SummaryReader.load() (@keith-turner implemented summaries, so he might have some insight into that), it doesn't seem like it would cause the stack trace reported in this issue, which doesn't mention summaries at all.
If the problem is the cacheId is not unique (null or otherwise), then the block name looked up could have lots of collisions for the same offset (in CacheableBlockFile.Reader.getMetaBlock(), the block name is determined with String _lookup = this.cacheId + "R" + offset;).
There are a few things that can/should be done:
- Improve the exception messages to include the
blockNameto make debugging this easier; @keith-ratcliffe is this issue reproducible? If so, what blockName is printed with the IllegalStateExceptions? (Edit: #2163) - Ensure
CacheableBuilderenforces the requirement for a non-nullcacheIdby adding abuild()function with validation, like a true builder pattern (might not fix the problem) (Edit: #2160) - Can document these properties as "fixed" properties in ZK (would help with documentation only... would not help with this bug, since the code that sets up the cache manager is only executed once at startup). (Edit: #2164)
While digging into this, I noticed there were a few classes in the stack trace that I don't have the source to to track down further.
Ensure CacheableBuilder enforces the requirement for a non-null cacheId by adding a build() function with validation, like a true builder pattern (might not fix the problem)
I will look into this.
I added a check for null cacheId in https://github.com/apache/accumulo/pull/2160 but like you said, not sure that will fix the underlying problem.
I added the blockName to the exception message in #2163 ; it doesn't fix anything, but if this issue can be reproduced, it should help debug it.
I also created #2164 to document the cache properties that require restart to take effect.