solr SOLR-16163: Review unintended try-with-resources closing-order modifications

Apr 22 '22 19:04 magibney

I suspect the searcher and/or request being open after the core close will keep the core increment one higher and thus not actually closed until the searcher/request is.

I dug into this a bit more and it's quite tricky.

The problem manifests when the core.close() call happens before searcher.decref() ... but only if the core.close() call in question is the last call to close() for the given core. Until core.refCount becomes 0, core.close() is basically just bookkeeping; but the last core.close() proceeds to doClose(), which calls directoryFactory.close() -- the CachingDirectoryFactory implementation of which blocks waiting for its cached Directory instances to be released. The SolrIndexSearcher whose closing was reordered to follow core.close() actually holds a reference to the cached Directory instance, not the SolrCore itself, so the thread indeed leaks/resource is not closed.

Looking more closely at why this issue addressed by #815 was hard to reproduce: the bulk of the replication in TestReplicationHandler takes place via explicit fetchindex command -- almost always with the special test param wait=true, causing the associated replication fetch to take place (from start to finish) within the context/lifetime of the enclosing SolrQueryRequest, which itself caries a reference to the SolrCore, ensuring that the replication code will not be the last call to core.close(). So core.close() amounts to just a decrement of a refcount -- it doesn't block, and the searcher.decref() happens ok, even though out-of-order.

The problem happens when polling replication ("indexFetcher" thread) happens to hit around the time of core close -- all the other core references have been decremented, and the affected code is left holding the only remaining reference to the core and (via the SolrIndexSearcher) to the Directory instance. Polling replication is vulnerable (whereas explicit fetchindex command is not) because polling is a scheduled task that doesn't take place within the context of a SolrQueryRequest holding references to the SolrCore. Indeed, I verified that the problem addressed by #815 could be equally (though hackily) addressed by wrapping the polling doFetch() in a try-with-resources block artificially incrementing the refCount for the SolrCore!

I'm not sure whether there might be other cases that would trigger this, especially with other areas of the code; but I think the prudent thing to do for now would be to take the instances identified in this PR and restore the original close order -- i.e. SolrIndexSearcher.decref() (or SolrQueryRequest.close(), which in turn may call SolrIndexSearcher.decref()) before calling close() on the associated SolrCore.

Apr 28 '22 21:04 magibney

So my overall comment at looking at the underlying cause of the "out of order" is that it would be more clear to move the closes that order matters closer to where they should be closed (nested try w/ resources or otherwise) instead of trying to do multiple close in a finally.

May 05 '22 19:05 risdenk

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the [email protected] mailing list. Thank you for your contribution!

Feb 20 '24 00:02 github-actions[bot]