Support caching of DataHandles
The current implementation of AbstractChecker.isFormat() (which is heavily used during file opening via SCIFIO) uses DataHandleService.readBuffer(Location) to retrieve a DataHandle for a Location (see AbstractChecker.java#L97). This handle is then passed to isFormat(DataHandle) to check whether a Format is suitable for the data source at that location. As it stands now, DataHandleService.readBuffer(Location) creates a new DataHandle for every invocation.
This is an issue in particular for HTTPHandles, where the length of a file, as computed by HTTPHandle.length(), is stored as a field of the handle. Storing the length significantly speeds up validation, because otherwise each call to HTTPHandle.length() would require an HTTP connection to be initiated and the response to be parsed (see e.g. FormatTools.java#L736). Because every invocation of readBuffer(Location) creates a fresh handle, that stored length is lost each time.
Talking about this with @gab1one in person, we came up with the suggestion of having a dedicated DataHandleCacheService that can act as a central cache of DataHandles. Together with changes to AbstractChecker.java#L97 to first check a cache (implemented as a simple HashMap in my tests), it is even possible to open images from the handle to determine the correct format. This whole process takes about 1-2 seconds in my tests.
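A minimal sketch of the HashMap-based lookup described above, to make the idea concrete. It deliberately does not use the SciJava types: `SketchHandle` and `HandleCacheSketch` are hypothetical stand-ins, and keying the cache by `URI` is an assumption about how a Location might be identified.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a handle; NOT the SciJava DataHandle interface. */
class SketchHandle {
    final URI location;

    SketchHandle(URI location) {
        this.location = location;
    }
}

/** Hypothetical cache: repeated lookups of one location return the same handle. */
class HandleCacheSketch {
    private final Map<URI, SketchHandle> cache = new HashMap<>();
    private final Function<URI, SketchHandle> factory;

    /** Number of handles actually created, exposed only to illustrate reuse. */
    int created = 0;

    HandleCacheSketch(Function<URI, SketchHandle> factory) {
        this.factory = factory;
    }

    /** Returns the cached handle, creating it on the first request only. */
    synchronized SketchHandle get(URI location) {
        return cache.computeIfAbsent(location, loc -> {
            created++;
            return factory.apply(loc);
        });
    }
}
```

With a cache like this, state stored on the handle (such as a previously computed length) would survive across format checks for the same location. The sketch never evicts or closes handles, which a real service would have to address.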
Here is what I would like your thoughts on, @ctrueden, @gab1one: Why shouldn't DefaultDataHandleService itself cache DataHandles?
I believe this is only safe for read-only handles used by code that always ensures the offset is set correctly before reading from the handle. When handles are written to or otherwise modified, they are not safe for reuse. I think being explicit about the fact that a handle might be cached is the better way to handle this.
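To illustrate the offset concern, here is a small sketch of a check that seeks explicitly before reading, so it gives the same answer no matter where a previous user left the handle. `ReadOnlyHandleSketch` and `FormatCheckSketch` are hypothetical names backed by a byte array; they are not part of SciJava or SCIFIO.

```java
/** Hypothetical read-only handle over a byte array; NOT the SciJava DataHandle. */
class ReadOnlyHandleSketch {
    private final byte[] data;
    private int offset;

    ReadOnlyHandleSketch(byte[] data) {
        this.data = data.clone(); // defensive copy: callers cannot modify the content
    }

    void seek(int pos) {
        if (pos < 0 || pos > data.length) {
            throw new IllegalArgumentException("offset out of range: " + pos);
        }
        offset = pos;
    }

    int offset() {
        return offset;
    }

    /** Reads one unsigned byte and advances, or returns -1 at end of data. */
    int read() {
        return offset < data.length ? (data[offset++] & 0xff) : -1;
    }
}

class FormatCheckSketch {
    /** Checks for magic bytes at the start; always seeks to 0 before reading. */
    static boolean startsWith(ReadOnlyHandleSketch handle, byte[] magic) {
        handle.seek(0); // never rely on the offset left behind by earlier use
        for (byte m : magic) {
            if (handle.read() != (m & 0xff)) {
                return false;
            }
        }
        return true;
    }
}
```

Without the `seek(0)` call, a second check on a reused handle would start reading wherever the first one stopped, which is the failure mode described above.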
We could potentially just rely on CacheService instead of building a dedicated Service.
We do need to make sure a handle is only ever used by one process at a time; CacheService does not guarantee this.
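One way the exclusivity requirement could be sketched is a cache that hands out each handle to at most one user at a time and refuses further requests until it is released. `ExclusiveHandleCacheSketch`, `tryAcquire` and `release` are hypothetical names, not existing CacheService or DataHandleService API, and the sketch only covers threads within a single JVM.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical cache that leases each handle to one user at a time. */
class ExclusiveHandleCacheSketch<K, H> {
    private final Map<K, H> handles = new HashMap<>();
    private final Set<K> inUse = new HashSet<>();
    private final Function<K, H> factory;

    ExclusiveHandleCacheSketch(Function<K, H> factory) {
        this.factory = factory;
    }

    /** Returns the handle for key, or empty if someone else currently holds it. */
    synchronized Optional<H> tryAcquire(K key) {
        if (inUse.contains(key)) {
            return Optional.empty();
        }
        // Create the handle on first use; the factory is assumed to return non-null.
        H handle = Objects.requireNonNull(handles.computeIfAbsent(key, factory));
        inUse.add(key);
        return Optional.of(handle);
    }

    /** Marks the handle as free again; callers should do this in a finally block. */
    synchronized void release(K key) {
        inUse.remove(key);
    }
}
```

A caller that gets an empty result could fall back to creating an uncached handle, which keeps the current behavior as the worst case.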
I'd be OK with API in DataHandleService specifying whether caching is OK. Like boolean reuseHandles or something. I have no strong opinion without scrutinizing in more detail.