ContentStoreDirFixedBlock prevents threaded indexing
This class is responsible for storing documents within an index on disk, for later retrieval of the original contents. An index can potentially contain many separate documents. Every document is stored as a series of compressed blocks on disk, where every block is 4k in size. A document is stored as a list of the blocks that make it up, and which part of the document each of those blocks contain.
The issue with the current implementation is twofold:
Firstly, because the compression ratio varies, a different amount of uncompressed data is needed to fill each block with 4k of compressed data. Because of this, compression of the next block cannot begin before the previous block is finished, as it is unknown how much of the uncompressed data the previous block will use up. So blocks are essentially compressed serially.
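To illustrate the dependency (a minimal sketch, not the actual BlackLab code): how much uncompressed input fits in a single 4k block is only known after compressing it, so the starting offset of block n+1 depends on the outcome of block n.

```java
import java.util.zip.Deflater;

// Minimal sketch (not the actual BlackLab code) of the serial dependency:
// how much uncompressed input fits in a 4k block is only known after
// compressing it, so block n+1 cannot start before block n is done.
public class SerialBlocksSketch {
    static final int BLOCK_SIZE = 4096;

    /**
     * Compress from doc[offset..] until the 4k output block is full (or input runs out)
     * and return roughly how many input bytes were consumed. (The real implementation
     * also has to deal with data still buffered inside the deflater; omitted here.)
     */
    static int fillOneBlock(byte[] doc, int offset, byte[] blockOut) {
        Deflater deflater = new Deflater();
        deflater.setInput(doc, offset, doc.length - offset);
        deflater.finish();
        int written = 0;
        while (written < BLOCK_SIZE && !deflater.finished())
            written += deflater.deflate(blockOut, written, BLOCK_SIZE - written);
        int consumed = (int) deflater.getBytesRead(); // only known AFTER compressing this block
        deflater.end();
        return consumed;
    }

    public static void main(String[] args) {
        byte[] doc = "some document content... ".repeat(10_000).getBytes();
        byte[] block = new byte[BLOCK_SIZE];
        int offset = 0;
        while (offset < doc.length) {
            // The next block's starting offset depends on the previous block's result,
            // so the blocks of a single document cannot be compressed in parallel.
            offset += fillOneBlock(doc, offset, block);
        }
    }
}
```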
Secondly, a document can be stored in parts (see ContentStoreDirFixedBlock#storePart) and written to disk as it's being processed. This requires some state about the current document to be kept within the ContentStore class: namely, how much of the document has already been stored, and which block indices/ids were used to do so. The consequence of this is that it is currently impossible to process multiple documents using the same instance of the ContentStore class, because state linked to the current document persists between calls to store()/storePart().
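A simplified sketch of the shape of the problem (field names borrowed from the description of the second solution below; everything else is illustrative, not the actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the current situation (not the actual code): the
// per-document bookkeeping lives on the store instance itself, so two threads
// interleaving storePart() calls for different documents would mix up each
// other's state.
class ContentStoreSketch {
    // State tied to *the one document currently being stored*:
    private final StringBuilder unwrittenContents = new StringBuilder();
    private int charsFromEntryWritten;
    private final List<Integer> blockIndicesWhileStoring = new ArrayList<>();

    public void storePart(String part) {
        unwrittenContents.append(part); // which document does this belong to?
        // ... compress any full blocks, update charsFromEntryWritten and
        //     blockIndicesWhileStoring ...
    }

    public int store(String lastPart) {
        storePart(lastPart);
        // ... flush remaining data, write the TOC entry, reset the fields above ...
        return 0; // document id
    }
}
```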
The current system has a couple of features that have to be taken into account in any redesign:
Every block has some metadata (stored within the table of contents file) that records the location/offset of its uncompressed data within the source document (see TocEntry#deserialize). This effectively allows random access to data within the file, because the block containing a given piece of data can be located from this metadata alone, without having to read or decompress any blocks. Also, blocks can easily be reused, as a document is essentially just a list of pointers to blocks, and every block is just a 4k piece of data within the disk file.
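For illustration, a simplified version of what the per-document TOC information amounts to and how it enables random access (field names are assumptions for this sketch, not the actual TocEntry layout):

```java
// Simplified illustration (assumed field names; the real TocEntry stores more):
// per document, the TOC lists the blocks making up the document and the character
// offset at which each block's uncompressed data starts within the document.
class TocEntrySketch {
    int[] blockIds;          // which 4k blocks on disk hold this document, in order
    int[] blockCharOffsets;  // char offset in the document where each block starts (first is 0)

    /**
     * Find the block containing character position 'pos' using only the TOC,
     * without reading or decompressing any block data.
     * Assumes 0 <= pos < document length.
     */
    int blockContaining(int pos) {
        int i = java.util.Arrays.binarySearch(blockCharOffsets, pos);
        // exact hit: that block; otherwise: the block starting just before pos
        return i >= 0 ? blockIds[i] : blockIds[-i - 2];
    }
}
```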
A couple of ways to solve this:
Allow blocks to have a variable size.
To do this, a block would need to contain the following information:
- offset on disk
- length on disk
- offset within the uncompressed document (new)
- length within the uncompressed document (new)
Reading the old content store would still be possible: the offset and length within the document are already stored in the current system, the on-disk offset follows from the index of the block (index * block_size [4096]), and the on-disk length is constant at 4096 bytes.
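Concretely, a variable-size block entry might look something like the sketch below (hypothetical names, just an illustration of this option), with old-format entries convertible on the fly:

```java
// Hypothetical block entry for variable-size blocks; names are illustrative only.
class BlockEntry {
    long diskOffset;   // offset of the compressed block within the store file
    int diskLength;    // length of the compressed block on disk
    int docCharOffset; // offset of the block's data within the uncompressed document (new)
    int docCharLength; // length of the block's data within the uncompressed document (new)

    /**
     * Old-format entries only stored the offset/length within the document; the
     * on-disk fields follow from the block index and the fixed 4096-byte block size.
     */
    static BlockEntry fromOldFormat(int blockIndex, int docCharOffset, int docCharLength) {
        BlockEntry e = new BlockEntry();
        e.diskOffset = (long) blockIndex * 4096;
        e.diskLength = 4096;
        e.docCharOffset = docCharOffset;
        e.docCharLength = docCharLength;
        return e;
    }
}
```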
Keep the current block system, but allow separate documents to be processed in parallel.
This can be done by moving the document-specific state (charsFromEntryWritten, bytesWritten, blockIndicesWhileStoring, blockCharOffsetsWhileStoring, unwrittenContents) out of ContentStoreDirFixedBlock and into a sort of context class, together with that document's id. The context would be created when storing of a document begins, and passed into the store*() functions of the content store. The global data about the store file itself, such as freeBlocks, next(Block)Id, etc., will need to be synchronized and kept within the store.
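A rough sketch of what that could look like (illustrative only; the method signatures are assumptions, not the existing API):

```java
import java.util.ArrayList;
import java.util.List;

// Per-document state moves into a context object; the store only keeps the
// shared bookkeeping about the file itself, behind synchronization.
class StoreContext {
    final int docId;
    final StringBuilder unwrittenContents = new StringBuilder();
    int charsFromEntryWritten;
    int bytesWritten;
    final List<Integer> blockIndicesWhileStoring = new ArrayList<>();
    final List<Integer> blockCharOffsetsWhileStoring = new ArrayList<>();

    StoreContext(int docId) { this.docId = docId; }
}

class ThreadSafeContentStoreSketch {
    private final List<Integer> freeBlocks = new ArrayList<>(); // shared, file-level state
    private int nextBlockId;                                    // shared, file-level state

    public StoreContext startDocument(int docId) {
        return new StoreContext(docId); // one context per document being stored
    }

    public void storePart(StoreContext ctx, String part) {
        ctx.unwrittenContents.append(part);
        // ... compress any full blocks for this context's document ...
    }

    public void store(StoreContext ctx, String lastPart) {
        storePart(ctx, lastPart);
        // ... flush remaining data and write this document's TOC entry ...
    }

    /** Only access to the shared file-level state needs to be synchronized. */
    private synchronized int acquireBlock() {
        return freeBlocks.isEmpty() ? nextBlockId++ : freeBlocks.remove(freeBlocks.size() - 1);
    }
}
```

With this split, multiple threads can each hold their own StoreContext and store different documents through the same content store instance, only briefly contending on the synchronized block allocation.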
Get rid of the contentStore entirely, and research how to store the document data in Solr/Lucene.
This is probably the best long-term solution.
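One possible direction for that research (a minimal sketch, assuming the original content would simply become a Lucene stored field, which Lucene compresses on disk by default):

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexWriter;

// Minimal sketch of the direction to research (not a worked-out design):
// keep the original document content as a Lucene stored field instead of
// in a separate content store file.
class LuceneStoredContentSketch {
    void addDocument(IndexWriter writer, String originalXml) throws IOException {
        Document doc = new Document();
        // ... add the usual indexed fields here ...
        doc.add(new StoredField("contents", originalXml));
        writer.addDocument(doc);
    }
    // Retrieval later: indexReader.document(luceneDocId).get("contents")
}
```

Whether this still allows efficient random access into very large documents (which the block-based TOC currently provides) would be part of the research.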