Incremental Index Maintenance for File/Partition Mutable Datasets
Describe the problem
Support for index maintenance for mutable datasets.
Mutable data comprises appended data (new data added to existing data) and deleted data (files or partitions removed from existing data).
Problem: How do we keep the index up to date with updated data (either appended or deleted or both)?
Describe your proposed solution
We plan to expose the following APIs to solve this problem:
- hyperspace.refreshIndex(indexName, mode = "quick"): For deleted data, this API performs a "quick" delete version of index refresh by storing the list of deleted files in the metadata. For appended data, it creates additional index data for just the newly appended data files.
- hyperspace.refreshIndex(indexName, mode = "full"): Recreates the full index from scratch. This mode doesn't need separate handling of deleted and appended data.
- hyperspace.optimizeIndex(indexName, mode = "quick"): This optimization API works directly on the index by compacting a large number of small index files into a small number of larger files.
- hyperspace.optimizeIndex(indexName, mode = "full"): This optimization API also works directly on the index. Unlike "quick" mode, it takes all the index files, not just the smaller ones, and compacts them into larger files.
- hyperspace.refreshIndex(indexName, mode = "smart"): For deleted data, this API removes the index records that originate from deleted source data files. For appended data, it creates additional index data with smart optimization of previously created small index files.
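To make the intended semantics of the "quick" refresh and the two optimize modes concrete, here is a minimal sketch in plain Python. It is not Hyperspace code: IndexMetadata, quick_refresh, files_to_compact and the small_file_threshold parameter are hypothetical names introduced only for illustration, and the real metadata format and file selection rules may differ.

```python
from dataclasses import dataclass, field


@dataclass
class IndexMetadata:
    """Hypothetical stand-in for index metadata (not the real log format)."""
    indexed_files: set = field(default_factory=set)  # source files with index data
    deleted_files: set = field(default_factory=set)  # source files recorded as deleted


def quick_refresh(meta, current_source_files):
    """Model of refreshIndex(mode = "quick").

    Deleted files are only recorded in metadata (index data is untouched);
    appended files are returned so that index data is built for them alone.
    """
    current = set(current_source_files)
    appended = current - meta.indexed_files
    deleted = meta.indexed_files - current
    new_meta = IndexMetadata(
        indexed_files=meta.indexed_files | appended,
        deleted_files=deleted,
    )
    return new_meta, sorted(appended)


def files_to_compact(index_file_sizes, mode, small_file_threshold):
    """Model of which index files optimizeIndex would pick up.

    "quick" selects only files below the (assumed) size threshold;
    "full" selects every index file.
    """
    if mode == "full":
        return sorted(index_file_sizes)
    if mode == "quick":
        return sorted(
            name for name, size in index_file_sizes.items()
            if size < small_file_threshold
        )
    raise ValueError(f"unknown optimize mode: {mode}")
```

In this model a "full" refresh would simply rebuild index data from the current source file list and reset deleted_files to empty, which is why it needs no separate handling of appends and deletes.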
Describe the issue
The work is broken down into the following tasks, in order, to allow parallel development of the various pieces of functionality:
- [x] Delete Support: Add support for delete to index refresh #133, #142 @pirz
- [x] Append Support: Incremental Indexing: Support for incremental changes to index when new data is added #29 @apoorvedave1
- [x] Merge append and delete into seamless api: index refresh #105, #149, PR #187 @apoorvedave1 @pirz
- [x] Enforce delete during read time #134 @pirz
- [x] Quick/Full Optimize support for append-only data ISSUE: #111, PR: #166 @apoorvedave1
- [ ] Implement optimized smart refresh indexes of append-only data #112 @apoorvedave1
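For the "enforce delete during read time" item, a rough sketch of the idea in plain Python follows. It assumes each index record carries the name of the source file it came from; the source_file field and the enforce_deletes helper are hypothetical and only illustrate the filtering step, not how Hyperspace actually implements it.

```python
def enforce_deletes(index_rows, deleted_files):
    """Drop index records whose source file is listed as deleted.

    index_rows: iterable of dicts with a (hypothetical) "source_file" key.
    deleted_files: file names recorded in the index metadata by a quick refresh.
    """
    deleted = set(deleted_files)
    return [row for row in index_rows if row["source_file"] not in deleted]
```

With this kind of filter, a quick refresh never has to rewrite index data for deletes: the stale records stay on disk and are skipped whenever the index is read.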
@apoorvedave1 Could you please explain why incremental = "true" is needed in the API? Can we make do with just the mode parameter? Thank you!
@apoorvedave1 @pirz @sezruby @imback82 Let us use this as the uber-issue to track all the work needed to get this working.
@apoorvedave1 Can you kindly coordinate with everyone and update the description to capture all issues related to this?
@pirz @apoorvedave1 There are multiple issues currently: #133, #105, #149. Can we do the following?
Let us update this issue so that it captures the entire work end to end, and have a separate issue for each individual line item. At the moment, this issue only covers append-only data, but I think it's better to make this the uber-issue that captures both appends and deletes (to reflect the title).
Once we are done with this, we can close some of the issues I linked to in the first line.
@pirz @apoorvedave1 Thanks for updating the issue! Can we close anything that's no longer relevant?