Merging or adding data into "ondisk" indices corrupts index ids
Summary
When using on-disk inverted-list indices (backed by an ivfdata file), any of the following methods:
- merge_from
- merge_into
- add_with_ids
corrupts the ids stored in the index.
Platform
Linux Ubuntu 22.04
Faiss version: 1.8.0
Installed from: pip
Faiss compilation options: n/a
Running on:
- [x] CPU
- [ ] GPU
Interface:
- [ ] C++
- [x] Python
Reproduction instructions
The tar.gz file has a self-contained test that reproduces the problem, as well as a README explaining the issue and the test.
The sample code tries to add data to an on-disk index using the three methods, with an in-memory index as the baseline. In-memory operations always succeed, while on-disk operations always fail.
bug-report-faiss180-ondisk-update.tar.gz
Extract the file above and run the test:
python main.py
About the merge_into inconsistency: this happens because the baseline_index (which manages index.ivfdata) is opened twice, once as baseline_index and once as empty_trained in add_data. As a result, the ivfdata content and the index are inconsistent in one of the two cases. Also, adding vectors directly to an on-disk index is not recommended because it is very slow. Instead, add vectors to one or several in-memory indexes and merge them afterwards, as in
https://github.com/facebookresearch/faiss/blob/e758973fa08164728eb9e136631fe6c57d7edf6c/demos/demo_ondisk_ivf.py
Thanks for checking on this @mdouze.
I don't think the baseline index is what is causing the issue.
I have simplified the test a lot and removed the baseline indices, and I'm now using only merge_from to add new data from an in-memory index. I'm also always isolating all indices now. The problem remains. This is the new, simpler test:
bug-report-faiss180-ondisk-update-v2.tar.gz
Reproducing the issue is, in fact, very easy: open any ondisk index, add data to it using any strategy, and then check the ids inside the index. About 13% of the ids will have negative values after insertion.
There is another report on the same issue.
Also, this issue is happening in my application by just loading a single healthy "ondisk" index and adding data to it.
About adding to the on-disk index being slow: I understand that, but we do not have a low-latency requirement for adding data right now. I have no preference on the method for adding new data, as long as I can add new data to the on-disk index.
I would like to help investigate and solve this issue if possible. I understand that you are busy with other requests, so please let me know if I can help by creating other tests to reproduce or debug it.
Thanks