Add a Better Binary Quantizer (RaBitQ) format for dense vectors
High-level design
RaBitQ is essentially a better binary quantization, which works across all models we have tested against. Like PQ, it does require coarse-grained clustering to be effective at higher vector densities ("effective" being defined as requiring only 5x or lower oversampling for recall > 95%). But in our testing, the number of vectors required per cluster can be exceptionally large (tens to hundreds of millions).
For Euclidean vectors, as stored in the index:
| quantized vector | distance_to_centroid | vector magnitude |
|---|---|---|
| (vector_dimension/8) bytes | float | float |
For dot-product vectors:
| quantized vector | vector dot-product with binarized self | vector magnitude | centroid dot-product |
|---|---|---|---|
| (vector_dimension/8) bytes | float | float | float |
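As a rough illustration of the two record layouts above, here is the per-vector on-disk arithmetic (the helper names are mine for illustration, not Lucene's):

```python
def euclidean_record_bytes(dims: int) -> int:
    # quantized vector (1 bit/dim) + distance_to_centroid + magnitude
    return dims // 8 + 4 + 4

def dot_product_record_bytes(dims: int) -> int:
    # quantized vector + dot-product with binarized self + magnitude
    # + centroid dot-product
    return dims // 8 + 4 + 4 + 4

# A 768-dim vector costs 96 bytes of quantized bits plus the float
# corrective terms, versus 768 * 4 = 3072 bytes for raw float32.
print(euclidean_record_bytes(768))    # 104
print(dot_product_record_bytes(768))  # 108
```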
The vector metadata contains all the regular things (similarity, encoding, sparse vector DISI, etc.).
For indexing into HNSW we actually have a multi-step process. RaBitQ encodes query vectors differently than index vectors. Consequently, during segment merge & HNSW building, another temporary file is written containing the query-quantized vectors over the configured centroids. One downside is that this temporary file will actually be larger than the regular vector index, because we use asymmetric quantization to keep good information around. But once the merge is complete, this file is deleted.
We then read from the query temporary file when adding a vector to the graph, and when exploring HNSW we search the indexed quantized values.
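To illustrate the asymmetric idea (index side reduced to 1 bit per dimension, query side kept at a finer resolution, here 4 bits), a toy sketch follows. All names are mine, the centering on a zero centroid is a simplification, and the corrective terms from the layouts above are omitted:

```python
def binarize(v, centroid):
    # Index side: 1 bit per dimension (sign relative to the centroid).
    return [1 if x - c > 0 else 0 for x, c in zip(v, centroid)]

def quantize_query(q, centroid, bits=4):
    # Query side: uniform scalar quantization of the centered query.
    centered = [x - c for x, c in zip(q, centroid)]
    lo, hi = min(centered), max(centered)
    step = (hi - lo) / (2 ** bits - 1)
    codes = [round((x - lo) / step) for x in centered]
    return codes, lo, step

def estimated_inner_product(index_bits, codes, lo, step):
    # Approximates <q - centroid, sign(v - centroid)>; the real scheme
    # layers corrective terms (magnitudes, centroid dot-products) on top.
    total = 0.0
    for b, code in zip(index_bits, codes):
        dequant = lo + code * step
        total += dequant if b == 1 else -dequant
    return total
```

The per-dimension query error is bounded by `step / 2`, which is why keeping more bits on the query side preserves "good information" even though the index side stores only signs.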
closes: https://github.com/apache/lucene/issues/13650
@benwtrent
possibly switch to LongValues for storing vectorOrd -> centroidOrd mapping
I was thinking about adding the centroid mappings as LongValues at the end of the meta file, but this could potentially make the meta file quite large (for 100M docs, we would need an extra 100MB). We really try to keep meta files small, so I would prefer either:
- to keep the current approach (add a byte at the end of each vector in the vectors file). Indeed, this may throw off paging size, but maybe the effect on memory-mapped files is not big?
- to add an extra file for the centroids mapping. The centroids mapping can be accessed through a memory-mapped file, or loaded directly into memory on first use.
What do you think?
For now, we are keeping the 1st (current) approach.
100MB assumes that, even when compressed, it's a single byte per centroid ord. 100M vectors might only have 2 centroids and thus need only a single bit per vector to store.
Also, I would expect the centroids to be at the end of the "veb" file, not the metadata, like we already do for the sparse vector ord-to-doc resolution.
But, either solution needs testing for sure.
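For reference, the bit-count arithmetic behind a compressed vectorOrd -> centroidOrd mapping is just ceil(log2(num_centroids)) bits per vector. A sketch (not the actual encoding Lucene would use):

```python
import math

def bits_per_centroid_ord(num_centroids: int) -> int:
    # With a single centroid there is nothing to store at all;
    # otherwise each vector needs ceil(log2(n)) bits for its centroid ord.
    if num_centroids <= 1:
        return 0
    return math.ceil(math.log2(num_centroids))

print(bits_per_centroid_ord(2))    # 1 bit per vector instead of a full byte
print(bits_per_centroid_ord(255))  # 8 bits, i.e. one byte per vector
```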
@benwtrent we are trying to run tests with the RaBitQ Lucene implementation, and are not able to replicate the numbers reported in the paper. Have you run tests as part of the implementation, and if so, could you please point me to the location? We want to see if we have made an error somewhere. Thanks.
@tanyaroosta we are still doing larger scale testing, but if you want to test with LuceneUtil, here is the branch I am using: https://github.com/mikemccand/luceneutil/compare/main...benwtrent:luceneutil:bbq?expand=1
2-5x oversampling is required with HNSW to get good recall.
The original paper does periodic rescoring with the raw vectors which effectively means you need random access with raw vectors, even when NOT using HNSW. This just doesn't scale when trying to reduce memory usage.
Instead, we are opting for a more traditional approach of oversampling and reranking.
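That oversample-and-rerank flow can be sketched roughly like this (hypothetical function names, not Lucene's API): gather `oversample * k` candidates with the cheap quantized score, then rescore only those with the raw float vectors.

```python
def search_with_rerank(query, k, oversample, quantized_score, raw_score, num_docs):
    # Phase 1: rank docs by the approximate (quantized) score and keep
    # an oversampled candidate set. Sequential access is enough here.
    candidates = sorted(range(num_docs),
                        key=lambda d: quantized_score(query, d), reverse=True)
    candidates = candidates[: int(k * oversample)]
    # Phase 2: rerank only the small candidate set with exact raw-vector
    # scores, so random access to raw vectors is bounded by k * oversample.
    reranked = sorted(candidates, key=lambda d: raw_score(query, d), reverse=True)
    return reranked[:k]
```

The point versus the paper's periodic rescoring is that random access to raw vectors is confined to the final, small candidate set.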
Following up on the above comment by tanyaroosta: the dataset I was using for benchmarking RaBitQ through luceneutil (main branch) was Amazon's ASIN and query embeddings (256-dim float vectors). Here are three sets of results I got:
Normal HNSW
Results:
| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | force merge s | num segments | index size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.721 | 0.128 | 1439443 | 10 | 6 | 32 | 40 | no | 81.51 | 42.58 | 1 | 1455.25 |
| 0.828 | 0.166 | 1439443 | 10 | 6 | 64 | 100 | no | 202.30 | 78.03 | 1 | 1472.66 |
| 0.886 | 0.202 | 1439443 | 10 | 6 | 64 | 250 | no | 545.83 | 242.29 | 1 | 1493.95 |
| 0.914 | 0.227 | 1439443 | 10 | 6 | 64 | 500 | no | 1111.43 | 452.62 | 1 | 1511.73 |
RaBitQ HNSW (same parameters) - codec Lucene912HnswBinaryQuantizedVectorsFormat
Results:
| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | force merge s | num segments | index size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.448 | 0.095 | 1439443 | 10 | 6 | 32 | 40 | no | 86.15 | 45.45 | 1 | 1526.72 |
| 0.522 | 0.130 | 1439443 | 10 | 6 | 64 | 100 | no | 195.14 | 89.08 | 1 | 1555.16 |
| 0.542 | 0.167 | 1439443 | 10 | 6 | 64 | 250 | no | 524.03 | 207.12 | 1 | 1589.50 |
| 0.547 | 0.192 | 1439443 | 10 | 6 | 64 | 500 | no | 1035.65 | 415.44 | 1 | 1618.38 |
Only RaBitQ - codec Lucene912BinaryQuantizedVectorsFormat
Results:
| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | force merge s | num segments | index size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.000 | 87.658 | 1439443 | 10 | 100 | 64 | 40 | no | 12.39 | 8.01 | 1 | 1471.69 |
| 0.000 | 87.830 | 1439443 | 100 | 100 | 64 | 40 | no | 12.53 | 7.78 | 1 | 1471.69 |
The first issue we're facing is the large recall reduction for RaBitQ HNSW when compared to pure HNSW. Is this recall regression expected? Secondly, the pure RaBitQ implementation returns 0 recall, which is definitely suspect. Perhaps there is a bug on my end when trying to benchmark it using luceneutil?
@benwtrent I'll also try working with your branch of luceneutil to see if it changes the results
@ShashwatShivam so, the flat codec version is sneaky; depending on when you cloned the repo, it might not be doing anything.
Lucene by default will return nothing for approximate KNN over a flat index (I think this should be fixed).
But I think the latest BBQ branch should have this patched. I can check in a little while.
Here are some more flat-index test results. These were to exercise how the number of coarse-grained centroids changes recall & speed.
| Lucene912BinaryQuantizedVectorsFormat | recall@100 | recall@100\|500 | recall@100\|1000 | index time & force-merge | mean search time (brute-force) |
|---|---|---|---|---|---|
| hotpotqa-E5-small (5233329 vectors, 1 centroid) | 47 | 81 | 90 | 79618ms | 177ms |
| hotpotqa-E5-small (5233329 vectors, 255 centroids) | 52 | 86 | 94 | 267522ms | 211ms |
| hotpotqa-gte-base (5233329 vectors, 1 centroid) | 66 | 96 | 99 | 214254ms | 227ms |
| hotpotqa-gte-base (5233329 vectors, 255 centroids) | 71 | 98 | 99 | 617887ms | 275ms |
| dbpedia-entity-arctic (4635922 vectors, 1 centroid) | 56 | 89 | 95 | 199827ms | 205ms |
| dbpedia-entity-arctic (4635922 vectors, 255 centroids) | 59 | 91 | 96 | 618797ms | 220ms |
Models used:
- https://huggingface.co/thenlper/gte-base
- https://huggingface.co/intfloat/e5-small
- https://huggingface.co/Snowflake/snowflake-arctic-embed-m
I am currently working on moving this to Lucene101 format with the bug fixes we discovered in additional testing.
I will open a PR against luceneutil to update it to utilize these formats and show y'all some runs with it soon. But the PR is ready for general review.
Here is some luceneutil benchmarking. Some of these numbers actually contradict some of my previous benchmarking for int4, which is frustrating; I wonder what I did wrong then or now. Or if float32 got faster between then and now :)
Regardless, this shows that bit quantization is generally as fast as int4 search or faster, and you can get good recall with some oversampling. Combined with the 32x reduction in space, it's pretty nice.
The oversampling rates were [1, 1.5, 2, 3, 4, 5]. HNSW params: m=16, efSearch=100. Recall@100.
Cohere v2 1M
| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 395.18 | 411.67 | 175.9MB |
| 4 bit (compress) | 1877.47 | 491.13 | 439.7MB |
| 7 bit | 500.59 | 820.53 | 833.9MB |
| raw | 493.44 | 792.04 | 3132.8MB |
Cohere v3 1024 1M
| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 338.97 | 342.61 | 208MB |
| 4 bit (compress) | 1113.06 | 5490.36 | 578MB |
| 7 bit | 437.63 | 744.12 | 1094MB |
| raw | 408.75 | 798.11 | 4162MB |
e5Small
| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 161.84 | 42.37 | 57.6MB |
| 4 bit (compress) | 665.54 | 660.33 | 123.2MB |
| 7 bit | 267.13 | 89.99 | 219.6MB |
| raw | 249.26 | 77.81 | 793.5MB |
Hey @ShashwatShivam https://github.com/mikemccand/luceneutil/compare/main...benwtrent:luceneutil:bbq
that is the testing script I use.
But if Lucene has since been updated with a 101 codec, I would need to update this branch.
@benwtrent thanks for giving the link to the testing script, it works! One question: the index size it reports is larger than the HNSW index size. For example, working with a Cohere 768-dim dataset of 500k docs, the index sizes were 1488.83 MB and 1544.79 MB for HNSW and RaBitQ (Lucene101HnswBinaryQuantizedVectorsFormat) respectively, which seems incorrect. Could you please tell me why this discrepancy occurs, if you've seen this issue before?
@ShashwatShivam why do you think the index size (total size of all the files) should be smaller?
We store the binary quantized vectors and the floating point vectors. So, I would expect about a 5% increase in disk size just from vectors only.
I have also noticed that the HNSW graph itself ends up being more densely connected, but this is only a marginal increase in disk space as well.
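Back-of-the-envelope for that ~5% figure, assuming three corrective floats per quantized record as in the dot-product layout from the PR description (an illustration, not the exact file format):

```python
def extra_disk_fraction(dims: int, corrective_floats: int = 3) -> float:
    # Raw float32 vectors stay on disk; the quantized records are additive.
    raw_bytes = dims * 4
    quantized_bytes = dims // 8 + corrective_floats * 4
    return quantized_bytes / raw_bytes

# For 768 dims: (96 + 12) / 3072, about 3.5% extra from the vectors alone;
# the denser HNSW graph accounts for the rest of the observed increase.
print(f"{extra_disk_fraction(768):.1%}")
```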
@benwtrent makes sense, I wasn't accounting for the fact that the floating-point vectors are being stored too. I guess I should have instead asked how to reproduce the 'memory required' column, which shows a marked reduction for 1-bit quantization vs. raw?
@ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed their sizes together.
Hey @benwtrent, Thank you for all your help so far! I have a question about the oversampling used to increase recall. From what I understand, it scales up the top-k and fanout values by the oversampling factor. In the final match set, do we return only the best top-k documents (not scaled up, but the original value)? I couldn't locate the code where the reranking or selection of the best k results from the expanded match set happens. Could you please help me find that part? Thanks again!
Once this cool change is merged let's fix luceneutil's KNN benchy tooling (knnPerfTest.py, KnnGraphTester.java) to compute/report the "memory column" ("hot RAM", "searchable RAM", something)? Basically everything except the original (float32 or byte) vectors. I'll open an upstream luceneutil issue...
Quick update: we have been bothered by some of the numbers (for example, models like "gist" perform poorly) and we have some improvements to make before flipping back to "ready for review".
@mikemccand YES! That would be great! "Memory required" would be the quantized file size + hnsw graph file size (if the graph exists).
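One rough way to estimate that figure, under an assumed record layout and a simplified graph-size model (not exact Lucene file formats):

```python
def estimate_memory_mb(num_vectors: int, dims: int,
                       max_conn: int = 16, corrective_floats: int = 3) -> float:
    # Quantized vector file: packed bits plus corrective floats per vector.
    quantized = num_vectors * (dims // 8 + corrective_floats * 4)
    # Simplified graph model: up to 2*max_conn neighbor ords (4 bytes each)
    # per node on the bottom HNSW layer; upper layers are ignored here.
    graph = num_vectors * 2 * max_conn * 4
    return (quantized + graph) / (1024 * 1024)
```

Everything except the raw float32 (or byte) vectors counts, since those can stay on disk and are only touched during reranking.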
@ShashwatShivam
Sorry for the late reply. There are no "out of the box" rescoring actions directly in Lucene, mainly because the individual tools are (mostly) already available to you. You can ask for more overall vectors with one query, and then rescore the individual documents according to the raw vector comparisons. I admit, this requires some Lucene API know-how.
It would be good for a "vector scorer" to indicate whether it's an estimation or not, to allow for smarter actions in the KNN doc collector...
I conducted a benchmark using Cohere's 768-dimensional data. Here are the steps I followed, for reproducibility:

1. Set up the luceneutil repository following the installation instructions provided.
2. Switch branches to this specific branch, since the latest mainline branch is not compatible with the feature needed for this experiment.
3. Change the branch of `lucene_candidate` to `benwtrent:feature/adv-binarization-format` to incorporate the advanced binarization formats.
4. Run `knnPerfTest.py` after specifying the document and query file paths to the stored Cohere data files. The runtime parameters were set as follows:
   - nDoc = 500,000
   - topK = 10
   - fanout = 100
   - maxConn = 32
   - beamWidth = 100
   - oversample values tested: {1, 1.5, 2, 3, 4, 5}

   I used `quantizeBits = 1` for RaBitQ+HNSW and `quantizeBits = 32` for regular HNSW.
5. A comparison was performed between HNSW and RaBitQ; the observed recall-latency tradeoff is shown in the attached image.
FYI, a blog post on RaBitQ:
https://dev.to/gaoj0017/quantization-in-the-counterintuitive-high-dimensional-space-4feg
Thanks, Tanya @tanyaroosta, for sharing our blog about RaBitQ in this thread. I am the first author of the RaBitQ paper. I am glad to know that our RaBitQ method has been discussed in the threads here. Regarding the BBQ (Better Binary Quantization) method mentioned in these threads, my understanding is that it largely follows the framework of RaBitQ and makes some minor modifications for practical performance considerations. The claimed key features of BBQ as described in a blog from Elastic - Better Binary Quantization (BBQ) in Lucene and Elasticsearch - e.g., normalization around a centroid, multiple error correction values, asymmetric quantization, and bit-wise operations, all originate from our RaBitQ paper.
We note that it is quite common for industry to customize methods from academia to better suit their applications, but industry rarely gives the variant a new name and claims it as a new method. For example, the PQ and HNSW methods are from academia and have been widely adopted in industry with some modifications, but industry still respects their original names. We believe the same practice should be followed for RaBitQ.
In addition, we would like to share that we have extended RaBitQ to support quantization beyond 1 bit per dimension (e.g., 2-bit, 3-bit, …). The paper on the extended RaBitQ was made available in September 2024. It achieves this by constructing a larger codebook than that of RaBitQ, and it can be equivalently understood as an optimized scalar quantization method. For details, please refer to the paper and also a blog that we recently posted.
Closing this PR in deference to this one: https://github.com/apache/lucene/pull/14078
An evolution of scalar quantization proved more flexible and provided better recall in our experiments.