Add a Better Binary Quantizer (RaBitQ) format for dense vectors
High-level design
RaBitQ is essentially a better binary quantization, which works across all models we have tested against. Like PQ, it does require coarse-grained clustering to be effective at higher vector densities ("effective" being defined as requiring only 5x or lower oversampling for recall > 95%). But in our testing, the number of vectors required per cluster can be exceptionally large (tens to hundreds of millions).
For Euclidean vectors, as stored in the index:
| quantized vector | distance_to_centroid | vector magnitude |
|---|---|---|
| (vector_dimension/8) bytes | float | float |
For dot-product vectors:
| quantized vector | vector dot-product with binarized self | vector magnitude | centroid dot-product |
|---|---|---|---|
| (vector_dimension/8) bytes | float | float | float |
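As a rough illustration of the two record layouts above, here is the per-vector on-disk arithmetic (the helper names are mine for illustration, not Lucene's):

```python
def euclidean_record_bytes(dims: int) -> int:
    # quantized vector (1 bit/dim) + distance_to_centroid + magnitude
    return dims // 8 + 4 + 4

def dot_product_record_bytes(dims: int) -> int:
    # quantized vector + dot-product with binarized self + magnitude
    # + centroid dot-product
    return dims // 8 + 4 + 4 + 4

# A 768-dim vector costs 96 bytes of quantized bits plus the float
# corrective terms, versus 768 * 4 = 3072 bytes for raw float32.
print(euclidean_record_bytes(768))    # 104
print(dot_product_record_bytes(768))  # 108
```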
The vector metadata contains all the regular things (similarity, encoding, sparse vector DISI, etc.).
For indexing into HNSW we actually have a multi-step process. RaBitQ encodes query vectors differently than index vectors. Consequently, during segment merge & HNSW building, another temporary file is written containing the query-quantized vectors over the configured centroids. One downside is that this temporary file will actually be larger than the regular vector index, because we use asymmetric quantization to keep good information around. But once the merge is complete, this file is deleted.
We then read from the query temporary file when adding a vector to the graph, and when exploring HNSW we search the indexed quantized values.
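To illustrate the asymmetric idea (index side reduced to 1 bit per dimension, query side kept at a finer resolution, here 4 bits), a toy sketch follows. All names are mine, the centering on a zero centroid is a simplification, and the corrective terms from the layouts above are omitted:

```python
def binarize(v, centroid):
    # Index side: 1 bit per dimension (sign relative to the centroid).
    return [1 if x - c > 0 else 0 for x, c in zip(v, centroid)]

def quantize_query(q, centroid, bits=4):
    # Query side: uniform scalar quantization of the centered query.
    centered = [x - c for x, c in zip(q, centroid)]
    lo, hi = min(centered), max(centered)
    step = (hi - lo) / (2 ** bits - 1)
    codes = [round((x - lo) / step) for x in centered]
    return codes, lo, step

def estimated_inner_product(index_bits, codes, lo, step):
    # Approximates <q - centroid, sign(v - centroid)>; the real scheme
    # layers corrective terms (magnitudes, centroid dot-products) on top.
    total = 0.0
    for b, code in zip(index_bits, codes):
        dequant = lo + code * step
        total += dequant if b == 1 else -dequant
    return total
```

The per-dimension query error is bounded by `step / 2`, which is why keeping more bits on the query side preserves "good information" even though the index side stores only signs.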
closes: https://github.com/apache/lucene/issues/13650
@benwtrent
possibly switch to LongValues for storing vectorOrd -> centroidOrd mapping
I was thinking about adding the centroid mappings as LongValues at the end of the meta file, but this could potentially make the meta file quite large (for 100M docs, we would need an extra 100MB). We really try to keep meta files small, so I would prefer either:
- to keep the current approach (add a byte at the end of each vector in the vectors file). Indeed, this may throw off paging size, but maybe the effect on memory-mapped files is not big?
- to add an extra file for the centroids mapping. The centroids mapping can be accessed through a memory-mapped file, or loaded directly into memory on first use.
What do you think?
For now, we are keeping the 1st (current) approach.
100MB assumes that, even when compressed, it's a single byte per centroid ord. 100M vectors might only have 2 centroids and thus need only a single bit per vector to store.
Also, I would expect the centroids to be at the end of the "veb" file, not the metadata, like we already do for the sparse vector ord-to-doc resolution.
But, either solution needs testing for sure.
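For reference, the bit-count arithmetic behind a compressed vectorOrd -> centroidOrd mapping is just ceil(log2(num_centroids)) bits per vector. A sketch (not the actual encoding Lucene would use):

```python
import math

def bits_per_centroid_ord(num_centroids: int) -> int:
    # With a single centroid there is nothing to store at all;
    # otherwise each vector needs ceil(log2(n)) bits for its centroid ord.
    if num_centroids <= 1:
        return 0
    return math.ceil(math.log2(num_centroids))

print(bits_per_centroid_ord(2))    # 1 bit per vector instead of a full byte
print(bits_per_centroid_ord(255))  # 8 bits, i.e. one byte per vector
```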
@benwtrent we are trying to run tests with the RaBitQ Lucene implementation, and are not able to replicate the numbers reported in the paper. Have you run tests as part of the implementation, and if so, could you please point me to the location? We want to see if we have made an error somewhere. Thanks.
@tanyaroosta we are still doing larger scale testing, but if you want to test with LuceneUtil, here is the branch I am using: https://github.com/mikemccand/luceneutil/compare/main...benwtrent:luceneutil:bbq?expand=1
2-5x oversampling is required with HNSW to get good recall.
The original paper does periodic rescoring with the raw vectors which effectively means you need random access with raw vectors, even when NOT using HNSW. This just doesn't scale when trying to reduce memory usage.
Instead, we are opting for a more traditional approach of oversampling and reranking.
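That oversample-and-rerank flow can be sketched roughly like this (hypothetical function names, not Lucene's API): gather `oversample * k` candidates with the cheap quantized score, then rescore only those with the raw float vectors.

```python
def search_with_rerank(query, k, oversample, quantized_score, raw_score, num_docs):
    # Phase 1: rank docs by the approximate (quantized) score and keep
    # an oversampled candidate set. Sequential access is enough here.
    candidates = sorted(range(num_docs),
                        key=lambda d: quantized_score(query, d), reverse=True)
    candidates = candidates[: int(k * oversample)]
    # Phase 2: rerank only the small candidate set with exact raw-vector
    # scores, so random access to raw vectors is bounded by k * oversample.
    reranked = sorted(candidates, key=lambda d: raw_score(query, d), reverse=True)
    return reranked[:k]
```

The point versus the paper's periodic rescoring is that random access to raw vectors is confined to the final, small candidate set.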
Following up on the above comment by tanyaroosta: the dataset I was using for benchmarking RaBitQ through luceneutil (main branch) was Amazon's ASIN and query embeddings (256-dim float vectors). Here are three sets of results I got:
Normal HNSW
Results:
| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | force merge s | num segments | index size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.721 | 0.128 | 1439443 | 10 | 6 | 32 | 40 | no | 81.51 | 42.58 | 1 | 1455.25 |
| 0.828 | 0.166 | 1439443 | 10 | 6 | 64 | 100 | no | 202.30 | 78.03 | 1 | 1472.66 |
| 0.886 | 0.202 | 1439443 | 10 | 6 | 64 | 250 | no | 545.83 | 242.29 | 1 | 1493.95 |
| 0.914 | 0.227 | 1439443 | 10 | 6 | 64 | 500 | no | 1111.43 | 452.62 | 1 | 1511.73 |
RaBitQ HNSW (same parameters) - codec Lucene912HnswBinaryQuantizedVectorsFormat
Results:
| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | force merge s | num segments | index size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.448 | 0.095 | 1439443 | 10 | 6 | 32 | 40 | no | 86.15 | 45.45 | 1 | 1526.72 |
| 0.522 | 0.130 | 1439443 | 10 | 6 | 64 | 100 | no | 195.14 | 89.08 | 1 | 1555.16 |
| 0.542 | 0.167 | 1439443 | 10 | 6 | 64 | 250 | no | 524.03 | 207.12 | 1 | 1589.50 |
| 0.547 | 0.192 | 1439443 | 10 | 6 | 64 | 500 | no | 1035.65 | 415.44 | 1 | 1618.38 |
Only RaBitQ - codec Lucene912BinaryQuantizedVectorsFormat
Results:
| recall | latency (ms) | nDoc | topK | fanout | maxConn | beamWidth | quantized | index s | force merge s | num segments | index size (MB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.000 | 87.658 | 1439443 | 10 | 100 | 64 | 40 | no | 12.39 | 8.01 | 1 | 1471.69 |
| 0.000 | 87.830 | 1439443 | 100 | 100 | 64 | 40 | no | 12.53 | 7.78 | 1 | 1471.69 |
The first issue we're facing is the large recall reduction for RaBitQ HNSW when compared to pure HNSW. Is this recall regression expected? Secondly, the pure RaBitQ implementation returns 0 recall, which is definitely suspect. Perhaps there is a bug on my end when trying to benchmark it using luceneutil?
@benwtrent I'll also try working with your branch of luceneutil to see if it changes the results
@ShashwatShivam so, the flat codec version is sneaky; depending on when you cloned the repo, it might not be doing anything.
Lucene by default will return nothing for approximate KNN over a flat index (I think this should be fixed).
But I think the latest BBQ branch should have this patched. I can check in a little while.
Here are some more flat-index test results. These were to exercise how the number of coarse-grained centroids changes recall & speed.
| Lucene912BinaryQuantizedVectorsFormat | recall@100 | recall@100\|500 | recall@100\|1000 | index time & force-merge | mean search time (brute-force) |
|---|---|---|---|---|---|
| hotpotqa-E5-small (5233329 vectors, 1 centroid) | 47 | 81 | 90 | 79618ms | 177ms |
| hotpotqa-E5-small (5233329 vectors, 255 centroids) | 52 | 86 | 94 | 267522ms | 211ms |
| hotpotqa-gte-base (5233329 vectors, 1 centroid) | 66 | 96 | 99 | 214254ms | 227ms |
| hotpotqa-gte-base (5233329 vectors, 255 centroids) | 71 | 98 | 99 | 617887ms | 275ms |
| dbpedia-entity-arctic (4635922 vectors, 1 centroid) | 56 | 89 | 95 | 199827ms | 205ms |
| dbpedia-entity-arctic (4635922 vectors, 255 centroids) | 59 | 91 | 96 | 618797ms | 220ms |
Models used:
- https://huggingface.co/thenlper/gte-base
- https://huggingface.co/intfloat/e5-small
- https://huggingface.co/Snowflake/snowflake-arctic-embed-m
I am currently working on moving this to Lucene101 format with the bug fixes we discovered in additional testing.
I will open a PR against luceneutil to update it to utilize these formats and show y'all some runs with it soon. But the PR is ready for general review.
Here is some luceneutil benchmarking. Some of these numbers actually contradict some of my previous benchmarking for int4, which is frustrating; I wonder what I did wrong then or now. Or if float32 got faster between then and now :)
Regardless, this shows that bit quantization is generally as fast as int4 search or faster, and you can get good recall with some oversampling. Combined with the 32x reduction in space, it's pretty nice.
The oversampling rates were [1, 1.5, 2, 3, 4, 5]. HNSW params: m=16, efSearch=100. Recall@100.
Cohere v2 1M
| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 395.18 | 411.67 | 175.9MB |
| 4 bit (compress) | 1877.47 | 491.13 | 439.7MB |
| 7 bit | 500.59 | 820.53 | 833.9MB |
| raw | 493.44 | 792.04 | 3132.8MB |
Cohere v3 1024 1M
| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 338.97 | 342.61 | 208MB |
| 4 bit (compress) | 1113.06 | 5490.36 | 578MB |
| 7 bit | 437.63 | 744.12 | 1094MB |
| raw | 408.75 | 798.11 | 4162MB |
e5Small
| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 161.84 | 42.37 | 57.6MB |
| 4 bit (compress) | 665.54 | 660.33 | 123.2MB |
| 7 bit | 267.13 | 89.99 | 219.6MB |
| raw | 249.26 | 77.81 | 793.5MB |
Hey @ShashwatShivam https://github.com/mikemccand/luceneutil/compare/main...benwtrent:luceneutil:bbq
that is the testing script I use.
But if Lucene has since been updated with a 101 codec, I would need to update this branch.
@benwtrent thanks for giving the link to the testing script, it works! One question: the index size it reports is larger than the HNSW index size. For example, working with a Cohere 768-dim dataset of 500k docs, the index sizes were 1488.83 MB and 1544.79 MB for HNSW and RaBitQ (Lucene101HnswBinaryQuantizedVectorsFormat) respectively, which seems incorrect. Could you please tell me why this discrepancy occurs, if you've seen this issue before?
@ShashwatShivam why do you think the index size (total size of all the files) should be smaller?
We store the binary quantized vectors and the floating point vectors. So, I would expect about a 5% increase in disk size just from vectors only.
I have also noticed that the HNSW graph itself ends up being more densely connected, but this is only a marginal increase in disk space as well.
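Back-of-the-envelope for that ~5% figure, assuming three corrective floats per quantized record as in the dot-product layout from the PR description (an illustration, not the exact file format):

```python
def extra_disk_fraction(dims: int, corrective_floats: int = 3) -> float:
    # Raw float32 vectors stay on disk; the quantized records are additive.
    raw_bytes = dims * 4
    quantized_bytes = dims // 8 + corrective_floats * 4
    return quantized_bytes / raw_bytes

# For 768 dims: (96 + 12) / 3072, about 3.5% extra from the vectors alone;
# the denser HNSW graph accounts for the rest of the observed increase.
print(f"{extra_disk_fraction(768):.1%}")
```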
@benwtrent makes sense, I wasn't accounting for the fact that the floating-point vectors are being stored too. I guess I should have instead asked how to reproduce the 'memory required' column, which shows a marked reduction for 1-bit quantization vs. raw?
@ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed their sizes together.
Hey @benwtrent, Thank you for all your help so far! I have a question about the oversampling used to increase recall. From what I understand, it scales up the top-k and fanout values by the oversampling factor. In the final match set, do we return only the best top-k documents (not scaled up, but the original value)? I couldn't locate the code where the reranking or selection of the best k results from the expanded match set happens. Could you please help me find that part? Thanks again!
Once this cool change is merged let's fix luceneutil's KNN benchy tooling (knnPerfTest.py, KnnGraphTester.java) to compute/report the "memory column" ("hot RAM", "searchable RAM", something)? Basically everything except the original (float32 or byte) vectors. I'll open an upstream luceneutil issue...
Quick update: we have been bothered by some of the numbers (for example, models like "gist" perform poorly) and we have some improvements to make before flipping back to "ready for review".
@mikemccand YES! That would be great! "Memory required" would be the quantized file size + hnsw graph file size (if the graph exists).
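One rough way to estimate that figure, under an assumed record layout and a simplified graph-size model (not exact Lucene file formats):

```python
def estimate_memory_mb(num_vectors: int, dims: int,
                       max_conn: int = 16, corrective_floats: int = 3) -> float:
    # Quantized vector file: packed bits plus corrective floats per vector.
    quantized = num_vectors * (dims // 8 + corrective_floats * 4)
    # Simplified graph model: up to 2*max_conn neighbor ords (4 bytes each)
    # per node on the bottom HNSW layer; upper layers are ignored here.
    graph = num_vectors * 2 * max_conn * 4
    return (quantized + graph) / (1024 * 1024)
```

Everything except the raw float32 (or byte) vectors counts, since those can stay on disk and are only touched during reranking.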
@ShashwatShivam
Sorry for the late reply. There are no "out of the box" rescoring actions directly in Lucene, mainly because the individual tools are (mostly) already available to you. You can ask for more overall vectors with one query, and then rescore the individual documents according to the raw vector comparisons. I admit, this requires some Lucene API know-how.
It would be good for a "vector scorer" to indicate whether it's an estimation or not, to allow for smarter actions in the KNN doc collector...
I conducted a benchmark using Cohere's 768-dimensional data. Here are the steps I followed, for reproducibility:

1. Set up the luceneutil repository following the installation instructions provided.
2. Switch branches to this specific branch, since the latest mainline branch is not compatible with the feature needed for this experiment.
3. Change the branch of `lucene_candidate` to `benwtrent:feature/adv-binarization-format` to incorporate the advanced binarization formats.
4. Run `knnPerfTest.py` after specifying the document and query file paths to the stored Cohere data files. The runtime parameters were set as follows:
   - nDoc = 500,000
   - topK = 10
   - fanout = 100
   - maxConn = 32
   - beamWidth = 100
   - oversample values tested: {1, 1.5, 2, 3, 4, 5}

   I used `quantizeBits = 1` for RaBitQ+HNSW and `quantizeBits = 32` for regular HNSW.
5. A comparison was performed between HNSW and RaBitQ; the observed recall-latency tradeoff is shown in the attached image.
FYI, a blog post on RaBitQ:
https://dev.to/gaoj0017/quantization-in-the-counterintuitive-high-dimensional-space-4feg
Thanks, Tanya @tanyaroosta, for sharing our blog about RaBitQ in this thread. I am the first author of the RaBitQ paper. I am glad to know that our RaBitQ method has been discussed in the threads here. Regarding the BBQ (Better Binary Quantization) method mentioned in these threads, my understanding is that it largely follows the framework of RaBitQ and makes some minor modifications for practical performance considerations. The claimed key features of BBQ as described in a blog from Elastic - Better Binary Quantization (BBQ) in Lucene and Elasticsearch - e.g., normalization around a centroid, multiple error correction values, asymmetric quantization, and bit-wise operations, all originate from our RaBitQ paper.
We note that it is quite common for industry to customize methods from academia to better suit their applications, but industry rarely gives the variant a new name and claims it as a new method. For example, the PQ and HNSW methods are from academia and have been widely adopted in industry with some modifications, but industry still respects their original names. We believe the same practice should be followed for RaBitQ.
In addition, we would like to share that we have extended RaBitQ to support quantization beyond 1 bit per dimension (e.g., 2-bit, 3-bit, …). The paper on the extended RaBitQ was made available in September 2024. It achieves this by constructing a larger codebook than that of RaBitQ, and it can be equivalently understood as an optimized scalar quantization method. For details, please refer to the paper and also a blog that we recently posted.
Closing this PR in deference to this one: https://github.com/apache/lucene/pull/14078
An evolution of scalar quantization proved more flexible and provided better recall in our experiments.