
Add a Better Binary Quantizer (RaBitQ) format for dense vectors

Open benwtrent opened this issue 1 year ago • 10 comments

High-level design

RaBitQ is essentially a better binary quantization, and it works across all the models we have tested against. Like PQ, it does require coarse-grained clustering to be effective at higher vector densities ("effective" meaning recall > 95% with 5x or lower oversampling). But in our testing, the number of vectors required per cluster can be exceptionally large (tens to hundreds of millions).

The euclidean vectors as stored in the index:

| quantized vector | distance_to_centroid | vector magnitude |
|---|---|---|
| (vector_dimension/8) bytes | float | float |

For dot-product vectors:

| quantized vector | vector dot-product with binarized self | vector magnitude | centroid dot-product |
|---|---|---|---|
| (vector_dimension/8) bytes | float | float | float |
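As a back-of-the-envelope check on the layouts above, the per-vector sizes work out as follows. This is just a sketch; the function name and metric labels are mine, not Lucene's:

```python
def rabitq_bytes_per_vector(dim: int, similarity: str) -> int:
    """Per-vector storage for the two quantized layouts sketched above."""
    quantized = dim // 8              # 1 bit per dimension, packed into bytes
    if similarity == "euclidean":
        # quantized vector + distance_to_centroid + vector magnitude
        return quantized + 4 + 4
    if similarity == "dot_product":
        # quantized vector + dot-product with binarized self
        # + vector magnitude + centroid dot-product
        return quantized + 4 + 4 + 4
    raise ValueError(similarity)

# A 768-dim float32 vector takes 3072 bytes raw; its quantized record:
print(rabitq_bytes_per_vector(768, "euclidean"))    # 104
print(rabitq_bytes_per_vector(768, "dot_product"))  # 108
```

So the quantized records are roughly 1/30th of the raw float32 vectors, before counting the graph.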

The vector metadata holds all the regular things (similarity, encoding, sparse vector DISI, etc.).

For indexing into HNSW we actually have a multi-step process. RaBitQ encodes query vectors differently from index vectors. Consequently, during segment merge and HNSW building, another temporary file is written containing the query-quantized vectors over the configured centroids. One downside is that this temporary file is actually larger than the regular vector index, because we use asymmetric quantization to keep good information around. But once the merge is complete, this file is deleted.

We then read from the query temporary file when adding a vector to the graph and when exploring HNSW, we search the indexed quantized values.
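The asymmetric scheme can be sketched like this: index vectors keep 1 bit per dimension (the sign relative to the centroid), while query vectors are scalar-quantized at higher precision, and the similarity kernel is an integer dot product between the two. This is a simplified illustration under my own naming; the real RaBitQ estimator also applies the corrective terms stored per vector:

```python
import numpy as np

def quantize_index(v, centroid):
    """Index side: 1 bit per dimension, the sign relative to the centroid."""
    return (v - centroid > 0).astype(np.uint8)

def quantize_query(q, centroid, bits=4):
    """Query side: scalar-quantize each dimension to `bits` bits."""
    r = q - centroid
    lo, hi = float(r.min()), float(r.max())
    step = (hi - lo) / (2 ** bits - 1) or 1.0  # guard against flat vectors
    return np.round((r - lo) / step).astype(np.int64), lo, step

rng = np.random.default_rng(0)
centroid = np.zeros(8)
doc_bits = quantize_index(rng.normal(size=8), centroid)
q_codes, q_lo, q_step = quantize_query(rng.normal(size=8), centroid)

# The hot loop is an integer dot product between 4-bit query codes and
# 1-bit document codes, which maps to cheap bitwise/popcount-style kernels.
estimate = int(np.dot(q_codes, doc_bits))
```

The asymmetry is why the temporary query-side file is larger than the index: 4 bits per dimension instead of 1.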

closes: https://github.com/apache/lucene/issues/13650

benwtrent avatar Aug 13 '24 19:08 benwtrent

@benwtrent

> possibly switch to LongValues for storing vectorOrd -> centroidOrd mapping

I was thinking about adding the centroid mappings as LongValues at the end of the meta file, but this could make the meta file quite large (for 100M docs, we would need an extra 100MB). We really try to keep meta files small, so I would prefer either:

  • keeping the current approach (add a byte at the end of each vector in the vectors file). Indeed, this may throw off the page alignment, but maybe the effect on memory-mapped files is not big?
  • adding an extra file for the centroid mapping. The mapping could then be accessed through a memory-mapped file, or loaded directly into memory on first use.

What do you think?

For now, we are keeping the first (current) approach.

mayya-sharipova avatar Aug 21 '24 18:08 mayya-sharipova

100MB assumes that, even when compressed, it takes a single byte per centroid ordinal. 100M vectors might only have 2 centroids and thus only need a couple of bits to store.
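To put numbers on that, a packed encoding needs only about log2(centroid count) bits per vector rather than a full byte. A sketch, ignoring any per-block metadata that a LongValues-style encoding would add:

```python
import math

def bits_per_centroid_ord(num_centroids: int) -> int:
    """Bits needed per vector to store its centroid ordinal."""
    return max(1, math.ceil(math.log2(num_centroids)))

def mapping_size_mb(num_vectors: int, num_centroids: int) -> float:
    return num_vectors * bits_per_centroid_ord(num_centroids) / 8 / 1e6

# 100M vectors: one full byte each is ~100MB, but a packed encoding
# scales with the number of distinct centroids:
print(mapping_size_mb(100_000_000, 256))  # 100.0
print(mapping_size_mb(100_000_000, 2))    # 12.5
```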

Also, I would expect the centroids to be at the end of the "veb" file, not the metadata, like we already do for the sparse vector ord-to-doc resolution.

But, either solution needs testing for sure.

benwtrent avatar Aug 21 '24 19:08 benwtrent

@benwtrent we are trying to run tests with the RaBitQ Lucene implementation, and are not able to replicate the numbers reported in the paper. Have you run tests as part of the implementation, and if so, could you please point me to the location? We want to see if we have made an error somewhere. Thanks.

tanyaroosta avatar Sep 17 '24 15:09 tanyaroosta

@tanyaroosta we are still doing larger scale testing, but if you want to test with LuceneUtil, here is the branch I am using: https://github.com/mikemccand/luceneutil/compare/main...benwtrent:luceneutil:bbq?expand=1

2-5x oversampling is required with HNSW to get good recall.

The original paper does periodic rescoring with the raw vectors, which effectively means you need random access to the raw vectors, even when NOT using HNSW. That just doesn't scale when trying to reduce memory usage.

Instead, we are opting for a more traditional approach of oversampling and reranking.

benwtrent avatar Sep 17 '24 15:09 benwtrent

Following up on the above comment by tanyaroosta: the dataset I was using for benchmarking RaBitQ through luceneutil (main branch) was Amazon's ASIN and query embeddings (256-dim float vectors). Here are three sets of results:

Normal HNSW

Results:
recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.721         0.128  1439443    10       6       32         40         no    81.51          42.58             1          1455.25
 0.828         0.166  1439443    10       6       64        100         no   202.30          78.03             1          1472.66
 0.886         0.202  1439443    10       6       64        250         no   545.83         242.29             1          1493.95
 0.914         0.227  1439443    10       6       64        500         no  1111.43         452.62             1          1511.73

RaBitQ HNSW (same parameters) - codec Lucene912HnswBinaryQuantizedVectorsFormat

Results:
recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.448         0.095  1439443    10       6       32         40         no    86.15          45.45             1          1526.72
 0.522         0.130  1439443    10       6       64        100         no   195.14          89.08             1          1555.16
 0.542         0.167  1439443    10       6       64        250         no   524.03         207.12             1          1589.50
 0.547         0.192  1439443    10       6       64        500         no  1035.65         415.44             1          1618.38

Only RaBitQ - codec Lucene912BinaryQuantizedVectorsFormat

Results:
recall  latency (ms)     nDoc  topK  fanout  maxConn  beamWidth  quantized  index s  force merge s  num segments  index size (MB)
 0.000        87.658  1439443    10     100       64         40         no    12.39           8.01             1          1471.69
 0.000        87.830  1439443   100     100       64         40         no    12.53           7.78             1          1471.69

The first issue we're facing is the large recall reduction for RaBitQ HNSW compared to pure HNSW. Is this recall regression expected? Secondly, the pure RaBitQ run returns 0 recall, which is definitely suspect. Perhaps there is a bug on my end in benchmarking it with luceneutil?

@benwtrent I'll also try working with your branch of luceneutil to see if it changes the results

ShashwatShivam avatar Sep 17 '24 15:09 ShashwatShivam

@ShashwatShivam so, the flat codec version is sneaky: depending on when you cloned the repo, it might not be doing anything.

Lucene by default will return nothing for approximate KNN over a flat index (I think this should be fixed).

But I think the latest BBQ branch has this patched. I can check in a little while.

benwtrent avatar Sep 17 '24 16:09 benwtrent

Here are some more flat-index test results, exercising how the number of coarse-grained centroids changes recall and speed.

| Lucene912BinaryQuantizedVectorsFormat | recall@100 | recall@100\|500 | recall@100\|1000 | index time & force-merge | mean search time (brute-force) |
|---|---|---|---|---|---|
| hotpotqa-E5-small (5233329 vectors, 1 centroid) | 47 | 81 | 90 | 79618ms | 177ms |
| hotpotqa-E5-small (5233329 vectors, 255 centroids) | 52 | 86 | 94 | 267522ms | 211ms |
| hotpotqa-gte-base (5233329 vectors, 1 centroid) | 66 | 96 | 99 | 214254ms | 227ms |
| hotpotqa-gte-base (5233329 vectors, 255 centroids) | 71 | 98 | 99 | 617887ms | 275ms |
| dbpedia-entity-arctic (4635922 vectors, 1 centroid) | 56 | 89 | 95 | 199827ms | 205ms |
| dbpedia-entity-arctic (4635922 vectors, 255 centroids) | 59 | 91 | 96 | 618797ms | 220ms |

Models used:

  • https://huggingface.co/thenlper/gte-base
  • https://huggingface.co/intfloat/e5-small
  • https://huggingface.co/Snowflake/snowflake-arctic-embed-m

benwtrent avatar Sep 18 '24 19:09 benwtrent

I am currently working on moving this to Lucene101 format with the bug fixes we discovered in additional testing.

benwtrent avatar Oct 16 '24 16:10 benwtrent

I will open a PR against luceneutil to update it to use these formats, and will show y'all some runs with it soon. But the PR is ready for general review.

benwtrent avatar Oct 18 '24 20:10 benwtrent

Here is some luceneutil benchmarking. Some of these numbers actually contradict some of my previous benchmarking for int4, which is frustrating; I wonder what I did wrong then or now. Or maybe float32 got faster between then and now :)

Regardless, this shows that bit quantization is generally as fast as int4 search or faster, and you can get good recall with some oversampling. Combined with the 32x reduction in space, it's pretty nice.

The oversampling rates were [1, 1.5, 2, 3, 4, 5]. HNSW params: m=16, efSearch=100. Recall@100.
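For reference, recall@k here measures the overlap between the approximate top-k (after oversampled search and reranking) and the exact top-k. A minimal sketch of the metric, not luceneutil's actual code:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that the approximate search recovered."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# e.g. the approximate top-100 recovered 90 of the exact top-100:
exact = list(range(100))
approx = list(range(90)) + list(range(200, 210))
print(recall_at_k(approx, exact, 100))  # 0.9
```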

Cohere v2 1M

| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 395.18 | 411.67 | 175.9MB |
| 4 bit (compress) | 1877.47 | 491.13 | 439.7MB |
| 7 bit | 500.59 | 820.53 | 833.9MB |
| raw | 493.44 | 792.04 | 3132.8MB |

[chart: cohere-v2-bit-1M]

Cohere v3 1M (1024 dims)

| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 338.97 | 342.61 | 208MB |
| 4 bit (compress) | 1113.06 | 5490.36 | 578MB |
| 7 bit | 437.63 | 744.12 | 1094MB |
| raw | 408.75 | 798.11 | 4162MB |

[chart: cohere-v3-bit-1M]

e5Small

| quantization | Index Time | Force Merge time | Mem Required |
|---|---|---|---|
| 1 bit | 161.84 | 42.37 | 57.6MB |
| 4 bit (compress) | 665.54 | 660.33 | 123.2MB |
| 7 bit | 267.13 | 89.99 | 219.6MB |
| raw | 249.26 | 77.81 | 793.5MB |

[chart: e5small-bit-500k]

benwtrent avatar Oct 21 '24 23:10 benwtrent

Hey @ShashwatShivam https://github.com/mikemccand/luceneutil/compare/main...benwtrent:luceneutil:bbq

that is the testing script I use.

But if Lucene has since been updated with a 101 codec, I would need to update this branch.

benwtrent avatar Nov 05 '24 12:11 benwtrent

@benwtrent thanks for the link to the testing script; it works! One question: the index size it reports is larger than the HNSW index size. For example, with a Cohere 768-dim dataset of 500k docs, the index sizes were 1488.83 MB for HNSW and 1544.79 MB for RaBitQ (Lucene101HnswBinaryQuantizedVectorsFormat), which seems incorrect. Could you please tell me why this discrepancy occurs, if you've seen this issue before?

ShashwatShivam avatar Nov 07 '24 15:11 ShashwatShivam

@ShashwatShivam why do you think the index size (total size of all the files) should be smaller?

We store both the binary-quantized vectors and the floating-point vectors. So I would expect about a 5% increase in disk size from the vectors alone.

I have also noticed that the HNSW graph itself ends up being more densely connected, but this is only a marginal increase in disk space as well.
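Back-of-the-envelope, the quantized data adds only a few percent on top of the raw float32 vectors. A sketch assuming three corrective floats per vector, as in the dot-product layout described at the top of the issue:

```python
def quantized_overhead_pct(dim: int, corrective_floats: int = 3) -> float:
    """Size of the 1-bit codes plus corrective floats, relative to raw float32."""
    raw_bytes = 4 * dim                                 # float32 vector
    quantized_bytes = dim // 8 + 4 * corrective_floats  # codes + corrections
    return 100.0 * quantized_bytes / raw_bytes

# For 768 dims: 96 bytes of codes + 12 bytes of corrections vs 3072 raw bytes
print(round(quantized_overhead_pct(768), 1))  # 3.5
```

The denser HNSW graph accounts for the rest of the gap up to the observed increase.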

benwtrent avatar Nov 07 '24 16:11 benwtrent

@benwtrent makes sense; I wasn't accounting for the fact that the floating-point vectors are stored too. I guess I should have instead asked how to reproduce the 'memory required' column, which shows a marked reduction for 1-bit quantization vs. raw?

ShashwatShivam avatar Nov 07 '24 23:11 ShashwatShivam

@ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed their sizes together.

benwtrent avatar Nov 08 '24 12:11 benwtrent

Hey @benwtrent, thank you for all your help so far! I have a question about the oversampling used to increase recall. From what I understand, it scales up the top-k and fanout values by the oversampling factor. In the final match set, do we return only the best top-k documents (the original value, not the scaled-up one)? I couldn't locate the code where the reranking, or selection of the best k results from the expanded match set, happens. Could you please help me find that part? Thanks again!

ShashwatShivam avatar Nov 10 '24 08:11 ShashwatShivam

> @ShashwatShivam I don't think there is a "memory column" provided anywhere. I simply looked at the individual file sizes (veb, vex) and summed their sizes together.

Once this cool change is merged let's fix luceneutil's KNN benchy tooling (knnPerfTest.py, KnnGraphTester.java) to compute/report the "memory column" ("hot RAM", "searchable RAM", something)? Basically everything except the original (float32 or byte) vectors. I'll open an upstream luceneutil issue...

mikemccand avatar Nov 12 '24 12:11 mikemccand

Quick update: some of the numbers have been bothering us (for example, models like "gist" perform poorly), and we have some improvements to make before flipping back to "ready for review".

@mikemccand YES! That would be great! "Memory required" would be the quantized file size + hnsw graph file size (if the graph exists).

@ShashwatShivam

Sorry for the late reply. There are no "out of the box" rescoring actions directly in Lucene, mainly because the individual tools are (mostly) already available to you: you can ask one query for more overall vectors, and then rescore the individual documents according to the raw-vector comparisons. I admit this requires some Lucene API know-how.
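The oversample-then-rerank pattern described here can be sketched as follows. This is a stand-in, not Lucene code: `approx_scores` represents the cheap quantized scores, which in Lucene would come from the KNN collector rather than a precomputed array:

```python
import numpy as np

def search_with_rerank(query, approx_scores, raw_vectors, k, oversample):
    """Take round(k * oversample) candidates by the cheap approximate score,
    then rerank just those candidates with exact dot products on raw vectors."""
    m = min(len(approx_scores), int(round(k * oversample)))
    candidates = np.argpartition(-approx_scores, m - 1)[:m]
    exact = raw_vectors[candidates] @ query        # full-precision rescoring
    return candidates[np.argsort(-exact)[:k]]      # keep only the original top-k
```

Only the oversampled candidates ever touch the raw vectors, which is what avoids the random-access cost of the paper's periodic rescoring.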

It would be good for a "vector scorer" to indicate whether it is an estimate, to allow smarter actions in the KNN doc collector...

benwtrent avatar Nov 12 '24 13:11 benwtrent

I conducted a benchmark using Cohere's 768-dimensional data. Here are the steps I followed for reproducibility:

  1. Set up the luceneutil repository following the installation instructions provided.

  2. Switch branches to this specific branch since the latest mainline branch is not compatible with the feature needed for this experiment.

  3. Change the branch of lucene_candidate to benwtrent:feature/adv-binarization-format to incorporate advanced binarization formats.

  4. Run knnPerfTest.py after specifying the document and query file paths to the stored Cohere data files. The runtime parameters were set as follows:

    • nDoc = 500,000
    • topk = 10
    • fanout = 100
    • maxConn = 32
    • beamWidth = 100
    • oversample values tested: {1, 1.5, 2, 3, 4, 5}

    I used quantizeBits = 1 for RaBitQ+HNSW and quantizeBits = 32 for regular HNSW.

A comparison was performed between HNSW and RaBitQ, and I observed the recall-latency tradeoff, which is shown in the attached image:
[chart: output]

ShashwatShivam avatar Nov 12 '24 16:11 ShashwatShivam

FYI, a blog post on RaBitQ:

https://dev.to/gaoj0017/quantization-in-the-counterintuitive-high-dimensional-space-4feg

tanyaroosta avatar Dec 10 '24 21:12 tanyaroosta

Thanks, Tanya @tanyaroosta, for sharing our blog about RaBitQ in this thread. I am the first author of the RaBitQ paper, and I am glad to know that our RaBitQ method has been discussed in the threads here. Regarding the BBQ (Better Binary Quantization) method mentioned in these threads, my understanding is that it largely follows the framework of RaBitQ and makes some minor modifications for practical performance considerations. The key features claimed for BBQ as described in a blog from Elastic, "Better Binary Quantization (BBQ) in Lucene and Elasticsearch" (e.g., normalization around a centroid, multiple error-correction values, asymmetric quantization, bit-wise operations), all originate from our RaBitQ paper.

We note that the industry quite often customizes methods from academia to better suit its applications, but it rarely gives the variant a new name and claims it as a new method. For example, the PQ and HNSW methods came from academia and have been widely adopted in industry with some modifications, yet industry still respects their original names. We believe the same practice should be followed for RaBitQ.

In addition, we would like to share that we have extended RaBitQ to support quantization beyond 1 bit per dimension (e.g., 2-bit, 3-bit, ...). The paper on the extended RaBitQ was made available in Sep 2024. It achieves this by constructing a larger codebook than RaBitQ's, and it can be equivalently understood as an optimized scalar quantization method. For details, please refer to the paper and also a blog that we have recently posted.

gaoj0017 avatar Dec 14 '24 04:12 gaoj0017

Closing this PR in deference to this one: https://github.com/apache/lucene/pull/14078

An evolution of scalar quantization proved more flexible and provided better recall in our experiments.

benwtrent avatar Dec 17 '24 22:12 benwtrent