Support multiple HNSW graphs backed by the same vectors
Description
For use-cases of searching different subsets of vectors in the index, where a non-trivial portion of vectors across fields are overlapping.
This could be done today by:
- Indexing all vectors in a single field and using query-time pre-filtering, but it can become expensive if the filter is restrictive (and large parts of the graph have to be traversed to collect accepted docs).
- Creating separate fields for groups of vectors, but it could increase index size for a large number of overlapping vectors across subsets.
Implementation wise, we'd have multiple HNSW graphs per-field (backed by the same raw vectors), each identifiable by an ID (perhaps an integer or string). Each document in this field would specify a set of IDs along with its raw vector, and separate HNSW graphs would be created for each ID. Similarly the query would specify an ID to search in.
Wanted to check how common the use-case is, and if adding such a field would help..
If filtered performance is critical, then I wonder if there could be better ways, e.g. using a vector search algorithm from the IVF family instead of HNSW, and configuring an index sort on the same fields that you plan on using as an ID in your proposal so that filtering would boil down to intersecting doc IDs with a range. Or creating multiple indexes, e.g. an e-commerce catalog could have one Lucene index for its most popular category and another Lucene index for all other categories - when there is no filter you would search these two indexes with a MultiReader, and otherwise it would select which index to search depending on the category filter. Or a mix of these two ideas.
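A rough sketch of the multi-index routing idea (directory paths and wiring are assumptions, not part of this issue):

```java
// two physical indexes: one for the most popular category, one for everything else
DirectoryReader popular = DirectoryReader.open(FSDirectory.open(Path.of("/index/popular-category")));
DirectoryReader others = DirectoryReader.open(FSDirectory.open(Path.of("/index/other-categories")));

// no category filter: search both indexes as one logical index
IndexSearcher unfiltered = new IndexSearcher(new MultiReader(popular, others));

// category filter: route the query to the single index holding that category
IndexSearcher popularOnly = new IndexSearcher(popular);
```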
My concern with your proposal is that it would be quite invasive in terms of API (IDs would need to be configurable in the IndexableField API, exposed by VectorValues, etc.) and hard to maintain (every future KNNVectorsFormat would have to be able to deal with this ID concept).
I wonder if we could support this in a different way via some kind of "shadow field" that would refer to the same underlying vector data but could be indexed on a sub-set of documents.
A "shadow field" sounds like a view. It would be nice to be able to build an index on top of a view, but wouldn't this introduce dependencies between fields in a way that doesn't exist today?
A "shadow field" sounds like a view. It would be nice to be able to build an index on top of a view, but wouldn't this introduce dependencies between fields in a way that doesn't exist today
yes
@kaivalnp The use-case you have mentioned is pretty interesting.
@msokolov the idea of a shadow field can help unlock other use-cases too, like https://github.com/apache/lucene/issues/14247. The idea in #14247 was to inject information with the vector field that can be used to build the HNSW graph based on the injected information (for example, tenant information in a multi-tenant index).
I'm wondering if ACORN would work for this use case: https://arxiv.org/abs/2403.04871
Thank you for your input everyone!
I'm wondering if ACORN would work for this use case
@dungba88 while ACORN may speed up the graph-search component of a pre-filtered search -- it still requires building a BitSet of accepted docs at query time + searching through a large graph containing all documents. A positive consequence of this search is that a user can filter on any arbitrary Query
The benefit I'm suggesting is for when a user wants to filter on a constraint known at index-time, and consequently move the bulk of work to indexing to create and search smaller graphs -- which may be acceptable because the size of HNSW graphs is smaller than raw vector data
In an e-commerce marketplace, this could mean having different graphs for (often overlapping) categories of products -- for example:
- If a user first performs a broad search on the entire catalog, we can search for vector results without any filter
- But if they "scope" into a category (like electronics, furniture, etc.) -- we need to create a
BitSetof all products in the category, traverse a large HNSW graph, but only "collect" vectors matching the filter - The
BitSetcan be shared across queries, but a separate one needs to be created for each index checkpoint. We'll also need a separateBitSetfor each category -- which is stored on-heap and can become memory-intensive for a large number of unique categories (or filters in general) - Since these categories are known at index-time, I'm proposing to create separate HNSW graphs for each category (or filter in general). These will be smaller, off-heap, and backed by the same set of raw vectors!
- This will increase indexing time and size, but at the benefit of lower RAM pressure and faster search (smaller graphs) -- and we can let the user make this tradeoff decision..
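A query-side sketch of the two options above (field names are illustrative; the category-specific graph only exists under this proposal):

```java
IndexSearcher searcher; // init
float[] queryVector;    // init

// today: pre-filtered search on the single large graph
Query filter = new TermQuery(new Term("category", "electronics"));
TopDocs filtered = searcher.search(new KnnFloatVectorQuery("vector", queryVector, 100, filter), 100);

// this proposal: search the smaller, category-specific graph directly (no BitSet needed)
TopDocs scoped = searcher.search(new KnnFloatVectorQuery("vector-electronics", queryVector, 100), 100);
```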
The use-case you have mentioned is pretty interesting.
@navneet1v thanks, please let me know if OpenSearch has use-cases which will benefit from this proposal (e.g. due to multi-tenant nodes, or specific applications)
@jpountz while I started out thinking of adding an explicit set of IDs for each vector, and then creating separate graphs for each ID -- I realize this approach will require non-trivial API changes to deal with IDs as a new concept. However, there may be an alternate way of doing the same thing (see following comment)
What if we de-duplicate vectors in Lucene?
- Today, we have a `Lucene99FlatVectorsFormat` responsible for reading / writing raw vectors
  - This format maintains a list of vectors per-field during indexing, simply appending to the list for new documents
  - All vectors within a field are written in a single "block" during flush, one vector after another
  - All "blocks" are present in the same file, one field after another
  - The metadata file maps each field to its "block", i.e. an offset and length in the raw vector file
- What if we combine "homogeneous" fields into a single "block", and de-duplicate vectors inside a "block"?
  - For vector data, "homogeneous" fields are ones with the same `byte` / `float32` encoding and dimension
  - This proposal has an added side-benefit of de-duplicating vectors within a field as well (if the features used for vector generation are identical across two documents)
An end-user can simply add new top-level vector fields in the same document with the same vector, and Lucene takes care of storing a single unique copy:
IndexWriter iw; // init
float[] vector1; // init
Document doc1 = new Document();
doc1.add(new KnnFloatVectorField("vector-category-1", vector1));
doc1.add(new KnnFloatVectorField("vector-category-2", vector1));
doc1.add(new KnnFloatVectorField("vector-category-3", vector1));
iw.addDocument(doc1); // only stores one copy of vector1, but creates / adds to HNSW graphs 1, 2, 3
float[] vector2; // init
Document doc2 = new Document();
doc2.add(new KnnFloatVectorField("vector-category-1", vector2));
doc2.add(new KnnFloatVectorField("vector-category-3", vector2));
iw.addDocument(doc2); // only stores one copy of vector2, but creates / adds to HNSW graphs 1, 3
Performance considerations
Current State:
- Within the format, vectors are represented as "ordinals", and a mapping from ordinal -> docid is stored in the metadata file
- Edges of a node in the HNSW graph are stored as a list of ordinals in increasing order, and the graph itself is a concatenation of such lists:
  - i.e. if we traverse the vector with "ordinal" `k` during graph search, we seek to position `k * maxConn * bytes per ordinal (= 4)` and read `maxConn * bytes per ordinal (= 4)` bytes as a list of "ordinals" of size `maxConn`
- We retrieve the vector corresponding to an "ordinal" by seeking to `ordinal * dimension * bytes per dimension (= 1 for byte vectors, or 4 for float32 vectors)`
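The arithmetic above, as a simplified sketch (single graph level, float32 vectors; `graphIn` / `vectorIn` stand in for the graph and raw-vector inputs):

```java
int ord;       // init -- ordinal of the node being visited
int maxConn;   // init -- max neighbors per node
int dimension; // init -- vector dimension
IndexInput graphIn, vectorIn; // init -- graph file and raw-vector (.vec) file

long neighborListStart = (long) ord * maxConn * Integer.BYTES; // 4 bytes per ordinal
long vectorStart = (long) ord * dimension * Float.BYTES;       // 4 bytes per float32 dimension (1 for byte)
graphIn.seek(neighborListStart); // then read maxConn ordinals
vectorIn.seek(vectorStart);      // then read the raw vector
```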
Query Time: If we only store unique copies of each vector:
- Multiple ordinals can point to the same vector, so we need to maintain an additional mapping of ordinal -> position of vector in raw data. This means one additional lookup per vector retrieval (TBD: store this mapping on or off-heap?)
  - We can also avoid the lookup completely, by "deflating" each `list[neighbor ord]` in the HNSW graph to a `list[tuple[neighbor ord, position of vector in raw file]]` -- but this will 2x the graph size (TBD if this is required or not)
- Currently neighbors are stored in increasing order of "ordinals", which has some data access pattern benefits if multiple vectors are retrieved from the same page. In our proposal, neighbors will need to be stored in increasing order of position in the raw file instead, to maintain the same benefits (breaking the convention of increasing ordinal IDs, but I don't think this is a constraint)
- Note that our proposal has an added data access pattern benefit for "de-duplicated" vectors -- they'll be present at the same physical location and a guaranteed OS cache-hit (where earlier, they were present at different locations in the raw file, and may or may not result in an OS cache-hit)
- Another consequence of storing vectors in non-"ordinal" order is that exact search iterates vectors in increasing docid (i.e. increasing ordinal) order -- but this is not a constraint, since it is going to traverse all vectors (i.e. we'll need to traverse vectors in the order of presence in raw file, which is doable in theory, but will need some API additions)
Indexing Time: Indexing will be more expensive (more work to determine "unique" vectors + create new / smaller HNSW graphs)
TBD: Ideal order for storing raw vectors?
- If we maintain insertion order -- unique vectors get new monotonically increasing positions in the raw file, and every "hit" (or non-unique vector) gets pointed to an existing position. This could be beneficial with something like `BpVectorReorderer`, where documents with vectors "closer" to each other have nearby Lucene doc IDs. However, merges become expensive -- as we have to combine vectors in arbitrary order across two files, maintaining only unique ones -- so everything has to be loaded onto RAM (something like `O(1)` insertion time during indexing, and `O(n)` time + `O(n)` memory during merging -- this may be a non-starter)
- If we choose a stable sort (ideally one with some data locality benefits) -- then merging becomes inexpensive, as vectors are "sorted" in a deterministic order -- so two files can be efficiently merged (something like `O(log n)` insertion at indexing time, and `O(n)` time + `O(1)` memory during merging). We can accept `O(log n)` because it's not asymptotically worse than HNSW
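To illustrate why the stable-sort option keeps merging cheap: the merge becomes a streaming merge with adjacent-duplicate removal. A rough sketch (`next`, `compare`, and `write` are placeholders for whatever iteration, ordering, and output we settle on):

```java
// vectors from the two segments arrive pre-sorted in the stable order, so any duplicate
// across segments shows up adjacently and can be dropped with O(1) extra memory
float[] a = next(left), b = next(right), last = null;
while (a != null || b != null) {
  float[] smaller = (b == null || (a != null && compare(a, b) <= 0)) ? a : b;
  if (smaller == a) a = next(left); else b = next(right);
  if (last == null || compare(last, smaller) != 0) {
    write(smaller); // first (and only) copy of this vector in the merged segment
    last = smaller;
  }
}
```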
Other considerations
- We'll need some codec-level plumbing to ensure that the same instance (not just class!) of the raw format is shared by all HNSW formats
- TBD how this whole proposal works out for quantized vectors:
- Two different non-quantized vectors can map to the same quantized vector, but with different scaling and score correction constants
- Two same non-quantized vectors can map to different quantized vectors, based on distribution within the field
This idea is still WIP -- but on a high-level, seems doable without changing public APIs.. Please let me know what you think!
- This proposal has an added side-benefit of de-duplicating vectors within a field as well (if the features used for vector generation are identical across two documents)
This is really interesting! It reminds me of ZFS's block-level deduping, and how annoyingly and weirdly effective it often is (because the files I store have surprising redundancy / I forget that I made N copies / whatever).
Hmmm must the deduping be perfect (guaranteed to dedup all dups)? What if it only did so within one document, which would enable this "compile KNN prefilter to separate field's HNSW graph during indexing" efficiently? But not across documents. It might be a baby step with less added indexing cost since you wouldn't need a global sort order for vectors, merging wouldn't need to check for dups across segments, etc.? Within one document we could even do a quick == check to see if the identical float[] or byte[] was passed to multiple fields (common case, hopefully)?
Multiple ordinals can point to the same vector, so we need to maintain an additional mapping of ordinal -> position of vector in raw data.
Hmm, why would we need to also know the position? Oh I see -- fieldA will refer to the dedup'd vector with a different ordinal than fieldB because ordinals must be compact (0..N) within each field? This might be another (small?) benefit of only dedup within one document: this ordinalA -> ordinalB mapping would be monotonic, and compress well (PackedLongValues.monotonicBuilder), like the DocMap we use during merging to map around deletions, or map after merge sort (for the statically sorted indexes).
Maybe this could be done with "ordinal to ordinal" mapping, just within the dup'd field? User would add `fieldA = new KnnFloatVectorField(...)` and then `fieldB = fieldA.newLinkedField("name")`, so `fieldB` knows it's sharing from `fieldA`'s vector. `fieldA` would build a normal HNSW graph like we have today (no ord remapping, positions, etc.), but `fieldB` would record an additional ord -> ord map (maybe bimap?).
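A rough sketch of that ord -> ord map using the existing `PackedLongValues` utility (the surrounding wiring -- `fieldBOrdCount`, `fieldAOrdFor`, `queryOrdB` -- is made up):

```java
// fieldB's ordinals are a subset of fieldA's, so the mapping is monotonic and compresses well
PackedLongValues.Builder builder = PackedLongValues.monotonicBuilder(PackedInts.COMPACT);
for (int ordB = 0; ordB < fieldBOrdCount; ordB++) {
  builder.add(fieldAOrdFor(ordB)); // fieldA ordinal holding the shared vector
}
PackedLongValues ordBToOrdA = builder.build();

// at search time: resolve fieldB's ordinal to fieldA's ordinal before fetching the vector
long ordA = ordBToOrdA.get(queryOrdB);
```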
We'll need some codec-level plumbing to ensure that the same instance (not just class!) of the raw format is shared by all HNSW formats
Isn't this how the default Codec (currently Lucene103Codec) works? It's per-field HNSW format, but by default all fields use the same instance of KnnVectorsWriter when writing a segment?
What if it only did so within one document, which would enable this "compile KNN prefilter to separate field's HNSW graph during indexing" efficiently? But not across documents
Thanks @mikemccand, I like this a lot!
It may be possible -- but deviates from how we merge vectors: today, the KnnVectorsWriter indexes and merges vectors per-field, but it'll need to be document-by-document to avoid buffering all vectors across multiple segments on-heap. This would, however, avoid the need to sort the vectors in the index! (the file could be structured document-wise instead, see below)
Today, the .vec file looks something like:
--- field 1 begin:
(vector for field 1, document d1) // position x0
(vector for field 1, document d2)
(vector for field 1, document d3)
--- field 1 end, field 2 begin:
(vector for field 2, document d1) // position x1
(vector for field 2, document d3)
--- field 2 end, field 3 begin:
(vector for field 3, document d1) // position x2
(vector for field 3, document d2)
--- field 3 end, and so on..
The .vem file contains tuples to map fields -> vector "blocks", as (position, length)
The "block" info for field 1 looks like (x0, x1 - x0)
The "block" info for field 2 looks like (x1, x2 - x1)
..and so on
The ordToDoc mapping for field 1 looks like {0 -> d1, 1 -> d2, 2 -> d3}
The ordToDoc mapping for field 2 looks like {0 -> d1, 1 -> d3}
The ordToDoc mapping for field 3 looks like {0 -> d1, 1 -> d2}
..and so on
Now if we de-duplicate vectors within a document (i.e. only store one unique copy of the vector, across fields), then the boundary between "blocks" is broken. We could structure the file by documents instead:
--- document d1 begin:
(vector for field 1, document d1) // position x0
(vector for field 2, document d1) // position x1
(vector for field 3, document d1) // position x2
--- document d1 end, document 2 begin:
(vector for field 1, document d2) // position x3
(vector for field 3, document d2) // position x4
--- document d2 end, document 3 begin:
(vector for field 1, document d3) // position x5
(vector for field 2, document d3) // position x6
--- document d3 end, and so on..
..so an additional ordToVec mapping is needed (the ordToDoc mapping will remain unchanged)
The ordToVec mapping for field 1 looks like {0 -> x0, 1 -> x3, 2 -> x5}
The ordToVec mapping for field 2 looks like {0 -> x1, 1 -> x4}
The ordToVec mapping for field 3 looks like {0 -> x2, 1 -> x6}
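At search time, fetching a vector through the extra hop would look roughly like this (`ordToVec` holds positions like x0..x6 above, treated here as byte offsets into the .vec file; names are illustrative):

```java
IndexInput vectorData;     // init -- raw .vec file
PackedLongValues ordToVec; // init -- proposed per-field ord -> vector position mapping
int ord, dimension;        // init

long position = ordToVec.get(ord); // e.g. field 1, ord 1 -> x3
vectorData.seek(position);         // the single shared copy stored for that document
float[] vector = new float[dimension];
vectorData.readFloats(vector, 0, dimension);
// ordToDoc is untouched: ord -> docid works exactly as it does today
```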
We'll need to merge two .vec files document-by-document instead, in order to identify and de-dup vectors, and immediately write to the new segment without having to buffer everything in memory (if we still merged per-field)
Pros (against segment-level de-duping):
- Faster indexing / merge (avoids sorting)
- Gain back some data locality benefits if a vector reorder policy is used
- Need not break the current convention of neighbor ordinals of HNSW nodes being in increasing order (if we did segment-level de-duping, these neighbor ordinals would need to be stored in increasing order of underlying vectors to regain some data locality benefits)
- `ordToVec` would compress well, like you mentioned
Cons (against segment-level de-duping):
- Larger index (cannot de-dup across fields)
Cons (in general):
- We'll lose some data locality benefits (since vectors within a field are now spread out across the file, instead of being present in contiguous chunks). This is true for both document-level and segment-level de-duping
I'm really leaning towards the per-document de-duping now!
and then `fieldB = fieldA.newLinkedField("name")`, so `fieldB` knows it's sharing from `fieldA`'s vector
This would shift the responsibility of "identifying" duplicates to the user, and IMO Lucene should take care of this internally. It can be done by maintaining a document-level set of vectors during indexing and merge, and will (probably) not add a lot of burden, time / space / code complexity wise..
Isn't this how the default Codec (currently `Lucene103Codec`) works? It's per-field HNSW format, but by default all fields use the same instance of `KnnVectorsWriter` when writing a segment?
While this does use a single KnnVectorsWriter -- the PerFieldKnnVectorsFormat creates some internal writers here that may be different for different fields (even if they both use the same flat format internally) -- which we'd like to avoid..
I'll first attempt to hack luceneutil to (explicitly) create new vector fields to "simulate" a filter the way this proposal would (internally) -- and we can judge from benchmarks..
@mikemccand I was able to hack luceneutil to perform the following benchmark:
- Take an additional input `filterFactor`, where documents with `ID % filterFactor == 0` are considered "live" (so `filterFactor` = 2 implies 50% docs are "live", `filterFactor` = 5 implies 20% docs are "live", and so on..)
- Take another input `filterStrategy` with possible values of:
  - `index-time`: create a separate vector field (and corresponding HNSW graph), with just the filtered documents
  - `query-time`: perform a pre-filtered search on the large graph
- Changes I used are in https://github.com/kaivalnp/luceneutil/tree/index-time-filtering (warning: rough changes!)
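For reference, the `index-time` strategy boils down to roughly this per document (a sketch of the hack, not the exact luceneutil code; field names are illustrative):

```java
Document doc = new Document();
doc.add(new KnnFloatVectorField("vector", vector));            // full graph over all docs
if (id % filterFactor == 0) {                                  // doc is "live" for this filter
  doc.add(new KnnFloatVectorField("vector_filtered", vector)); // smaller graph over live docs only
}
iw.addDocument(doc);
```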
Cohere vectors, 768d, MAXIMUM_INNER_PRODUCT
query-time filtering:
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) filterStrategy filterFactor indexType
0.882 3.959 3.957 1.000 200000 100 50 32 200 no 7759 10.95 18268.18 13.12 1 600.70 query-time 2 HNSW
0.888 3.932 3.931 1.000 200000 100 50 32 200 no 6856 10.59 18878.61 13.08 1 600.85 query-time 5 HNSW
0.877 3.106 3.105 1.000 200000 100 50 32 200 no 4780 10.76 18592.54 12.89 1 600.73 query-time 10 HNSW
0.830 2.313 2.312 0.999 200000 100 50 32 200 no 2814 10.92 18316.70 12.88 1 600.72 query-time 20 HNSW
index-time filtering:
recall latency(ms) netCPU avgCpuCount nDoc topK fanout maxConn beamWidth quantized visited index(s) index_docs/s force_merge(s) num_segments index_size(MB) filterStrategy filterFactor indexType
0.929 1.199 1.198 0.999 200000 100 50 32 200 no 4349 15.05 13291.69 18.42 1 901.09 index-time 2 HNSW
0.952 1.040 1.039 0.999 200000 100 50 32 200 no 4084 13.87 14422.73 17.97 1 720.66 index-time 5 HNSW
0.967 0.862 0.861 0.999 200000 100 50 32 200 no 3763 12.90 15501.47 16.01 1 660.61 index-time 10 HNSW
0.980 0.670 0.669 0.998 200000 100 50 32 200 no 3287 12.61 15861.69 15.93 1 630.74 index-time 20 HNSW
..and the gains are apparent (~70% speedup in filtered search time)!
Couple of things to note:
- The `index_size(MB)` is larger for `index-time` filters, because we're creating a new field without de-duping vectors -- this will be improved once we de-dup vectors
- The graph search time for `index-time` filtered search will be higher in reality, as we'll lose some data locality benefits when the user adds additional vector fields
- This benchmark just measures graph search time, and not the overhead to create and maintain a pre-filter `BitSet`, so the true gains with `index-time` filtering are probably higher
- This speedup is not free -- the user pays by moving cost up-front to indexing, see `index(s)` and `force_merge(s)`
These are awesome results @kaivalnp! And this was only 200K docs -- with larger indices would the gains be more or less?
Also, it's quite disturbing that even at a not-so-restrictive filter (50%), the recall is already quite a bit worse than the index-time filter (0.882 vs 0.929)? And then it gets worse -- at a 5% accept filter it's 0.830 vs 0.980.
Let's get your luceneutil changes merged -- this is useful for benchmarking -- I'll try to review that PR soon.
These gains are awesome -- and for starters Lucene users can simply duplicate their vectors (like you did for this test), wasting index storage. That already works today.
So ... how do we make progress on NOT wasting disk storing (dedup somehow)? If we dedup only within one doc, it is (maybe?) simpler than dedup across docs? Only downside is probably some Lucene users will have substantial dups across docs, and it would've helped them? But perhaps such users should use index-time joins instead... or maybe they are not realizing they have dups.
I wonder how this would work for MMAP'd regions. I need to read through the entire implementation, but moving away from all vectors for a single field being represented in a single chunk to vectors across all fields being represented in one big chunk seems dubious.
Deduplicating within a doc seems prudent to me to avoid this cross-field antics. Allowing ordinals to all point to the same file region and mapping those to a doc id.
Also, as an aside, I think HNSW would benefit (for less restrictive filters) by auto-post filtering for users. So, if the filter matches 90% of docs, maybe only gather 10% more vectors and then post filter. This will likely speed things up and prevent eager evaluation of filters.
Thank you for your inputs @mikemccand @benwtrent :)
Let's get your luceneutil changes merged -- this is useful for benchmarking
I had it in a shared branch earlier, opened https://github.com/mikemccand/luceneutil/pull/468 now!
If we dedup only within one doc, it is (maybe?) simpler than dedup across docs?
It is (probably) simpler API-wise, perhaps less work at indexing, but larger indexes
or maybe they are not realizing they have dups
Dups inside Lucene could occur one of three ways:
- The inputs to the model generating vectors is the same across two documents (even though other fields are different)
- The model generates the same vector for two different inputs (sort of like a hash collision)
- The user knowingly passes the same vector to a new field for "index-time filtering"
I guess users can figure out (1) on their own, (2) is much less likely, and I'm hoping users find some value in (3) with the above benchmark change :)
vectors across all fields being represented in one big chunk
FWIW we'd still create multiple chunks, with each chunk containing "homogeneous" vectors (i.e. ones where duping is even possible). In this whole idea, locality benefits can be improved if the maximum number of "homogeneous" groups is created, with each group having the minimum number of fields that will actually share vectors.
I don't know how to best determine these groups -- the obvious parameters to include are: encoding (byte / float32) and dimensions. Perhaps we ask the user directly, but this would create some kind of dependency b/w fields that does not exist today. To avoid this complication -- I was proposing that Lucene "automatically" de-dup based on the above parameters -- but suggestions welcome here. At the end, we just want separate HNSW graphs created for filters known at index-time :)
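For concreteness, the "automatic" grouping key could be as small as this (a sketch; the record name is made up):

```java
// fields whose keys are equal would share a raw-vector block and be de-dup candidates
record HomogeneousGroupKey(VectorEncoding encoding, int dimension) {}
```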
I agree that use case (3) above is the target (user knowingly passes duplicate vector).
Then it bothers me that we don't make use of the information that one of the fields is strictly a subset of the other (as it is filtered), and that we require callers to supply the same vectors twice while indexing, and then, ignoring the information, try to recreate it through the complexity of sorting vector data (and requiring codec changes to store vector data globally, sharing it in ways we didn't really anticipate, ie unrelated vector fields will now be stored together).
I guess I'm still stuck on the "view" idea. What would be the problem with having one field reference another? I guess when creating a reader for the "view" field it would have to delegate to a flat vectors reader from the other field, and it would require an ordinal mapping (so that its graph can have dense ordinals while still referring to ordinals from the other field for value lookups), or else support graphs with non-dense ordinals. Aside from all that, does Lucene somehow have a guarantee that there are no inter-field dependencies? I can't think of any way it does. Its indexing is entirely document-centric; it enforces that the field definitions are immutable (so if field B depends on A and A's definition changes that could conceivably be a problem, but it doesn't arise).
Lucene does not write/merge fields in a well defined order, and with a view you really need that field to be written/merged "last" so that you can view the contents of the field(s) that you depend on. In a crude straw man proposal you might have an is_view annotation on a field, that field cannot be written to with a document, and writing/merging these fields happens after all the "regular" fields complete. It's pretty easy to imagine the existing hnsw infrastructure working in something like this since it mirrors the split between flat knn value formats and the graph build, the difference being that these are now exposed to Lucene as two fields instead of just one.
Was thinking of another use case for this, which is building multiple HNSW graphs with different parameters from the same underlying vector data. Could be useful, eg, for tuning hyperparameters.