Update embeddings via document IDs

Open atreyasha opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe.

I am building a REST API using Haystack pipelines. One of my endpoints creates, writes, and indexes user-submitted documents into an OpenSearch document store. The endpoint looks roughly like this:

@app.post("/documents")
def create_document(
    document: IncomingDocument,
    opensearch=Depends(get_opensearch),
    retriever=Depends(get_retriever),
):
    document = convert_to_haystack_document(document)
    opensearch.write_documents([document])
    opensearch.update_embeddings(retriever, update_existing_embeddings=False)

The idea would be to incrementally update the embeddings and the corresponding KNN index as users insert documents.

During stress tests, I ran into several OpenSearch indexing errors of type BulkIndexError, each containing the keyword version_conflict_engine_exception. This is similar to the following SO post.

This error is likely caused by a race condition in which several gunicorn workers update the embeddings of the same documents simultaneously: a worker queries for documents with missing embeddings and gets back documents that another worker is already embedding.
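To make the suspected race concrete, here is a toy model of it. It is only a sketch under assumptions: ToyIndex and VersionConflict are hypothetical stand-ins written for this illustration, not OpenSearch or Haystack code, and the version check is a simplified picture of optimistic concurrency control.

```python
class VersionConflict(Exception):
    """Stand-in for OpenSearch's version_conflict_engine_exception."""


class ToyIndex:
    """Hypothetical in-memory index with a per-document version counter."""

    def __init__(self):
        self.docs = {}  # doc_id -> {"embedding": ..., "version": int}

    def write(self, doc_id):
        self.docs[doc_id] = {"embedding": None, "version": 1}

    def missing_embeddings(self):
        # Snapshot of (doc_id, version) pairs that still lack an embedding.
        return [
            (doc_id, doc["version"])
            for doc_id, doc in self.docs.items()
            if doc["embedding"] is None
        ]

    def set_embedding(self, doc_id, seen_version, embedding):
        doc = self.docs[doc_id]
        # Reject the write if the document changed since the snapshot.
        if doc["version"] != seen_version:
            raise VersionConflict(doc_id)
        doc["embedding"] = embedding
        doc["version"] += 1


index = ToyIndex()
index.write("doc-a")
index.write("doc-b")

# Both workers take their snapshot before either has written anything back.
seen_by_worker_1 = index.missing_embeddings()
seen_by_worker_2 = index.missing_embeddings()

# Worker 1 finishes first and succeeds.
for doc_id, version in seen_by_worker_1:
    index.set_embedding(doc_id, version, [0.1, 0.2])

# Worker 2 now writes against stale versions and fails on every document.
conflicts = []
for doc_id, version in seen_by_worker_2:
    try:
        index.set_embedding(doc_id, version, [0.1, 0.2])
    except VersionConflict as exc:
        conflicts.append(str(exc))

print(conflicts)
```

In this model the second worker conflicts on every document it shares with the first, even though each worker only intended to embed the document it had just written.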

Describe the solution you'd like

Since the ID of the created document is known, it would be great if the update_embeddings method could accept a list of document IDs whose embeddings should be updated. This way, each gunicorn worker would only update the document IDs it is responsible for, and the aforementioned race conditions would not occur.

Describe alternatives you've considered

Unsuccessful workaround:

I tried to use the filters argument of the update_embeddings method, but AFAIK it only applies to metadata and not to the id field of an OpenSearch index.

Successful workaround:

At the moment, I store an external ID for each document in the meta field. With this external ID in meta, I am able to use the filters argument to update_embeddings to achieve my goal. But I think a more direct solution would be either:

  1. update_embeddings accepts a list of IDs OR
  2. filters could be used on the id field on top of the meta field (not sure of the complexity or side-effects of this)
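For reference, a minimal sketch of the successful workaround described above. The helper name write_and_embed, the meta key external_id, and the use of a UUID as the external ID are assumptions made for this illustration. The filter shape (a meta key mapped to a list of allowed values) is what I understand Haystack v1 to accept, so treat it as an assumption too. RecordingStore is a stand-in that only records calls, so the snippet runs without Haystack or OpenSearch.

```python
import uuid
from types import SimpleNamespace


def write_and_embed(store, retriever, document, meta_key="external_id"):
    """Write one document, then embed only documents carrying its external ID."""
    external_id = str(uuid.uuid4())
    document.meta[meta_key] = external_id
    store.write_documents([document])
    # Restrict the embedding update to the document this worker just wrote.
    store.update_embeddings(
        retriever,
        filters={meta_key: [external_id]},
        update_existing_embeddings=False,
    )
    return external_id


class RecordingStore:
    """Stand-in for a document store; it only records the calls it receives."""

    def __init__(self):
        self.written = []
        self.filters_seen = []

    def write_documents(self, documents):
        self.written.extend(documents)

    def update_embeddings(self, retriever, filters=None, update_existing_embeddings=True):
        self.filters_seen.append(filters)


store = RecordingStore()
doc = SimpleNamespace(content="hello", meta={})  # stand-in for a Haystack document
tag = write_and_embed(store, retriever=None, document=doc)
print(store.filters_seen)
```

Each request ends up with a filter that matches only its own document, which is what keeps the workers from touching each other's embeddings.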

atreyasha, Jul 11 '23 13:07