document_store.update_embeddings seems to update embeddings regardless of parameter

Open theoky opened this issue 2 years ago • 1 comments

I'm using qdrant-haystack 1.0.11 with farm-haystack==1.21.2 and Python 3.10.13 on Windows 10, with Qdrant running in Docker.

When updating the embeddings of a document store, document_store.update_embeddings seems to update all embeddings even when update_existing_embeddings is set to False.

I'm running this code:

import timeit
from haystack import Document
from haystack.nodes import EmbeddingRetriever
from qdrant_haystack.document_stores import QdrantDocumentStore

# Helper timed below; document_store and retriever are module-level globals
# that are defined before the first call.
def update_embeddings(existing):
    document_store.update_embeddings(retriever, update_existing_embeddings=existing)

document_store = QdrantDocumentStore(url="localhost", index="test_update_embeddings",
                                    embedding_dim=512, similarity="cosine")

retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="sentence-transformers/distiluse-base-multilingual-cased-v1",
                               use_gpu=False)

# 50 documents with distinct content
docs_to_index = [Document(content=str(i) + " random text"*100) for i in range(0, 50)]

document_store.write_documents(docs_to_index, duplicate_documents="skip")

# Time two runs of each variant
res_upd = timeit.timeit(stmt='update_embeddings(True)', globals=globals(), number=2)
res_noupd = timeit.timeit(stmt='update_embeddings(False)', globals=globals(), number=2)

print(f"Execution with update: {res_upd}, with no update: {res_noupd}")

After execution, the Qdrant database contains 50 vectors, as expected.

I would also expect update_embeddings(False) to run significantly faster than update_embeddings(True), but both statements take nearly the same time:

Execution with update: 22.15771689999383, with no update: 20.913242900016485

To me this looks like update_embeddings(..., update_existing_embeddings=False) is updating the embeddings, too.
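Wall-clock time is only an indirect signal, so one way to check this more directly is to count how many documents actually get passed to the retriever. The sketch below is a minimal helper under the assumption that update_embeddings calls retriever.embed_documents internally; count_embed_calls is a hypothetical name, not part of Haystack or qdrant-haystack.

```python
from typing import Any, Dict


def count_embed_calls(retriever: Any) -> Dict[str, int]:
    """Wrap retriever.embed_documents so that calls and documents are counted.

    Hypothetical helper: it assumes the document store computes embeddings
    by calling retriever.embed_documents(documents).
    """
    stats = {"calls": 0, "documents": 0}
    original = retriever.embed_documents  # keep the real bound method

    def counting_embed_documents(documents, *args, **kwargs):
        # Record the call, then delegate to the real method unchanged.
        stats["calls"] += 1
        stats["documents"] += len(documents)
        return original(documents, *args, **kwargs)

    # Shadow the method on this instance only; the class is left untouched.
    retriever.embed_documents = counting_embed_documents
    return stats
```

Calling stats = count_embed_calls(retriever) before document_store.update_embeddings(retriever, update_existing_embeddings=False) and printing stats afterwards would show whether the existing documents were embedded again.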

What am I missing?

theoky, Nov 27 '23 19:11

I've just found this comment in the relevant source file:

:param update_existing_embeddings: Not used by QdrantDocumentStore, as all the points
                                   must have a corresponding vector in Qdrant.

So update_embeddings does not work for my use case:

  • Precondition: Qdrant contains x documents and their corresponding embeddings
  • Actions
    • Get n new documents
    • Write the n documents to Qdrant
    • Update only the embeddings of the n new documents using update_embeddings

A working approach would be:

  • Precondition: Qdrant contains x documents and their corresponding embeddings
  • Actions
    • Get n new documents
    • Create the n embeddings manually for the new documents
    • Write the n documents to Qdrant (as far as I understand, write_documents does not check the validity of the embeddings)
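The steps above could be sketched roughly as follows. This is a minimal sketch, not the library's recommended method: embed_and_write is a hypothetical helper, and it assumes the retriever exposes embed_documents (returning one vector per document, in order) and that documents carry an embedding attribute, as in Haystack 1.x.

```python
from typing import Any, List


def embed_and_write(new_docs: List[Any], retriever: Any, document_store: Any) -> int:
    """Embed only the new documents, then write them with vectors attached.

    Hypothetical helper. Returns the number of documents handed to the store.
    """
    if not new_docs:
        return 0

    # Assumption: one embedding per document, in the same order as new_docs.
    embeddings = retriever.embed_documents(new_docs)

    # Attach each vector to its document before writing.
    for doc, embedding in zip(new_docs, embeddings):
        doc.embedding = embedding

    # Same duplicate handling as in the reproduction script.
    document_store.write_documents(new_docs, duplicate_documents="skip")
    return len(new_docs)
```

With the objects from the script above, this would be called as embed_and_write(new_docs, retriever, document_store), so that only the n new documents pass through the embedding model and the existing x documents are left alone.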

So update_embeddings is basically only useful when I change the model that generates the embeddings? To me, this seems to go against the intent of having a simple pipeline.

theoky, Dec 01 '23 08:12