document_store.update_embeddings seems to update embeddings regardless of the update_existing_embeddings parameter
I'm using qdrant-haystack 1.0.11 with farm-haystack==1.21.2 and Python 3.10.13 on Windows 10, with Qdrant running in Docker.
When updating the embeddings of a document store, document_store.update_embeddings seems to update all embeddings even when update_existing_embeddings is set to False.
I'm running this code:
import timeit

from haystack import Document
from haystack.nodes import EmbeddingRetriever
from qdrant_haystack.document_stores import QdrantDocumentStore

def update_embeddings(existing):
    # Thin wrapper so that timeit can call it by name via globals()
    document_store.update_embeddings(retriever, update_existing_embeddings=existing)

document_store = QdrantDocumentStore(url="localhost", index="test_update_embeddings",
                                     embedding_dim=512, similarity="cosine")
retriever = EmbeddingRetriever(document_store=document_store,
                               embedding_model="sentence-transformers/distiluse-base-multilingual-cased-v1",
                               use_gpu=False)

# 50 documents, each with distinct content
docs_to_index = [Document(content=str(i) + " random text" * 100) for i in range(0, 50)]
document_store.write_documents(docs_to_index, duplicate_documents="skip")

# Each result is the total time for 2 runs
res_upd = timeit.timeit(stmt='update_embeddings(True)', globals=globals(), number=2)
res_noupd = timeit.timeit(stmt='update_embeddings(False)', globals=globals(), number=2)
print(f"Execution with update: {res_upd}, with no update: {res_noupd}")
After execution, the Qdrant database contains 50 vectors, as expected.
I would also expect update_embeddings(False) to run significantly faster than update_embeddings(True), but both statements take nearly the same time:
Execution with update: 22.15771689999383, with no update: 20.913242900016485
To me this looks like update_embeddings(..., update_existing_embeddings=False) is updating the embeddings, too.
What am I missing?
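Timing is only an indirect signal, so a more direct check might be to count how many documents actually reach the embedding model. Below is a minimal sketch of that idea. It assumes that update_embeddings calls retriever.embed_documents internally, which I have not verified for every version. count_embed_calls is a hypothetical helper name, not part of Haystack or qdrant-haystack.

```python
from typing import Any, Dict


def count_embed_calls(retriever: Any) -> Dict[str, int]:
    """Wrap retriever.embed_documents on this instance and count its usage.

    Hypothetical helper. Assumes the document store obtains vectors by
    calling retriever.embed_documents(documents).
    """
    counter = {"calls": 0, "documents": 0}
    original = retriever.embed_documents  # bound method of the real retriever

    def counting_embed_documents(documents, *args, **kwargs):
        counter["calls"] += 1
        counter["documents"] += len(documents)
        return original(documents, *args, **kwargs)

    # Instance-level override; the retriever class itself stays untouched.
    retriever.embed_documents = counting_embed_documents
    return counter


# Intended usage with the objects from the script above:
#   counter = count_embed_calls(retriever)
#   document_store.update_embeddings(retriever, update_existing_embeddings=False)
#   print(counter)
```

If the assumption holds, a non-zero "documents" count after a call with update_existing_embeddings=False would indicate that existing embeddings are being recomputed.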
I've just found this comment in the relevant source file:
:param update_existing_embeddings: Not used by QdrantDocumentStore, as all the points
must have a corresponding vector in Qdrant.
So for my use case, update_embeddings does not work:
- Precondition: Qdrant contains x documents and their corresponding embeddings
- Actions
  - Get n new documents
  - Write the n documents to Qdrant
  - Update the embeddings of only the n new documents using update_embeddings
A working use case would be:
- Precondition: Qdrant contains x documents and their corresponding embeddings
- Actions
  - Get n new documents
  - Create the n embeddings manually for the new documents
  - Write the n documents to Qdrant (as far as I understand, write_documents does not check the validity of the embeddings)
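The working use case above could be sketched roughly as follows. This is only an illustration under some assumptions: the retriever exposes embed_documents (as EmbeddingRetriever does in farm-haystack 1.x), each document has a writable embedding attribute, and the store accepts documents that already carry embeddings. write_new_documents_with_embeddings is a hypothetical helper name.

```python
from typing import Any, List


def write_new_documents_with_embeddings(
    document_store: Any, retriever: Any, new_docs: List[Any]
) -> int:
    """Embed only the new documents, attach the vectors, then write them.

    Hypothetical helper: existing documents in the store are never touched,
    so their embeddings are not recomputed.
    """
    if not new_docs:
        return 0

    # One embedding per document, in the same order as new_docs (assumption).
    embeddings = retriever.embed_documents(new_docs)
    for doc, embedding in zip(new_docs, embeddings):
        doc.embedding = embedding

    # Same duplicate handling as in the script from the original post.
    document_store.write_documents(new_docs, duplicate_documents="skip")
    return len(new_docs)
```

With the objects from my script, this would be called as write_new_documents_with_embeddings(document_store, retriever, new_docs), where new_docs is the list of n new Document objects.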
So update_embeddings is basically only useful when I change the model that generates the embeddings? To me, this seems to go somewhat against the intent of having a simple pipeline.