haystack fix: Adding `namespace` variable to `retriever()` and `retriever

Added the variable namespace to each of the execution methods, retriever() and retriever_batch() for retrieving data from the correct partition corresponding to the selected Index.

Related Issues

fixes #issue-number

Proposed Changes:

Without adding the namespace variable to the method, if the user has multiple partitions inside an Index, it will cause miscommunication with the retriever. In some cases, the retriever will return an empty list.

How did you test it?

Unit tests, integration tests

Notes for the reviewer

@ZanSara knows about this case.

Checklist

[X] I have read the contributors guidelines and the code of conduct
[X] I have updated the related issue with new insights and changes
[ ] I added tests that demonstrate the correct behavior of the change
[X] I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
[X] I documented my code
[ ] I ran pre-commit hooks and fixed any issue

Apr 06 '23 14:04 AI-Ahmed

@AI-Ahmed, how's this one coming along, anything we can help with?

Apr 18 '23 14:04 vblagoje

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
_{You have signed the CLA already but the status is still pending? Let us recheck it.}

May 18 '23 10:05 CLAassistant

@ZanSara Ahmed says you are familiar with this use case. Any suggestions on what we should do here in case @AI-Ahmed doesn't update this PR?

Jun 07 '23 08:06 vblagoje

For sure, @vblagoje! So, As you know, we have different types of vector database components and vector search components in Haystack. Some of them, such as PineconeDocumentStore and QdrantDocumentStore (if I remember correctly), require a namespace parameter, unlike other Vector Search, such as ElisticSearchDocumentStore or OpenSearchDocumentStore, which are not supporting this kind of feature.

This parameter namespace allows you to partition your vector database index into partitions, which allows a more personalized user experience if the user has one index to deal with. Therefore, it helps to produce faster QPS and also controlled indices.

As I mentioned in the first paragraph, some document stores support that. Still, the others don't, so adding this parameter to the Retriever component retrieve() and batch_retrieve() functions are something fundamental: First, it will reduce the QPS, Second and consider the most important thing, it will end the problem of the recursive checking of the document when the user Upsert/Update documents/vectors to the index.

For instance, when you create an index on Pinecone and upsert your first file using Haystack DocumentStore, you won't feel any problem. Still, If you upserted the second file, you would notice that the upsert reinforces retrieving all the documents from Pinecone first, then comparing them with your second upserted file, then moving all of them to a third namespace (partition ) for this process, then get them back again to the default namespace which is called vectors.

This process overloads the system when you have too many vectors; plus, it is time and computationally costly.

Therefore, I proposed to add the namespace parameter to the method retrieve() in the object DenseRetriever so that any inherited objects, whether EmbeddingRetriever or other retrievers that take argument DocumentStore which support the namespace feature such as PineconeDocumentStore_query_by_embedding in their framework, it would help the user to use both namespace and filter to upsert/update only the proposed/selected documents without recursively checking all the other vectors.

I did already commit this, but I didn't have afterwardards to do the pre-commit since it takes too much time, but here I explained everything needed for this problem to be solved. I hope I was clear in my explanations.

Jun 07 '23 09:06 AI-Ahmed

@AI-Ahmed yes you were clear about what you want to achieve; thank you! What I am not clear about is how do we proceed from this point? It seems that unit tests are missing, and we have to understand the impact of passing the additional parameter to non-supporting vector databases, how we should perhaps ignore it in non-supporting databases etc etc. Please let us know.

Jun 18 '23 08:06 vblagoje

Closing for lack of activity

Nov 23 '23 08:11 masci

fix: Adding `namespace` variable to `retriever()` and `retriever_batch()`

Related Issues

Proposed Changes:

How did you test it?

Notes for the reviewer

Checklist