fix: Adding `namespace` variable to `retriever()` and `retriever_batch()`
Added the variable namespace to each of the execution methods, retriever() and retriever_batch() for retrieving data from the correct partition corresponding to the selected Index.
Related Issues
- fixes #issue-number
Proposed Changes:
Without adding the namespace variable to the method, if the user has multiple partitions inside an Index, it will cause miscommunication with the retriever. In some cases, the retriever will return an empty list.
How did you test it?
Unit tests, integration tests
Notes for the reviewer
@ZanSara knows about this case.
Checklist
- [X] I have read the contributors guidelines and the code of conduct
- [X] I have updated the related issue with new insights and changes
- [ ] I added tests that demonstrate the correct behavior of the change
- [X] I've used one of the conventional commit types for my PR title:
fix:,feat:,build:,chore:,ci:,docs:,style:,refactor:,perf:,test:. - [X] I documented my code
- [ ] I ran pre-commit hooks and fixed any issue
@AI-Ahmed, how's this one coming along, anything we can help with?
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
@ZanSara Ahmed says you are familiar with this use case. Any suggestions on what we should do here in case @AI-Ahmed doesn't update this PR?
For sure, @vblagoje! So, As you know, we have different types of vector database components and vector search components in Haystack. Some of them, such as PineconeDocumentStore and QdrantDocumentStore (if I remember correctly), require a namespace parameter, unlike other Vector Search, such as ElisticSearchDocumentStore or OpenSearchDocumentStore, which are not supporting this kind of feature.
This parameter namespace allows you to partition your vector database index into partitions, which allows a more personalized user experience if the user has one index to deal with. Therefore, it helps to produce faster QPS and also controlled indices.
As I mentioned in the first paragraph, some document stores support that. Still, the others don't, so adding this parameter to the Retriever component retrieve() and batch_retrieve() functions are something fundamental: First, it will reduce the QPS, Second and consider the most important thing, it will end the problem of the recursive checking of the document when the user Upsert/Update documents/vectors to the index.
For instance, when you create an index on Pinecone and upsert your first file using Haystack DocumentStore, you won't feel any problem. Still, If you upserted the second file, you would notice that the upsert reinforces retrieving all the documents from Pinecone first, then comparing them with your second upserted file, then moving all of them to a third namespace (partition ) for this process, then get them back again to the default namespace which is called vectors.
This process overloads the system when you have too many vectors; plus, it is time and computationally costly.
Therefore, I proposed to add the namespace parameter to the method retrieve() in the object DenseRetriever so that any inherited objects, whether EmbeddingRetriever or other retrievers that take argument DocumentStore which support the namespace feature such as PineconeDocumentStore_query_by_embedding in their framework, it would help the user to use both namespace and filter to upsert/update only the proposed/selected documents without recursively checking all the other vectors.
I did already commit this, but I didn't have afterwardards to do the pre-commit since it takes too much time, but here I explained everything needed for this problem to be solved. I hope I was clear in my explanations.
@AI-Ahmed yes you were clear about what you want to achieve; thank you! What I am not clear about is how do we proceed from this point? It seems that unit tests are missing, and we have to understand the impact of passing the additional parameter to non-supporting vector databases, how we should perhaps ignore it in non-supporting databases etc etc. Please let us know.
Closing for lack of activity