Forbid the key `id` from `Document`s to be written in `WeaviateDocumentStore`
Whenever the new documents are pushed to the weaviate the structure is flattened to eliminate "meta" info. This is dangerous and could lead to unexpected id replacement(s). This PR suggest to raise an error whenever the duplicate id(s) are found within "meta".
Minimal example to reproduce
from haystack.schema import Document
from haystack.document_stores import WeaviateDocumentStore
doc = Document.from_dict({"content": "abracadabra", "meta": {"id": 123456789}})
store = WeaviateDocumentStore()
store.write_documents([doc]) # ValueError: Not valid 'uuid' or 'uuid' can not be extracted from value
Proposed changes:
- Raise an error whenever the duplicate key-value
Hello @thenewera-ru! I admit it took me a while to understand this bug :smile: Let me explain.
At first, it looked to me like it was the snippet that was wrong. In fact, if you change the third line with
doc = Document.from_dict({"content": "abracadabra", "id": 123456789})
it will work as expected. Documents have the id as a top-level attribute, and placing it in the meta makes little sense from a Haystack perspective. However, this bug is reproducible, which means that Weaviate (probably) stores these IDs in the metadata internally, and setting them before-hand break the integration.
So here is my proposal: how about checking if there's a key named ID in the metadata, to prevent this bug? Optimally I would raise a meaningful error, something like WeaviateDocumentStoreError('You can't have a field called id in the metadata of a Document. Please set the ID as a top-level attribute').
For other duplicate keys, instead, I think a warning is more than sufficient. We might even not check at all imho.
(I edited your issue name for clarity. We use issue names for the release notes)
Hello @thenewera-ru! I admit it took me a while to understand this bug 😄 Let me explain.
At first, it looked to me like it was the snippet that was wrong. In fact, if you change the third line with
doc = Document.from_dict({"content": "abracadabra", "id": 123456789})it will work as expected. Documents have the id as a top-level attribute, and placing it in the meta makes little sense from a Haystack perspective. However, this bug is reproducible, which means that Weaviate (probably) stores these IDs in the metadata internally, and setting them before-hand break the integration.
So here is my proposal: how about checking if there's a key named ID in the metadata, to prevent this bug? Optimally I would raise a meaningful error, something like
WeaviateDocumentStoreError('You can't have a field called id in the metadata of a Document. Please set the ID as a top-level attribute').
I agree with you but I myself faced the bug while not expecting that "meta" could potentially contain id. Usually, you expect only top-level structure document to have an id. Sometimes, though, you may encounter document with both top-level id and within the "meta". Something like:
doc = Document.from_dict({"text": "abracdabra", "id": ..., "meta": {"id": 123456789}})
In such case I would suggest to either raise an Exception or ignore that key. I would vote for exception because you might expect this information in the future and the cost of bug at runtime seems to be tough. Better catch it at indexing step? What do you think?
The problem is not particularly with the id field. The same bug could be with the "content" field appearing both in top-level structure and within the meta. What a surprise to successfully index the document with the embedding corresponding to the top-level "content" text and later receive completely wrong text - the one that replaced the top-level content by meta content.
In such case I would suggest to either raise an Exception or ignore that key. I would vote for exception because you might expect this information in the future and the cost of bug at runtime seems to be tough
Agree! An exception thrown by write_documents would be best. Documents should not have a key called id in their meta. That can only cause confusion. If the users need to store another id of any kind (maybe unrelated to Haystack), they should simply name the key differently.
The problem is not particularly with the id field. The same bug could be with the "content" field appearing both in top-level structure and within the meta. What a surprise to successfully index the document with the embedding corresponding to the top-level "content" text and later receive completely wrong text - the one that replaced the top-level content by meta content.
You're right. I think it makes sense to check that no keys defined top-level are present in the meta and raise the same exception in all cases.
Hello @thenewera-ru! There's a Haystack release coming up next Monday. Do you think we can merge this fix by then? Shall I take over and do the last changes? Let me know :slightly_smiling_face: