Document stores won't warn users if they're trying to extract Labels from a document index and vice-versa
Describe the bug
Calling docstore.get_all_labels on an index populated with Documents will fail with a quite obscure Pydantic validation error.
Error message
Traceback (most recent call last):
File "/home/sara/work/haystack/haystack/document_stores/elasticsearch.py", line 691, in get_all_labels
labels = [Label.from_dict({**hit["_source"], "id": hit["_id"]}) for hit in result]
File "/home/sara/work/haystack/haystack/document_stores/elasticsearch.py", line 691, in <listcomp>
labels = [Label.from_dict({**hit["_source"], "id": hit["_id"]}) for hit in result]
File "/home/sara/work/haystack/haystack/schema.py", line 552, in from_dict
return _pydantic_dataclass_from_dict(dict=dict, pydantic_dataclass_type=cls)
File "/home/sara/work/haystack/haystack/schema.py", line 720, in _pydantic_dataclass_from_dict
base_model = pydantic_dataclass_type.__pydantic_model__.parse_obj(dict)
File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 5 validation errors for Label
query
field required (type=value_error.missing)
document
field required (type=value_error.missing)
is_correct_answer
field required (type=value_error.missing)
is_correct_document
field required (type=value_error.missing)
origin
field required (type=value_error.missing)
Expected behavior
The cause of the error is obvious, so I'd expect the docstore to warn the user of the mistake. Something like:
Failed to create labels from the content of index 'eval_docs'. Are you sure this index contains labels?
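One possible way to surface such a message is to catch the ValidationError around the label parsing shown in the traceback (a sketch only; result, index, and Label are names taken from the traceback above, and wrapping the error in a ValueError is an assumption, not the project's chosen approach):

from pydantic import ValidationError

try:
    labels = [Label.from_dict({**hit["_source"], "id": hit["_id"]}) for hit in result]
except ValidationError as e:
    raise ValueError(
        f"Failed to create labels from the content of index '{index}'. "
        "Are you sure this index contains labels?"
    ) from e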
Additional context
When doing evaluation, different indices contain different data types, so a clear error message can help a lot with debugging.
To Reproduce
from haystack.utils import launch_es
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import PreProcessor

launch_es()

doc_index = "eval_docs"
label_index = "eval_labels"

document_store = ElasticsearchDocumentStore(
    host="localhost",
    username="",
    password="",
    index=doc_index,
    label_index=label_index,
    recreate_index=True,
)

preprocessor = PreProcessor(
    split_by="word",
    split_length=200,
    split_overlap=0,
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False,
)

document_store.add_eval_data(
    filename="data/tutorial5/nq_dev_subset_v2.json",
    doc_index=doc_index,
    label_index=label_index,
    preprocessor=preprocessor,
)
# Passing the document index ("eval_docs") instead of the label index here is what triggers the obscure error above
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=True, index="eval_docs")
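For comparison, pointing the same call at the label index populated by add_eval_data should return the aggregated labels without error (a sketch, reusing the label_index variable defined above):

eval_labels = document_store.get_all_labels_aggregated(
    drop_negative_labels=True, drop_no_answers=True, index=label_index
)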