Document stores won't warn users if they're trying to extract Labels from a document index and vice-versa

Open ZanSara opened this issue 3 years ago • 0 comments

Describe the bug

  • Calling docstore.get_all_labels on an index populated with Documents fails with a rather obscure Pydantic validation error.

Error message

Traceback (most recent call last):
  File "/home/sara/work/haystack/haystack/document_stores/elasticsearch.py", line 691, in get_all_labels
    labels = [Label.from_dict({**hit["_source"], "id": hit["_id"]}) for hit in result]
  File "/home/sara/work/haystack/haystack/document_stores/elasticsearch.py", line 691, in <listcomp>
    labels = [Label.from_dict({**hit["_source"], "id": hit["_id"]}) for hit in result]
  File "/home/sara/work/haystack/haystack/schema.py", line 552, in from_dict
    return _pydantic_dataclass_from_dict(dict=dict, pydantic_dataclass_type=cls)
  File "/home/sara/work/haystack/haystack/schema.py", line 720, in _pydantic_dataclass_from_dict
    base_model = pydantic_dataclass_type.__pydantic_model__.parse_obj(dict)
  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 5 validation errors for Label
query
  field required (type=value_error.missing)
document
  field required (type=value_error.missing)
is_correct_answer
  field required (type=value_error.missing)
is_correct_document
  field required (type=value_error.missing)
origin
  field required (type=value_error.missing)

Expected behavior

The cause of the error is obvious, so I'd expect the docstore to warn the user of the mistake. Something like:

Failed to create labels from the content of index 'eval_docs'. Are you sure this index contains labels?
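
A minimal sketch of the kind of wrapping I have in mind. It is not the actual docstore code: `FakeLabel`, `LabelIndexError` and `labels_from_hits` are hypothetical names made up for illustration, and a plain dataclass stands in for the Pydantic-backed `Label` so the snippet runs without Haystack or Elasticsearch. The list comprehension is the one from the traceback above. The only change is the `try`/`except` around it, which re-raises with the friendlier message and keeps the original error as the cause.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class FakeLabel:
    """Hypothetical stand-in for Label, holding the required fields named in the traceback."""
    id: str
    query: str
    document: Dict[str, Any]
    is_correct_answer: bool
    is_correct_document: bool
    origin: str

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "FakeLabel":
        # Raises TypeError on missing or unexpected fields; this plays the
        # role of the Pydantic ValidationError in the real code.
        return cls(**data)


class LabelIndexError(Exception):
    """Hypothetical error type that carries the clearer message."""


def labels_from_hits(
    hits: List[Dict[str, Any]],
    index: str,
    from_dict: Callable[[Dict[str, Any]], Any] = FakeLabel.from_dict,
) -> List[Any]:
    """Build labels from raw search hits, failing with a readable message."""
    try:
        return [from_dict({**hit["_source"], "id": hit["_id"]}) for hit in hits]
    except (TypeError, ValueError) as err:
        # Assumption: the real validation error can be caught here as well.
        raise LabelIndexError(
            f"Failed to create labels from the content of index '{index}'. "
            "Are you sure this index contains labels?"
        ) from err
```

Chaining with `from err` keeps the original validation details in the traceback for anyone who needs them, while the first thing the user reads is the hint about the index.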

Additional context

When doing evaluation, different indices contain different data types, so a clear error message can help debugging a lot.

To Reproduce

from haystack.utils import launch_es
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import PreProcessor

launch_es()

doc_index = "eval_docs"
label_index = "eval_labels"

document_store = ElasticsearchDocumentStore(host="localhost",
                                            username="",
                                            password="",
                                            index=doc_index,
                                            label_index=label_index,
                                            recreate_index=True)

preprocessor = PreProcessor(
    split_by="word",
    split_length=200,
    split_overlap=0,
    split_respect_sentence_boundary=False,
    clean_empty_lines=False,
    clean_whitespace=False,
)

document_store.add_eval_data(
    filename="data/tutorial5/nq_dev_subset_v2.json",
    doc_index=doc_index,
    label_index=label_index,
    preprocessor=preprocessor,
)

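# Triggers the bug: labels are requested from the document index ("eval_docs")
# instead of the label index, which raises the ValidationError shown above.
eval_labels = document_store.get_all_labels_aggregated(drop_negative_labels=True, drop_no_answers=True, index="eval_docs")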

ZanSara, Aug 08 '22 11:08