
`Pipeline.run_batch()` fails on indexing

Open ZanSara opened this issue 3 years ago • 1 comments

Describe the bug

  • Pipeline.run_batch() fails on indexing pipelines.
  • The failure happens early and seems to occur in the run_batch method itself.
  • The failure seems trivial.

Error message

Traceback (most recent call last):
  File "/home/sara/work/haystack/examples/example2.py", line 40, in <module>
    pipe.run_batch(file_paths=next(os.walk('examples'))[2])
  File "/home/sara/work/haystack/haystack/pipelines/base.py", line 612, in run_batch
    documents=flattened_documents,
UnboundLocalError: local variable 'flattened_documents' referenced before assignment

To Reproduce

import os

from haystack import Pipeline
from haystack.nodes import FileTypeClassifier, TextConverter
from haystack.document_stores import InMemoryDocumentStore

pipe = Pipeline()

pipe.add_node(name="classifier", component=FileTypeClassifier(supported_types=["py", "sh", "png", "yml"]), inputs=["File"])
pipe.add_node(name="py-converter", component=TextConverter(), inputs=["classifier.output_1"])
pipe.add_node(name="sh-converter", component=TextConverter(), inputs=["classifier.output_2"])
pipe.add_node(name="yml-converter", component=TextConverter(), inputs=["classifier.output_4"])
pipe.add_node(name="docstore", component=InMemoryDocumentStore(), inputs=["py-converter", "sh-converter", "yml-converter"])

docs_to_index = next(os.walk('examples'))[2]  # Substitute the path to reproduce
print("Docs to index:")
for doc in docs_to_index:
    print(f" - {doc}")

pipe.run_batch(file_paths=docs_to_index)

ZanSara avatar Aug 08 '22 08:08 ZanSara

https://github.com/deepset-ai/haystack/blob/c91316e862c3fb751b3e8996ddd5f99b5563ae81/haystack/pipelines/base.py#L558-L616

This part seems intended to exclude indexing pipelines from batch processing in run_batch by falling back to the plain run method. I tried defining flattened_documents just before this condition: https://github.com/deepset-ai/haystack/blob/c91316e862c3fb751b3e8996ddd5f99b5563ae81/haystack/pipelines/base.py#L603 and with that change the UnboundLocalError is no longer raised.

But now, if the directory contains mixed types of files, I get other errors, probably related to FileTypeClassifier. I reported them in #2999.
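
For readers who want to see the failure mode in isolation, here is a minimal sketch of the pattern described above. It is not the actual Haystack code: `run`, `run_batch_buggy` and `run_batch_fixed` are hypothetical stand-ins, and the flattening logic is only assumed to resemble what base.py does. It shows a variable assigned in only one branch and then referenced unconditionally, plus the fix of defining it before the condition.

```python
# Hypothetical stand-ins, NOT the real haystack.pipelines.base code.

def run(file_paths=None, documents=None):
    # Stand-in for a single-call run(); it just echoes its inputs.
    return {"file_paths": file_paths, "documents": documents}


def run_batch_buggy(file_paths=None, documents=None):
    if file_paths is not None:  # assumed "indexing pipeline" shortcut
        if isinstance(documents, list):
            # Only assigned when documents is a list of lists.
            flattened_documents = [doc for sub in documents for doc in sub]
        # When documents is None the name was never bound, so this
        # line raises UnboundLocalError.
        return run(file_paths=file_paths, documents=flattened_documents)
    raise NotImplementedError("batch path omitted from this sketch")


def run_batch_fixed(file_paths=None, documents=None):
    # The fix: bind the name before the condition.
    flattened_documents = documents
    if file_paths is not None:
        if isinstance(documents, list):
            flattened_documents = [doc for sub in documents for doc in sub]
        return run(file_paths=file_paths, documents=flattened_documents)
    raise NotImplementedError("batch path omitted from this sketch")


if __name__ == "__main__":
    try:
        run_batch_buggy(file_paths=["example.py"])
    except UnboundLocalError as err:
        print(f"buggy version failed: {err}")
    print(run_batch_fixed(file_paths=["example.py"]))
```

Calling the buggy variant with only `file_paths` reproduces the same exception class as in the traceback, while the fixed variant passes `documents=None` through to `run`.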

anakin87 avatar Aug 08 '22 19:08 anakin87

Just... wow. I wasn't aware of this catch in run_batch for indexing. Thank you so much for highlighting it :pray:

The fix you found is probably sufficient by the way: I guess this code path was just left untested :smiling_face_with_tear: Thank you for checking it out, feel free to open a PR for this little change alone.

ZanSara avatar Aug 10 '22 09:08 ZanSara