haystack
file_path in meta should optionally be just the filename
When indexing, the file_path currently stored in meta is the absolute path of the source file.
As a result, full paths that contain usernames can end up in the document store, e.g.:
        -0.013439087197184563,
        -0.05053149536252022,
        0.011438718996942043
    ],
    "sparse_embedding": null,
    "file_path": "/Users/<redacted-user-name>/brochure123.pdf",
    "source_id": "b2a1aa616d9d6901c1559601c515318e8fc4f8c4f242414c745ef019a3c0eb50",
    "page_number": 1,
    "split_id": 0,
It would be great to have an option to store only the filename, not the full path, in meta.
I believe a current workaround would be a custom component along these lines (this one removes file_path from meta entirely):
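import copy
from typing import List

from haystack import Document, component

@component
class FilePathRemover:
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]):
        # Work on a deep copy so the incoming documents are not mutated.
        documents_copy = copy.deepcopy(documents)
        for doc in documents_copy:
            # pop() avoids a KeyError for documents without a file_path.
            doc.meta.pop("file_path", None)
        return {"documents": documents_copy}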
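
What I am actually asking for is closer to the following: keep the key, but store only the filename. This is a minimal sketch, not anything Haystack provides. The helper name filename_only is hypothetical, it works on a plain meta dict using only the standard library, and it splits the path by the conventions of the OS it runs on.

```python
from pathlib import Path
from typing import Any, Dict


def filename_only(meta: Dict[str, Any], key: str = "file_path") -> Dict[str, Any]:
    """Return a shallow copy of meta where `key` holds only the final path component.

    Hypothetical helper for illustration; not part of Haystack.
    """
    cleaned = dict(meta)  # shallow copy, the caller's dict is left untouched
    value = cleaned.get(key)
    if isinstance(value, str):
        # Path.name drops every parent directory, including the user's home.
        cleaned[key] = Path(value).name
    return cleaned


if __name__ == "__main__":
    # Example path is made up for the demo.
    meta = {"file_path": "/Users/example-user/brochure123.pdf", "page_number": 1}
    print(filename_only(meta))
```

In the workaround component above, the same call could presumably replace the deletion, e.g. doc.meta = filename_only(doc.meta) inside the loop, so downstream components still see which file a document came from.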