File_path in meta should optionally be just the filename

Open aantti opened this issue 1 year ago • 0 comments

When indexing, file_path currently passed to meta contains the absolute path used as the source.

As a result, full paths that contain usernames can end up stored in a document store, e.g.:

            -0.013439087197184563,
            -0.05053149536252022,
            0.011438718996942043
          ],
          "sparse_embedding": null,
          "file_path": "/Users/<redacted-user-name>/brochure123.pdf",
          "source_id": "b2a1aa616d9d6901c1559601c515318e8fc4f8c4f242414c745ef019a3c0eb50",
          "page_number": 1,
          "split_id": 0,

It would be great to have an option to only store the filename, not the full path in meta.

I believe a current workaround would be something along the following lines (this example removes file_path from meta entirely):

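import copy
from typing import List

from haystack import Document, component


@component
class FilePathRemover:
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]):
        # Work on a deep copy so the incoming documents are not mutated
        documents_copy = copy.deepcopy(documents)

        for doc in documents_copy:
            # pop() with a default avoids a KeyError when "file_path" is absent
            doc.meta.pop("file_path", None)
        return {"documents": documents_copy}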
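If the goal is to keep the filename rather than drop the key, the same workaround could reduce the path to its last component. The sketch below is plain Python with no Haystack dependency; `filename_only_meta` is a hypothetical helper name, not part of any library, and it assumes `file_path` is stored as a string.

```python
from pathlib import PureWindowsPath
from typing import Any, Dict


def filename_only_meta(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `meta` whose "file_path" holds only the filename.

    Hypothetical helper for illustration; assumes "file_path" is a string.
    """
    cleaned = dict(meta)  # shallow copy, so the caller's dict is left untouched
    path = cleaned.get("file_path")
    if isinstance(path, str) and path:
        # PureWindowsPath splits on both "/" and "\", so POSIX-style and
        # Windows-style paths are both reduced to their final component.
        cleaned["file_path"] = PureWindowsPath(path).name
    return cleaned
```

In a component like the one in this issue, the helper could be applied to each copied document (for example `doc.meta = filename_only_meta(doc.meta)`) instead of deleting the key. One caveat of this choice: a POSIX filename that itself contains a backslash would be split as well.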

aantti, Oct 07 '24 10:10