Extract headings from Markdown files

Open bogdankostic opened this issue 3 years ago • 0 comments

Describe the solution you'd like

Markdown files mark headings with the # character, and the number of # characters indicates the heading level. We should use this information and add heading information to the document's metadata. This could be used to improve retrieval, for example.

The number of # characters at the beginning of a line should be quite easy to detect. Alternatively, we could use the heading information from the HTML that we produce in an intermediate step when converting Markdown files.
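As a rough illustration of the first option, here is a minimal sketch that counts the leading # characters line by line. The function name `extract_headings` and the metadata keys `headline`, `level`, and `start_idx` are assumptions made for this example, not an existing API. It only handles ATX-style headings (1 to 6 # characters followed by whitespace) and skips lines inside fenced code blocks in a simplistic way.

```python
import re

# 1 to 6 '#' characters, then whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")

# Markers that open or close a fenced code block (built without
# writing three literal backticks in a row).
FENCE_MARKERS = ("`" * 3, "~~~")


def extract_headings(markdown_text):
    """Return a list of dicts describing the headings in markdown_text.

    Each dict holds the heading text, its level (number of '#'), and the
    character offset of the heading line in the original text.
    """
    headings = []
    in_fence = False
    offset = 0
    for line in markdown_text.splitlines(keepends=True):
        if line.strip().startswith(FENCE_MARKERS):
            # Toggle fence state so '#' comments in code are ignored.
            in_fence = not in_fence
        elif not in_fence:
            match = HEADING_RE.match(line.rstrip("\r\n"))
            if match:
                headings.append(
                    {
                        "headline": match.group(2),
                        "level": len(match.group(1)),
                        "start_idx": offset,
                    }
                )
        offset += len(line)
    return headings
```

Storing the character offset alongside each heading is a design choice for this sketch: it makes it possible to decide later which headings belong to which part of the text.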

When adding this, we need to make sure that the PreProcessor's split method keeps only those headings that are relevant for each split.
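One possible way to do this filtering, sketched under the assumption that each heading carries a character offset (`start_idx`) into the original text and that each split is described by its start and end offsets. The helper `headings_for_split` is hypothetical and not part of the PreProcessor. Here "relevant" is interpreted as "the heading starts inside the split", which is only one possible reading.

```python
def headings_for_split(headings, split_start, split_end):
    """Keep only the headings that start inside [split_start, split_end).

    The offsets of the kept headings are rebased so that they are
    relative to the beginning of the split instead of the full text.
    """
    relevant = []
    for heading in headings:
        if split_start <= heading["start_idx"] < split_end:
            # Copy the dict so the original heading list stays untouched.
            rebased = {**heading, "start_idx": heading["start_idx"] - split_start}
            relevant.append(rebased)
    return relevant


# Usage: three headings in a document, filtered for one split.
headings = [
    {"headline": "Title", "level": 1, "start_idx": 0},
    {"headline": "Section A", "level": 2, "start_idx": 16},
    {"headline": "Sub", "level": 3, "start_idx": 58},
]
split_headings = headings_for_split(headings, 10, 40)
```

A split that begins in the middle of a section gets no heading under this interpretation; carrying over the last heading before the split would be a possible extension.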

bogdankostic, Aug 17 '22 12:08