haystack
Extract headings from Markdown files
Describe the solution you'd like
Markdown files mark headings with the # character, and the number of # characters indicates the heading level. We should use this information and add the headings to each document's metadata. This could be used to improve retrieval, for example.
Detecting the number of # characters at the beginning of a line should be fairly easy. Alternatively, we could use the heading information from the HTML that we produce as an intermediate step when converting Markdown files.
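As a rough illustration of the first option, here is a minimal sketch that scans the raw Markdown for lines starting with one to six # characters. The function name `extract_headings` and the metadata keys `headline`, `level`, and `start_idx` are assumptions made for this example, not existing haystack API. It skips fenced code blocks but ignores Setext-style (underlined) headings.

```python
import re

# One to six '#' characters, at least one space or tab, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$")
# Opening or closing line of a fenced code block.
FENCE_RE = re.compile(r"^\s*(```|~~~)")


def extract_headings(markdown_text):
    """Return a list of dicts describing each '#'-style heading in the text.

    The keys 'headline', 'level' and 'start_idx' are hypothetical names
    chosen for this sketch.
    """
    headings = []
    in_fence = False
    offset = 0  # character offset of the current line within the text
    for line in markdown_text.splitlines(keepends=True):
        stripped = line.rstrip("\r\n")
        if FENCE_RE.match(stripped):
            # Toggle on every fence line so '#' comments in code are skipped.
            in_fence = not in_fence
        elif not in_fence:
            match = HEADING_RE.match(stripped)
            if match:
                headings.append(
                    {
                        "headline": match.group(2),
                        "level": len(match.group(1)),
                        "start_idx": offset,
                    }
                )
        offset += len(line)
    return headings
```

Storing the character offset alongside each heading is a design choice for this sketch: it lets a later splitting step decide which headings fall inside which split.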
When adding this, we need to make sure that the PreProcessor's split method keeps only those headings that are relevant for each split.
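One possible way to decide relevance, sketched under the assumption that each heading carries a character offset (`start_idx`) into the original text and that a split is described by its start and end offsets. The helper `headings_for_split` and the metadata layout are hypothetical and are not the PreProcessor's actual implementation. Headings inside the split are kept with offsets re-based to the split, and the closest heading before the split is carried over with `start_idx` set to `None`, since it still describes the text at the start of the split.

```python
def headings_for_split(headings, split_start, split_end):
    """Select the headings relevant to the split [split_start, split_end).

    'headings' is a list of dicts with at least a 'start_idx' key
    (hypothetical layout used only in this sketch).
    """
    inside = []
    preceding = None
    for heading in sorted(headings, key=lambda h: h["start_idx"]):
        if heading["start_idx"] < split_start:
            # Remember the last heading that starts before this split.
            preceding = heading
        elif heading["start_idx"] < split_end:
            # Re-base the offset so it is relative to the split's own text.
            inside.append(
                {**heading, "start_idx": heading["start_idx"] - split_start}
            )
    if preceding is not None:
        # The heading is not part of the split's text, so it has no offset.
        inside.insert(0, {**preceding, "start_idx": None})
    return inside


headings = [
    {"headline": "Title", "level": 1, "start_idx": 0},
    {"headline": "Setup", "level": 2, "start_idx": 16},
    {"headline": "Details", "level": 3, "start_idx": 50},
]
first_split = headings_for_split(headings, 0, 20)
second_split = headings_for_split(headings, 20, 60)
```

Here `first_split` contains "Title" and "Setup", while `second_split` contains "Setup" as the carried-over heading followed by "Details" at offset 30. Whether the carried-over heading should be included at all is a judgment call that depends on how the metadata is used for retrieval.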