docling
Sample chunking notebook that includes merging, etc.
Some key differences between this proposed chunking notebook and the one in advanced_chunking.ipynb:
- This one merges chunks that have the same headings and captions (e.g., adjacent paragraphs within the same section).
- This one splits on doc_items such as elements of an itemized list before trying to apply generic text splitting. This results in chunks that respect the beginning and end of the list items more often.
- This one uses the DoclingDocument.name as the title of the document instead of assuming that the title will be in the headers. That's probably not a great idea going forward though, because in the near future the extracted title will be in the headers. The DoclingDocument.name comes from document metadata and sometimes also reflects the title but is often not very useful.
- This one uses semchunk as the plain text splitter for use when the hierarchical elements are too big. In the semchunk repo, you can see their argument for why this is a good generic text splitter. Also, I tried it on some tricky examples and I liked the output in practice.
- This one does not use yield to stream out the chunks one at a time -- it just uses lists for everything and then wraps them in an iterator at the end to comply with the API. That seems simpler but is probably less efficient, especially at large scale.
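
To make the merging and list-then-iterator points above concrete, here is a minimal sketch in plain Python. It is not the notebook's code: the Chunk dataclass, count_tokens, and merge_adjacent are hypothetical names made up for illustration, and the whitespace word count is only a stand-in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk; not a docling class."""
    text: str
    headings: Tuple[str, ...] = ()
    captions: Tuple[str, ...] = ()


def count_tokens(text: str) -> int:
    # Assumption: a whitespace word count stands in for a real tokenizer.
    return len(text.split())


def merge_adjacent(chunks: List[Chunk], max_tokens: int) -> Iterator[Chunk]:
    """Merge neighbouring chunks that share headings and captions,
    as long as the merged text stays within max_tokens."""
    merged: List[Chunk] = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        same_context = (
            prev is not None
            and prev.headings == chunk.headings
            and prev.captions == chunk.captions
        )
        if same_context and count_tokens(prev.text) + count_tokens(chunk.text) <= max_tokens:
            # Same section and still within budget: extend the previous chunk.
            merged[-1] = Chunk(prev.text + "\n" + chunk.text, prev.headings, prev.captions)
        else:
            merged.append(chunk)
    # Everything is built as a list first, then wrapped in an iterator
    # only at the end to satisfy an iterator-returning API.
    return iter(merged)
```

A generator-based variant would yield each merged chunk as soon as the next chunk fails to merge, which avoids holding the full output in memory; the list version trades that for simpler code.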
@vagenas Let's review together later today. I do not see any blocker so far.