Sample chunking notebook that includes merging, etc.

Open jwm4 opened this issue 1 year ago • 1 comment

Some key differences between this proposed chunking notebook and the one in advanced_chunking.ipynb:

This one merges chunks that have the same headings and captions (e.g., adjacent paragraphs within the same section).
This one splits on doc_items, such as the elements of an itemized list, before trying to apply generic text splitting. This results in chunks that respect the beginning and end of list items more often.
This one uses DoclingDocument.name as the title of the document instead of assuming that the title will be in the headers. That's probably not a great idea going forward, though, because in the near future the extracted title will be in the headers. DoclingDocument.name comes from document metadata; it sometimes reflects the title but is often not very useful.
This one uses semchunk as the plain text splitter when the hierarchical elements are too big. The semchunk repo lays out their argument for why this is a good generic text splitter. I also tried it on some tricky examples and liked the output in practice.
This one does not use yield to stream out the chunks one at a time -- it just uses lists for everything and then wraps them in an iterator at the end to comply with the API. That seems simpler but is probably less efficient, especially when processing documents at large scale.
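
To make the merging and list-then-iterator points concrete, here is a minimal sketch of the idea, not the notebook's actual code. The `Chunk` dataclass, `merge_adjacent_chunks`, `chunk_iter`, and the `max_chars` budget are all hypothetical names introduced for illustration; the real notebook works on Docling chunk objects and would more likely count tokens than characters.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk; not the Docling chunk type."""
    text: str
    headings: Tuple[str, ...] = ()
    captions: Tuple[str, ...] = ()


def merge_adjacent_chunks(
    chunks: List[Chunk], max_chars: Optional[int] = None
) -> List[Chunk]:
    """Merge neighbouring chunks that share the same headings and captions.

    max_chars is an assumed size budget in characters; None disables it.
    """
    merged: List[Chunk] = []
    for chunk in chunks:
        if merged:
            last = merged[-1]
            same_context = (
                last.headings == chunk.headings
                and last.captions == chunk.captions
            )
            combined = last.text + "\n" + chunk.text
            fits = max_chars is None or len(combined) <= max_chars
            if same_context and fits:
                # Extend the previous chunk instead of starting a new one.
                merged[-1] = Chunk(combined, last.headings, last.captions)
                continue
        merged.append(chunk)
    return merged


def chunk_iter(chunks: List[Chunk]) -> Iterator[Chunk]:
    """Build the full list first, then wrap it in an iterator at the end."""
    return iter(merge_adjacent_chunks(chunks))


chunks = [
    Chunk("First paragraph.", headings=("Intro",)),
    Chunk("Second paragraph.", headings=("Intro",)),
    Chunk("Method details.", headings=("Methods",)),
]
result = list(chunk_iter(chunks))
for c in result:
    print(c.headings, repr(c.text))
```

Here the two "Intro" paragraphs end up in one chunk, while the "Methods" chunk stays separate. A streaming variant would replace the list in `merge_adjacent_chunks` with a generator that yields each chunk once no further merge is possible, at the cost of slightly more bookkeeping.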

Nov 01 '24 12:11 jwm4

@vagenas Let's review together later today. I do not see any blocker so far.

Nov 05 '24 06:11 PeterStaar-IBM