Update advanced_chunking_with_merging.ipynb
Fix a bug where the next chunk's headings and captions were used for the current merged chunk
I think the error in the following code is clear: it builds new_meta from headings_and_captions, which belongs to the mismatched chunk (the one that ended the window), rather than from the chunks actually being merged. This causes an off-by-one error in the merged chunk's metadata. I changed it to use current_headings_and_captions instead.
else:
    # no more room OR the start of new metadata. Either way, end the block and use the current window_end as the start of a new block
    if window_start + 1 == window_end:
        # just one chunk so use it as is
        output_chunks.append(first_chunk_of_window)
    else:
        new_meta = DocMeta(
            doc_items=window_items,
            headings=headings_and_captions[0],
            captions=headings_and_captions[1],
        )
        new_chunk = DocChunk.from_data(
            text=window_text,
            meta=new_meta,
            delim=self.delim,
        )
        output_chunks.append(new_chunk)
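To make the off-by-one easier to see outside the notebook, here is a minimal standalone sketch of the same windowed merge. It does not use the docling classes: `Chunk`, `merge_chunks`, and `max_len` are hypothetical names chosen for illustration, and text length stands in for the token count. The point it shows is that the merged chunk takes its metadata from the window it was built from (`current_meta`), not from the chunk that closed the window.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    # Hypothetical stand-in for a document chunk; not the docling DocChunk.
    text: str
    headings: Optional[Tuple[str, ...]] = None
    captions: Optional[Tuple[str, ...]] = None


def merge_chunks(chunks: List[Chunk], max_len: int, delim: str = "\n") -> List[Chunk]:
    """Merge adjacent chunks that share headings/captions and fit in max_len."""
    output: List[Chunk] = []
    start = 0
    while start < len(chunks):
        first = chunks[start]
        # Metadata of the window being built; this is what the merged chunk keeps.
        current_meta = (first.headings, first.captions)
        texts = [first.text]
        end = start + 1
        while end < len(chunks):
            candidate = chunks[end]
            candidate_meta = (candidate.headings, candidate.captions)
            candidate_text = delim.join(texts + [candidate.text])
            if candidate_meta == current_meta and len(candidate_text) <= max_len:
                texts.append(candidate.text)
                end += 1
            else:
                # No more room OR new metadata: the candidate starts the next window.
                break
        if start + 1 == end:
            # Just one chunk in the window, so use it as is.
            output.append(first)
        else:
            # Use current_meta here; using candidate_meta would attach the
            # next window's headings to this merged chunk (the reported bug).
            output.append(
                Chunk(
                    text=delim.join(texts),
                    headings=current_meta[0],
                    captions=current_meta[1],
                )
            )
        start = end
    return output
```

With two chunks under heading "H1" followed by one under "H2", the first two are merged and keep "H1", while the third stays on its own with "H2".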
@bash99 nice catch! Since we have been working on providing the new chunker within https://github.com/DS4SD/docling-core/pull/68, and to speed things up, I added your fix directly there, giving author credit via the respective commit message trailer. I hope this is ok for you?
Yes, thanks for your great project.