Update advanced_chunking_with_merging.ipynb

Open bash99 opened this issue 1 year ago • 1 comments

Fix a bug that used the new chunk's headings and captions for the current merged chunk

I think the error in the following code is clear: it builds new_meta for the merged chunk from headings_and_captions, which belongs to a mismatched chunk (the one that ended the window), and this causes an off-by-one error. I changed it to use current_headings_and_captions instead.

            else:
                # no more room OR the start of new metadata.  Either way, end the block and use the current window_end as the start of a new block
                if window_start + 1 == window_end:
                    # just one chunk so use it as is
                    output_chunks.append(first_chunk_of_window)
                else:
                    new_meta = DocMeta(
                        doc_items=window_items,
                        headings=headings_and_captions[0],
                        captions=headings_and_captions[1],
                    )
                    new_chunk = DocChunk.from_data(
                        text=window_text,
                        meta=new_meta,
                        delim=self.delim,
                    )
                    output_chunks.append(new_chunk)
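
For reference, here is a minimal self-contained sketch of the windowed merge with the fix applied. `Chunk`, `close_window`, `merge_adjacent`, and the character-based `max_len` budget are hypothetical stand-ins, not docling's `DocChunk`/`DocMeta` API or the notebook's token counting; only the variable names `headings_and_captions` and `current_headings_and_captions` are taken from the code above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a chunk: text plus heading/caption metadata."""
    text: str
    headings: Optional[Tuple[str, ...]] = None
    captions: Optional[Tuple[str, ...]] = None


def close_window(window, window_meta, delim):
    """Turn the collected window into one chunk, using the window's own metadata."""
    if len(window) == 1:
        # Just one chunk, so use it as is.
        return window[0]
    return Chunk(
        text=delim.join(c.text for c in window),
        headings=window_meta[0],
        captions=window_meta[1],
    )


def merge_adjacent(chunks, max_len=50, delim="\n"):
    """Merge neighbouring chunks that share metadata and fit in max_len characters."""
    output = []
    window = []
    window_len = 0
    current_headings_and_captions = (None, None)
    for chunk in chunks:
        # Metadata of the chunk being inspected; it may NOT belong to the window.
        headings_and_captions = (chunk.headings, chunk.captions)
        fits = (
            bool(window)
            and headings_and_captions == current_headings_and_captions
            and window_len + len(delim) + len(chunk.text) <= max_len
        )
        if fits:
            window.append(chunk)
            window_len += len(delim) + len(chunk.text)
            continue
        if window:
            # The fix: close the window with current_headings_and_captions,
            # not with headings_and_captions of the chunk that ended it.
            output.append(close_window(window, current_headings_and_captions, delim))
        # The inspected chunk starts the next window.
        window = [chunk]
        window_len = len(chunk.text)
        current_headings_and_captions = headings_and_captions
    if window:
        output.append(close_window(window, current_headings_and_captions, delim))
    return output


chunks = [
    Chunk("a", headings=("H1",)),
    Chunk("b", headings=("H1",)),
    Chunk("c", headings=("H2",)),
    Chunk("d", headings=("H2",)),
]
merged = merge_adjacent(chunks)
for chunk in merged:
    print(chunk.headings, repr(chunk.text))
```

In this sketch, passing `headings_and_captions` to `close_window` inside the loop would give the first merged chunk the `("H2",)` headings of the chunk that ended its window, which is the offset described above.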

bash99 avatar Dec 03 '24 09:12 bash99

@bash99 nice catch! Since we have been working on providing the new chunker within https://github.com/DS4SD/docling-core/pull/68, and to speed things up, I added your fix directly there, giving author credit via the respective commit message trailer. I hope this is ok for you?

vagenas avatar Dec 05 '24 10:12 vagenas

Yes, thanks for your great project.

bash99 avatar Dec 06 '24 02:12 bash99