Add option to export_to_markdown to mark page breaks

Open cau-git opened this issue 1 year ago • 11 comments

As suggested in this discussion, we should add a placeholder feature for page breaks, the same way we support placeholders for pictures.

  • The placeholder text for page breaks should by default contain the page number.
  • It should be disabled by default.

cau-git avatar Nov 11 '24 15:11 cau-git

Hi @cau-git

I am interested in working on this issue.

chakravarthik27 avatar Nov 12 '24 07:11 chakravarthik27

@chakravarthik27 Thanks, a contribution for this issue would be very welcome!

As a starting point, it would require extension of this method.

cau-git avatar Nov 12 '24 08:11 cau-git

@chakravarthik27 Hi, I'm also very interested in resolving this issue. I've found a solution that works in my case, though I'm not sure if it’s a general fix. Would you mind if I also worked on this?

sunwoongc avatar Nov 15 '24 00:11 sunwoongc

@chakravarthik27 Absolutely not, go ahead! Just make a PR in https://github.com/DS4SD/docling-core!

PeterStaar-IBM avatar Nov 16 '24 07:11 PeterStaar-IBM

I'm a bit confused and haven't started yet, so please continue, @sunwoongc.

chakravarthik27 avatar Nov 16 '24 08:11 chakravarthik27

@sunwoongc @chakravarthik27 Please coordinate with each other for the markdown pagebreaks and let us know when you expect it to be done.

PeterStaar-IBM avatar Nov 18 '24 08:11 PeterStaar-IBM

@chakravarthik27 @PeterStaar-IBM

Thank you for your input!

I noticed a simple but important detail: most items inheriting from DocItem have an attribute called prov, which includes a page_no field for tracking provenance. For reference, here's the ProvenanceItem.

However, the GroupItem class lacks this attribute, as it's designated as a container type. See GroupItem.

To handle this in the export_to_markdown function, I've added the following code:

prev_page_no = -1
for ix, (item, level) in enumerate(doc.iterate_items(doc.body, with_groups=True)):
    # GroupItem is a container without prov; also skip items whose prov list is empty
    if not isinstance(item, GroupItem) and item.prov:
        cur_page_no = item.prov[0].page_no

        # Append a marker if the page has changed
        if cur_page_no != prev_page_no:
            mdtexts.append(f"Page {cur_page_no}")

        # Update the previous page number after handling the change
        prev_page_no = cur_page_no

I'm concerned this solution may encounter edge cases. If you have any suggestions or foresee potential issues, I'd appreciate your feedback!
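One edge case worth checking is an item whose prov list is empty, where indexing prov[0] would raise an IndexError. The sketch below reproduces the page-change logic without docling installed. The classes Prov, Item, and Group are hypothetical stand-ins for ProvenanceItem, DocItem, and GroupItem, and render_with_page_markers is an illustrative helper, not part of docling-core. It also only looks at the first provenance entry, so an item spanning several pages is attributed to its first page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Prov:
    """Hypothetical stand-in for ProvenanceItem; only page_no is modeled."""
    page_no: int


@dataclass
class Item:
    """Hypothetical stand-in for a DocItem with text and provenance."""
    text: str
    prov: List[Prov] = field(default_factory=list)


@dataclass
class Group:
    """Hypothetical stand-in for GroupItem: a container without prov."""
    name: str = "group"


def render_with_page_markers(items):
    """Return text fragments with a 'Page N' marker before each new page."""
    out = []
    prev_page_no = None
    for item in items:
        if isinstance(item, Group):
            continue  # containers carry no provenance
        if item.prov:  # items without provenance stay on the current page
            cur_page_no = item.prov[0].page_no
            if cur_page_no != prev_page_no:
                out.append(f"Page {cur_page_no}")
            prev_page_no = cur_page_no
        out.append(item.text)
    return out


items = [
    Item("Title", [Prov(1)]),
    Group(),
    Item("No provenance here"),  # prov[0] would raise IndexError on this item
    Item("Second page text", [Prov(2)]),
]
result = render_with_page_markers(items)
```

Here the item without provenance is simply kept on the page that was current when it was encountered, which is one possible policy among several.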

sunwoongc avatar Nov 19 '24 00:11 sunwoongc

Is this deployed in the current version?

simjak avatar Dec 12 '24 07:12 simjak

Any update on this feature, please?

calls9-amirbraham avatar Dec 27 '24 16:12 calls9-amirbraham

Any update on this? I want this feature so that my chunks carry page numbers for later reference through an LLM.

amal5haji avatar Jan 29 '25 14:01 amal5haji

Page breaks (numbers) would be really helpful in the output when the entire document is sent to an LLM as context without any chunking, since they can be used for generating references/citations. Docling really outperforms unstructured in most cases, and page numbers would bring feature parity.

rhlarora84 avatar Feb 17 '25 17:02 rhlarora84

Any update on page markers in the Markdown output?

amal5haji avatar Mar 06 '25 16:03 amal5haji

I'm working on a project that requires creating a deep link to the original PDF (directly to the page where the chunk is located).

Does anyone have a solution to share?

My original approach would be to use something like PyPDF to split the document into individual pages and then use docling to convert it to Markdown, inserting a comment with the page number. This also makes me wonder about chunk size and how to ensure that each chunk contains a comment with the page number...

Any ideas?
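A minimal sketch of the chunking side of this idea, assuming the per-page Markdown strings have already been produced (for example by converting each split page separately). The helpers chunk_pages, tag_chunk, and deep_link are hypothetical names introduced for illustration. Chunks are built paragraph by paragraph and never cross a page boundary, so every chunk can carry exactly one page comment. The `#page=N` fragment is honored by common browser PDF viewers, but support should be verified for the target viewer.

```python
from typing import Iterator, List, Tuple


def chunk_pages(pages: List[str], max_chars: int = 500) -> Iterator[Tuple[int, str]]:
    """Yield (page_no, chunk) pairs; chunks never span two pages."""
    for page_no, page_md in enumerate(pages, start=1):
        current = ""
        for para in page_md.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            # Start a new chunk when adding this paragraph would exceed the limit
            if current and len(current) + len(para) + 2 > max_chars:
                yield page_no, current
                current = ""
            current = f"{current}\n\n{para}" if current else para
        if current:
            yield page_no, current


def tag_chunk(page_no: int, chunk: str) -> str:
    """Prefix a chunk with an HTML comment carrying its page number."""
    return f"<!-- page: {page_no} -->\n{chunk}"


def deep_link(pdf_url: str, page_no: int) -> str:
    """Build a link using the '#page=N' fragment."""
    return f"{pdf_url}#page={page_no}"


pages = ["# Intro\n\nFirst page text.", "Second page text.\n\nMore text."]
chunks = [(n, tag_chunk(n, c)) for n, c in chunk_pages(pages, max_chars=20)]
links = [deep_link("https://example.com/doc.pdf", n) for n, _ in chunks]
```

Note that a single paragraph longer than max_chars is kept whole in this sketch; splitting inside paragraphs would need an extra step.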

diegovelezg avatar Mar 09 '25 14:03 diegovelezg

I also implemented it in docling-core. I can open a pull request if no one else is working on it.

Adolar13 avatar Mar 09 '25 15:03 Adolar13

@Adolar13 Go ahead!

PeterStaar-IBM avatar Mar 10 '25 06:03 PeterStaar-IBM

How can I use it? Is it merged?

amal5haji avatar Mar 11 '25 15:03 amal5haji

Finally opened a PR here: https://github.com/docling-project/docling-core/pull/194

It took a bit longer due to the recent redesign with serializers.

Adolar13 avatar Mar 14 '25 21:03 Adolar13

Markdown page breaks added with https://github.com/docling-project/docling-core/pull/213 and released with docling-core v2.24.0.

Usage can be seen here.

vagenas avatar Mar 25 '25 22:03 vagenas

I'm confused, why don't you guys just run a conversion from the json output to markdown...? You can keep everything this way lol 😂
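A rough sketch of what that could look like, grouping text by page from the JSON side. The dictionary shape used here (a "texts" list whose entries have "text" and a "prov" list with "page_no") is an assumed, simplified stand-in for the real export, which has many more fields. The sketch also ignores tables, pictures, and reading order, and relies on the entries being listed in page order.

```python
import json
from collections import defaultdict

# Assumed, simplified shape of the exported JSON (illustrative data only)
raw = json.dumps({
    "texts": [
        {"text": "Title", "prov": [{"page_no": 1}]},
        {"text": "Body on page one", "prov": [{"page_no": 1}]},
        {"text": "Body on page two", "prov": [{"page_no": 2}]},
        {"text": "No provenance", "prov": []},
    ]
})

doc = json.loads(raw)
by_page = defaultdict(list)
for entry in doc.get("texts", []):
    prov = entry.get("prov") or []
    # Entries without provenance are collected under the key None
    page_no = prov[0]["page_no"] if prov else None
    by_page[page_no].append(entry["text"])

# Emit one page comment per page, skipping entries without a page number
markdown = "\n\n".join(
    f"<!-- page: {page_no} -->\n" + "\n\n".join(texts)
    for page_no, texts in by_page.items()
    if page_no is not None
)
```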

QiTianDaSh3ng avatar Mar 28 '25 01:03 QiTianDaSh3ng

@vagenas thank you for contributing page breaks

I wanted to ask whether page numbers are coming. In our use case, users need to reference a page number of a document.

anuar12 avatar Apr 25 '25 13:04 anuar12

> @vagenas thank you for contributing page breaks
>
> I wanted to ask whether page numbers are coming. In our use case, users need to reference a page number of a document.

Hey, I needed something similar today, so I used page breaks and numbered the pages with something like this:

placeholder = "<-- Page Break -->"
markdown_output = doc.export_to_markdown(page_break_placeholder=placeholder)

# Split on the placeholder and append a page-number footer after each page,
# including the final page, which has no placeholder after it
pages = markdown_output.split(placeholder)
markdown_with_pages = "".join(
    f"{page}\n\n--- Page {page_index} ---\n\n"
    for page_index, page in enumerate(pages, start=1)
)

I'll think about proposing this for integration into an official docling release in the future, of course after optimizing it.

Ouassim-Hamdani avatar Nov 04 '25 23:11 Ouassim-Hamdani