Add option to export_to_markdown to mark page breaks
As suggested in this discussion, we should add a placeholder feature for page breaks, the same way we support placeholders for pictures.
- The placeholder text for page breaks should by default contain the page number.
- It should be disabled by default.
Hi @cau-git
I am interested in working on this issue.
@chakravarthik27 Thanks, we would welcome your contribution on this issue!
As a starting point, it would require extending this method.
@chakravarthik27 Hi, I'm also very interested in resolving this issue. I've found a solution that works in my case, though I'm not sure if it’s a general fix. Would you mind if I also worked on this?
@chakravarthik27 Absolutely not, go ahead! Just make a PR in https://github.com/DS4SD/docling-core!
I got confused and haven't started yet, so please continue @sunwoongc
@sunwoongc @chakravarthik27 Please coordinate with each other for the markdown pagebreaks and let us know when you expect it to be done.
@chakravarthik27 @PeterStaar-IBM
Thank you for your input!
I noticed a simple but important detail: most items inheriting from DocItem have an attribute called prov, which includes a page_no field for tracking provenance. For reference, here's the ProvenanceItem.
However, the GroupItem class lacks this attribute, as it's designated as a container type. See GroupItem.
To handle this in the export_to_markdown function, I've added the following code:
prev_page_no = -1
page_change_flag = False
for ix, (item, level) in enumerate(doc.iterate_items(doc.body, with_groups=True)):
    # GroupItem has no prov; also skip items whose prov list is empty
    if not isinstance(item, GroupItem) and item.prov:
        cur_page_no = item.prov[0].page_no
        page_change_flag = prev_page_no != cur_page_no
        # Append text if page has changed
        if page_change_flag:
            mdtexts.append(f"Page {cur_page_no}")
        # Update previous page number after handling change
        prev_page_no = cur_page_no
I'm concerned this solution may encounter edge cases. If you have any suggestions or foresee potential issues, I'd appreciate your feedback!
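One edge case the snippet above may run into is an item whose prov list is empty, or a group item appearing between two pages. Below is a minimal, self-contained sketch of the same idea that guards against both. The classes Prov, Item, and Group are hypothetical stand-ins for illustration only, not the real docling-core types, and render_with_page_markers is not part of any library.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Union


@dataclass
class Prov:  # hypothetical stand-in for ProvenanceItem
    page_no: int


@dataclass
class Item:  # hypothetical stand-in for a DocItem subclass
    text: str
    prov: List[Prov] = field(default_factory=list)


@dataclass
class Group:  # hypothetical stand-in for GroupItem (has no prov attribute)
    name: str


def render_with_page_markers(items: Iterable[Union[Item, Group]]) -> List[str]:
    """Emit a "Page N" marker whenever the page number changes."""
    out: List[str] = []
    prev_page_no: Optional[int] = None
    for item in items:
        prov = getattr(item, "prov", None)
        if prov:  # false for groups and for items without provenance
            page_no = prov[0].page_no
            if page_no != prev_page_no:
                out.append(f"Page {page_no}")
                prev_page_no = page_no
        if isinstance(item, Item):
            out.append(item.text)
    return out


items = [
    Group("body"),
    Item("Title", [Prov(1)]),
    Item("orphan text"),  # no provenance: stays on the current page
    Item("Next section", [Prov(2)]),
]
print(render_with_page_markers(items))
```

Items without provenance are simply attached to whatever page came before them, which is an assumption of this sketch rather than documented behavior.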
Is this deployed in the current version?
Any update on this feature, please?
Any update on this? I want this feature so that my chunks carry a page number for later reference through an LLM.
Page breaks (numbers) would be really helpful in the output when the entire document is sent to an LLM as context without chunking, since they can be used for generating references/citations. Docling performs better than unstructured in most cases, and page numbers would bring it to feature parity.
Any update on page markers on markdown?
I'm working on a project that requires creating a deep link to the original PDF (directly to the page where the chunk is located).
Does anyone have a solution to share?
My original approach would be to use something like PyPDF to split the document into individual pages and then use docling to convert it to Markdown, inserting a comment with the page number. This also makes me wonder about chunk size and how to ensure that each chunk contains a comment with the page number...
Any ideas?
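One possible way to keep a page number on every chunk, sketched under assumptions: suppose the Markdown already contains one placeholder string at each page boundary (PAGE_BREAK below is an assumed value, and chunks_with_pages is a hypothetical helper, not a docling function). Chunking inside each page means no chunk spans two pages, and the page number can be turned into a deep link with the `#page=N` URL fragment that most browser PDF viewers honor.

```python
from typing import Dict, List

PAGE_BREAK = "<!-- page break -->"  # assumed placeholder string


def chunks_with_pages(markdown: str, pdf_url: str, max_chars: int = 500) -> List[Dict]:
    """Split Markdown into per-page chunks, each tagged with page and link."""
    chunks: List[Dict] = []
    # Assumes the text before the first placeholder is page 1.
    for page_no, page_text in enumerate(markdown.split(PAGE_BREAK), start=1):
        page_text = page_text.strip()
        # Chunk within a page so that no chunk crosses a page boundary.
        for start in range(0, len(page_text), max_chars):
            chunks.append(
                {
                    "page": page_no,
                    "link": f"{pdf_url}#page={page_no}",
                    "text": page_text[start:start + max_chars],
                }
            )
    return chunks


sample = "Intro" + PAGE_BREAK + "Details on page two"
for chunk in chunks_with_pages(sample, "report.pdf", max_chars=10):
    print(chunk)
```

The fixed-size character split is only for illustration; any chunker could be applied per page in its place.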
I have also implemented it in docling-core. I can open a pull request if no one else is working on it.
@Adolar13 Go ahead!
How can I use it? Is it merged?
Finally opened a PR here: https://github.com/docling-project/docling-core/pull/194
It took a bit longer due to the recent redesign with serializers.
Markdown page breaks added with https://github.com/docling-project/docling-core/pull/213 and released with docling-core v2.24.0.
Usage can be seen here.
I'm confused, why don't you guys just run a conversion from the json output to markdown...? You can keep everything this way lol 😂
@vagenas thank you for contributing page breaks
I wanted to ask whether page numbers are coming. In our use case, users need to reference the page number of a document.
Hey, I needed something similar today, so I used page breaks and numbered the pages with something like this:
markdown_output = doc.export_to_markdown(page_break_placeholder="<-- Page Break -->")
markdown_with_pages = markdown_output
page_index = 1
while "<-- Page Break -->" in markdown_with_pages:
    markdown_with_pages = markdown_with_pages.replace("<-- Page Break -->", f"\n\n--- Page {page_index} ---\n\n", 1)
    page_index += 1
markdown_with_pages += f"\n\n--- Page {page_index} ---\n\n"  # for final page
I'll think about proposing this for integration into an official docling version in the future, after optimizing it of course.
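As a possible optimization, the loop above can be written as a single pass by splitting on the placeholder. This is only a sketch: number_pages is a hypothetical helper name, and it assumes the placeholder appears exactly once between consecutive pages, so each marker lands after the page it numbers, as in the loop version.

```python
def number_pages(markdown: str, placeholder: str) -> str:
    """Replace each page-break placeholder with a numbered page marker."""
    pages = markdown.split(placeholder)
    # Each page is followed by its own marker, including the final page.
    return "".join(
        f"{page}\n\n--- Page {index} ---\n\n"
        for index, page in enumerate(pages, start=1)
    )


sample = "first page<-- Page Break -->second page"
print(number_pages(sample, "<-- Page Break -->"))
```

Splitting once avoids rescanning the whole string for every page, which the repeated `in` check and `replace` call do.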