
Add page-level text extraction for PDF/PPTX/DOCX documents

Open jeonsworld opened this issue 8 months ago • 12 comments

Summary

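Adds optional page-level extraction to the PDF, PPTX, and DOCX converters through a new extract_pages parameter. When it is enabled, the result carries structured per-page data; when it is omitted, behavior is unchanged, so existing callers stay fully backward compatible.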

Motivation

Users need to process PDF/PPTX/DOCX pages separately and know which content comes from which page for page-aware applications. Additionally, local development settings should not be tracked in version control.

Changes

  • New PageInfo class: Stores page number and content
  • Enhanced DocumentConverterResult: Added optional pages attribute
  • Extended converters: Added extract_pages parameter for page-by-page processing in PDF, PPTX, and DOCX converters
  • CLI support: Added --extract-pages and --pages-json flags
  • Comprehensive tests: Test cases covering all scenarios for each format
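As a rough illustration of the first two bullets, the data model might look something like the sketch below. Only the names PageInfo, pages, page_number, content, and extract_pages come from this description; the 1-based numbering, the reduced DocumentConverterResult stand-in, and the build_result helper are assumptions made for the example and are not the PR's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageInfo:
    """One page (or slide) of extracted content."""
    page_number: int  # assumed 1-based; the description does not say
    content: str


@dataclass
class DocumentConverterResult:
    """Stand-in for the library's result type, reduced to the fields discussed here."""
    markdown: str
    pages: Optional[List[PageInfo]] = None  # stays None unless pages were requested


def build_result(page_texts: List[str], extract_pages: bool = False) -> DocumentConverterResult:
    """Hypothetical helper: join per-page text and optionally keep the page split."""
    markdown = "\n\n".join(page_texts)
    pages = None
    if extract_pages:
        pages = [
            PageInfo(page_number=number, content=text)
            for number, text in enumerate(page_texts, start=1)
        ]
    return DocumentConverterResult(markdown=markdown, pages=pages)
```

Keeping pages as an optional attribute that defaults to None is what would let existing callers ignore it entirely.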

Usage

Python API

from markitdown import MarkItDown

md = MarkItDown()

# Traditional (unchanged)
result = md.convert("doc.pdf")

# New page extraction - works for PDF, PPTX, and DOCX
result = md.convert("doc.pdf", extract_pages=True)
result = md.convert("presentation.pptx", extract_pages=True)
result = md.convert("document.docx", extract_pages=True)

for page in result.pages:
    print(f"Page {page.page_number}: {page.content}")
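For the page-aware use case named in the motivation, a consumer could look something like the sketch below. It relies only on the pages, page_number, and content attributes shown above; the pages_mentioning helper and the SimpleNamespace stand-ins are hypothetical and exist only so the example runs without the library or a real document.

```python
from types import SimpleNamespace
from typing import List


def pages_mentioning(result, term: str) -> List[int]:
    """Return the page numbers whose content contains `term` (case-insensitive)."""
    # A result produced without page extraction is assumed to have no pages.
    pages = getattr(result, "pages", None) or []
    needle = term.lower()
    return [page.page_number for page in pages if needle in page.content.lower()]


# Stand-in for a result produced with extract_pages=True (not real library output).
demo = SimpleNamespace(pages=[
    SimpleNamespace(page_number=1, content="Introduction and scope"),
    SimpleNamespace(page_number=2, content="Budget overview"),
    SimpleNamespace(page_number=3, content="Budget details and risks"),
])

print(pages_mentioning(demo, "budget"))  # [2, 3]
```

Treating a missing or None pages attribute as an empty list keeps the helper safe to call on results from the traditional code path.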

CLI

# Extract pages with JSON output
markitdown doc.pdf --extract-pages --pages-json
markitdown presentation.pptx --extract-pages --pages-json
markitdown document.docx --extract-pages --pages-json

Resolves #210, #122

jeonsworld • May 23 '25 07:05

@microsoft-github-policy-service agree

jeonsworld • May 23 '25 07:05

I like this idea. It meshes well with the pptx slide output as well.

I need to do a little testing before merging -- I'll try to do that this weekend.

afourney • May 23 '25 20:05

Hi team - any ETA on the release of this PR? This would greatly help our project.

mcchoe • Jun 12 '25 03:06

@jeonsworld It looks like some of the status checks are still pending. We need this feature for our project, so please move it forward.

kanemaru-nec • Jun 12 '25 08:06

@afourney Hi, the workflows for this PR are currently pending approval. Could you please review and approve them so the checks can run? Thank you.

jeonsworld • Jun 12 '25 12:06

Hello everyone and @afourney,

Apologies for the tagging but I was wondering if there is an ETA on this? It's something that would be very useful overall and also for a particular project my team is working on.

gaccastro • Jul 03 '25 08:07

This feature will be very useful so I am also wondering when this can be approved. Thank you.

hkaraoguz • Jul 05 '25 11:07

Has this been implemented yet?

Would be helpful to me as well!

nuldertien • Jul 15 '25 14:07

Hi! First of all, thank you so much for developing this feature. Could you please let me know when this version will be released? It would be incredibly helpful for my project!

Abhiraj-Alois • Jul 30 '25 07:07

This is such an important feature. Can we get a new build (0.7.2) with the extract_pages feature added to it?

dj953590 • Aug 03 '25 00:08

Hi, I want to ask whether this will be merged into the main branch. This is a really important feature.

semor-joe • Sep 18 '25 15:09