Use Marker for PDF text extraction

Open • clemlesne opened this issue 5 months ago • 3 comments

Marker is a library that extracts the content of PDFs quickly while preserving semantic context. It has both GPU acceleration and LLM support. Output can be Markdown or a structured format.

Config is simple:

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Load the models the converter relies on.
converter = PdfConverter(
    artifact_dict=create_model_dict(),
)
# Convert the PDF, then pull the text and images out of the result.
rendered = converter("FILEPATH")
text, _, images = text_from_rendered(rendered)
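
The snippet above leaves `text` and `images` in memory. A minimal sketch of writing them to disk follows, assuming `images` is a mapping from file name to an object with a PIL-style `.save(path)` method. The helper name `save_marker_output` and the `output.md` file name are hypothetical, chosen for illustration, and are not part of Marker.

```python
from pathlib import Path
from typing import Any, Mapping


def save_marker_output(text: str, images: Mapping[str, Any], out_dir: str) -> Path:
    """Write extracted text and images to out_dir and return the text file path.

    Assumption: each value in `images` exposes a PIL-style `.save(path)` method.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Hypothetical file name for the extracted text.
    text_path = out / "output.md"
    text_path.write_text(text, encoding="utf-8")

    for name, image in images.items():
        # Path(name).name drops any directory part so files stay inside out_dir.
        image.save(out / Path(name).name)

    return text_path
```

With the variables from the snippet above, a call would look like `save_marker_output(text, images, "out")`.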

How it works:

  • Extract text, OCR if necessary (heuristics, surya)
  • Detect page layout and find reading order (surya)
  • Clean and format each block (heuristics, texify, surya)
  • Optionally use an LLM to improve quality
  • Combine blocks and postprocess complete text
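
To make the flow of these stages concrete, here is a toy sketch in plain Python. It is not Marker's implementation: the `Block` structure, the function names, and the cleanup rules are all hypothetical, and the model-driven stages (OCR, layout detection) are assumed to have already produced the input blocks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Block:
    """One layout region on a page (hypothetical structure, not Marker's)."""
    page: int
    order: int  # reading-order position within the page
    kind: str   # e.g. "heading" or "text"
    text: str


def clean_block(block: Block) -> Block:
    # Heuristic cleanup: collapse whitespace runs left over from extraction.
    return Block(block.page, block.order, block.kind, " ".join(block.text.split()))


def format_block(block: Block) -> str:
    # Render a block as Markdown based on its detected kind.
    return f"# {block.text}" if block.kind == "heading" else block.text


def run_pipeline(
    blocks: List[Block],
    llm_fix: Optional[Callable[[str], str]] = None,
) -> str:
    # Steps 1-2: extraction and layout detection are assumed done;
    # here we only sort the blocks into reading order.
    ordered = sorted(blocks, key=lambda b: (b.page, b.order))
    # Step 3: clean each block, drop empty ones, then format.
    cleaned = [clean_block(b) for b in ordered]
    rendered = [format_block(b) for b in cleaned if b.text]
    # Step 4: optionally pass each block through an LLM-style callback.
    if llm_fix is not None:
        rendered = [llm_fix(r) for r in rendered]
    # Step 5: combine blocks, separated by blank lines.
    return "\n\n".join(rendered)
```

Passing the optional stage as a callback keeps the pipeline usable without any LLM, which mirrors the "optionally" in the list above.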

NB: I'm not a maintainer of the project.

clemlesne • Aug 13 '25 07:08

❤️

mshahzad458701-tech • Aug 15 '25 12:08

Marker is great, but unfortunately, the idea of a heuristic pipeline with multiple fine-tuned specialized models ignores the bitter lesson.

I only see a future for PDF extraction using general-purpose vision language models (which is why I maintain this more opinionated package).

emcf • Oct 01 '25 14:10

One important thing to be aware of about Marker:

Commercial usage

Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page here.

demobvs • Oct 18 '25 10:10