Use Marker for PDF text extraction
Marker is a library that extracts the content of PDFs quickly while preserving semantic context. It supports both GPU acceleration and optional LLM assistance, and output can be Markdown or a structured format.
Basic usage is simple:
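from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

# Build the converter once; create_model_dict() loads the models it needs
converter = PdfConverter(
    artifact_dict=create_model_dict(),
)
rendered = converter("FILEPATH")  # replace FILEPATH with the path to your PDF
# text is the extracted output; images holds the images extracted alongside it
text, _, images = text_from_rendered(rendered)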
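Once you have `text` and `images`, you will usually want them on disk. Here is a minimal sketch of one way to do that. The helper name `save_extraction` and the `output.md` filename are my own, not part of Marker, and it assumes `images` is a dict mapping file names to PIL-style objects with a `save(path)` method.

```python
from pathlib import Path


def save_extraction(text, images, out_dir):
    """Write the extracted text to <out_dir>/output.md and save images next to it.

    Hypothetical helper, not part of Marker.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    md_path = out / "output.md"
    md_path.write_text(text, encoding="utf-8")
    saved = [md_path]

    for name, image in images.items():
        # Assumption: each value exposes a PIL-style save(path) method.
        # Path(name).name drops any directory part so files stay inside out_dir.
        img_path = out / Path(name).name
        image.save(img_path)
        saved.append(img_path)

    return saved
```

With the snippet above, `save_extraction(text, images, "out")` would leave the Markdown and its images together in `out/`.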
How it works:
- Extract text, OCR if necessary (heuristics, surya)
- Detect page layout and find reading order (surya)
- Clean and format each block (heuristics, texify, surya)
- Optionally use an LLM to improve quality
- Combine blocks and postprocess complete text
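To make the shape of that pipeline concrete, here is a toy sketch of "clean each block, optionally pass it through an LLM, then combine". It is purely illustrative and is not Marker's code: `Block`, `clean_block`, `format_block`, `run_pipeline` and the `llm_fix` hook are all hypothetical names, and the real stages use models such as surya and texify rather than string operations.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Block:
    kind: str  # e.g. "heading" or "text"
    text: str


def clean_block(block: Block) -> Block:
    # Stand-in for per-block cleanup: collapse runs of whitespace.
    return Block(block.kind, " ".join(block.text.split()))


def format_block(block: Block) -> str:
    # Stand-in for per-block formatting into Markdown.
    return f"# {block.text}" if block.kind == "heading" else block.text


def run_pipeline(
    blocks: List[Block],
    llm_fix: Optional[Callable[[Block], Block]] = None,
) -> str:
    # Blocks are assumed to already be in reading order.
    cleaned = [clean_block(b) for b in blocks]
    if llm_fix is not None:
        # Optional quality pass, applied block by block.
        cleaned = [llm_fix(b) for b in cleaned]
    # Combine the formatted blocks into one document.
    return "\n\n".join(format_block(b) for b in cleaned)
```

The point of the sketch is only the structure: each stage is small and specialised, and the LLM step is an optional hook and not the core of the pipeline.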
NB: I'm not a maintainer of the project.
Marker is great, but unfortunately, the idea of a heuristic pipeline with multiple fine-tuned specialized models ignores the bitter lesson.
I only see a future for PDF extraction that uses general-purpose vision language models (which is why I maintain this more opinionated package).
One important thing to be aware of about Marker:
Commercial usage: Our model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $2M funding/revenue) and our code is GPL. For broader commercial licensing or to remove GPL requirements, visit our pricing page here.