Request for Slimmed-Down PDF Extraction Interface (Exclude Unused OCR Addons)

Open dvirginz opened this issue 10 months ago • 2 comments

Hi Unstructured team,

First of all, thanks for the great work on this library!

We’re currently using unstructured for PDF extraction, but not relying on any OCR-related functionality. However, we’ve noticed that the PDF dependencies, especially those related to OCR (like Tesseract, poppler, etc.), significantly inflate our Docker image size—by several GBs.

For our use case, this additional bloat isn’t necessary and is becoming a challenge for deployment.

Would it be possible to provide a slimmer version of the PDF interface or a way to install it without the OCR-heavy dependencies? For example:

• A minimal set of requirements for “text-only” PDF extraction.
• Optional installation flags/extras to include or exclude OCR support.
• A stripped-down Dockerfile as a reference for lean builds.
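To make the "optional extras" idea concrete, here is a rough sketch of how OCR support could be gated at runtime so that a text-only install never needs the heavy packages. This is only an illustration using the standard library: the module names in `OCR_MODULES` and the helpers `missing_modules` and `choose_strategy` are hypothetical and do not reflect how unstructured is organized today.

```python
import importlib.util

# Illustrative assumption: Python packages an OCR code path might need.
# This is NOT unstructured's actual dependency list.
OCR_MODULES = ("pytesseract", "pdf2image")


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def choose_strategy(want_ocr, modules=OCR_MODULES):
    """Return "ocr" only if it was requested and its modules are installed.

    A text-only caller never touches the OCR modules, so they can be left
    out of the image entirely.
    """
    if not want_ocr:
        return "text-only"
    missing = missing_modules(modules)
    if missing:
        # Fail with a clear message instead of a late ImportError.
        raise ImportError(
            "OCR was requested but these modules are not installed: "
            + ", ".join(missing)
        )
    return "ocr"


if __name__ == "__main__":
    print(choose_strategy(want_ocr=False))
```

With something along these lines, the heavy packages would only be required when OCR is explicitly requested, and a lean Docker image could skip them.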

We’d be happy to help test or contribute to this effort if needed.

Thanks!

dvirginz, May 17 '25 05:05

@dvirginz Related issue https://github.com/Unstructured-IO/unstructured/issues/2128

skulltech, May 24 '25 06:05

I bumped into the same problem. It would be super cool to have the trimmed down version without the heavy deps.

lordsoffallen, Jul 17 '25 09:07