
Text inside images in a PDF is not recognized

Open yashkassa opened this issue 10 months ago • 3 comments

I was trying to extract text from a PDF. Only the regular text was extracted; text that appears inside images in the PDF was not picked up at all!

yashkassa avatar Apr 08 '25 09:04 yashkassa

Hey @yashkassa, markitdown currently doesn't support extracting text that's embedded within images inside PDFs, as it doesn't include OCR capabilities.

If your PDF contains scanned pages or images with text, you'll need to use an OCR tool alongside markitdown.
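
For illustration, here is a minimal sketch of what "an OCR tool alongside markitdown" could look like. It assumes `pdf2image` (which needs Poppler) and `pytesseract` (which needs the Tesseract binary) are installed; neither is part of markitdown. The helper names `looks_scanned`, `ocr_pdf`, and `pdf_to_text`, and the 20-character threshold, are made up for this example.

```python
def looks_scanned(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): almost no extracted text means a scanned PDF."""
    return len(text.strip()) < min_chars


def ocr_pdf(path: str, dpi: int = 300) -> str:
    """Render every page to an image and OCR it with Tesseract."""
    # Third-party imports are kept local so the heuristic above
    # can be used without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(path, dpi=dpi)  # list of PIL images
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)


def pdf_to_text(path: str) -> str:
    """Try markitdown first; fall back to OCR if it found almost nothing."""
    from markitdown import MarkItDown

    text = MarkItDown().convert(path).text_content
    return ocr_pdf(path) if looks_scanned(text) else text
```

Note that this only helps with fully scanned PDFs. A PDF that mixes normal text with a few text-bearing images would pass the heuristic and skip OCR entirely.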

emreyesilyurt avatar Apr 08 '25 10:04 emreyesilyurt

ssssssssss

yyuyu3545 avatar Apr 09 '25 04:04 yyuyu3545

Hi, I’d like to work on this issue and add OCR support to handle text embedded within images in PDFs. I plan to use Tesseract or a similar tool to extract such text and integrate it with the existing PDF processing logic.

I’ve already started exploring the codebase and located where the PDF parsing happens. I’ll extract image data from the PDF, run OCR, and merge the output with the current text extraction flow.
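
As a rough sketch of that extract, OCR, merge flow (not the actual markitdown code): the example below uses PyMuPDF (`fitz`) to pull embedded images out of each page, purely for illustration, and is not tied to whichever parser markitdown uses internally. It also assumes `pytesseract` and Pillow are installed. `merge_page_text` and `extract_with_image_ocr` are hypothetical names.

```python
import io


def merge_page_text(native_text: str, ocr_snippets: list[str]) -> str:
    """Append non-empty OCR snippets to a page's native text, skipping exact duplicates."""
    parts = [native_text.strip()] if native_text.strip() else []
    for snippet in ocr_snippets:
        snippet = snippet.strip()
        if snippet and snippet not in parts:
            parts.append(snippet)
    return "\n\n".join(parts)


def extract_with_image_ocr(path: str) -> str:
    """Combine each page's native text with OCR of its embedded images."""
    import fitz  # PyMuPDF, an assumption for this sketch
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            snippets = []
            for img in page.get_images(full=True):
                xref = img[0]  # first tuple element is the image's xref
                data = doc.extract_image(xref)["image"]  # raw image bytes
                image = Image.open(io.BytesIO(data))
                snippets.append(pytesseract.image_to_string(image))
            pages.append(merge_page_text(page.get_text(), snippets))
    return "\n\n".join(pages)
```

Here the OCR output is simply appended after each page's native text. Placing it at the image's position in the reading order would need the image bounding boxes and is left out of this sketch.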

Could you please assign this issue to me? Thanks!

achalcipher avatar Apr 10 '25 08:04 achalcipher

@achalcipher Just go ahead and fix it, make sure the tests pass, and open a PR.

wll8 avatar Aug 08 '25 04:08 wll8