markitdown: text inside images in a PDF is not recognized

I was trying to extract text from a PDF. Only the regular text was extracted; text that appears inside images in the PDF is not picked up at all.

Apr 08 '25 09:04 yashkassa

Hey @yashkassa, markitdown currently doesn't support extracting text that's embedded in images inside PDFs, as it doesn't include OCR capabilities.

If your PDF contains scanned pages or images with text, you'll need to use an OCR tool alongside markitdown.
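
A minimal sketch of that workaround, assuming `pdf2image` (which needs the poppler binaries) and `pytesseract` (which needs the Tesseract binary) are installed next to markitdown. The helper names `ocr_pdf_pages`, `combine`, and `pdf_to_markdown_with_ocr` are made up for illustration and are not part of markitdown:

```python
from typing import Callable, List


def ocr_pdf_pages(pdf_path: str, dpi: int = 300) -> List[str]:
    """Rasterize every page and OCR it. Assumes pdf2image and pytesseract."""
    # Imported lazily so the module still loads without the OCR stack.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(page) for page in pages]


def combine(markdown_text: str, ocr_pages: List[str]) -> str:
    """Append the non-empty OCR results, one labeled section per page."""
    parts = [markdown_text.strip()] if markdown_text.strip() else []
    for number, page_text in enumerate(ocr_pages, start=1):
        cleaned = page_text.strip()
        if cleaned:
            parts.append(f"## OCR text, page {number}\n\n{cleaned}")
    return "\n\n".join(parts)


def pdf_to_markdown_with_ocr(
    pdf_path: str,
    ocr: Callable[[str], List[str]] = ocr_pdf_pages,
) -> str:
    """Run markitdown first, then add OCR text for the same file."""
    from markitdown import MarkItDown

    text = MarkItDown().convert(pdf_path).text_content
    return combine(text, ocr(pdf_path))
```

Calling `pdf_to_markdown_with_ocr("scan.pdf")` (hypothetical file name) returns the markitdown output followed by the OCR sections. Because whole pages are rasterized, text that markitdown already extracted will probably show up again in the OCR sections, so this is best suited to PDFs that are mostly scans.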

Apr 08 '25 10:04 emreyesilyurt

Hi, I’d like to work on this issue and add OCR support to handle text embedded within images in PDFs. I plan to use Tesseract or a similar tool to extract such text and integrate it with the existing PDF processing logic.

I’ve already started exploring the codebase and located where the PDF parsing happens. I’ll extract image data from the PDF, run OCR, and merge the output with the current text extraction flow.
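
Roughly, the flow I have in mind looks like the sketch below. It is only an outline under assumptions: PyMuPDF (`fitz`) and `pytesseract` are stand-in choices, and the function names are hypothetical, not markitdown's actual internals.

```python
import io
from typing import Callable, Dict, Iterable, List, Tuple


def iter_embedded_images(pdf_path: str) -> Iterable[Tuple[int, bytes]]:
    """Yield (page_index, image_bytes) for each image embedded in the PDF."""
    import fitz  # PyMuPDF, assumed here; imported lazily

    with fitz.open(pdf_path) as doc:
        for page_index, page in enumerate(doc):
            for image_info in page.get_images(full=True):
                xref = image_info[0]  # first field is the image's xref
                yield page_index, doc.extract_image(xref)["image"]


def ocr_image_bytes(image_bytes: bytes) -> str:
    """OCR one image. Assumes Pillow, pytesseract, and the Tesseract binary."""
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))


def ocr_text_by_page(
    images: Iterable[Tuple[int, bytes]],
    ocr: Callable[[bytes], str] = ocr_image_bytes,
) -> Dict[int, List[str]]:
    """Group the non-empty OCR results by page index."""
    by_page: Dict[int, List[str]] = {}
    for page_index, image_bytes in images:
        text = ocr(image_bytes).strip()
        if text:
            by_page.setdefault(page_index, []).append(text)
    return by_page


def merge_page(page_text: str, ocr_snippets: List[str]) -> str:
    """Append a page's OCR snippets after its normally extracted text."""
    return "\n\n".join([page_text.strip(), *ocr_snippets]).strip()
```

Passing the OCR step in as a callable keeps the merge logic testable without Tesseract installed, and would allow swapping in a different OCR backend later.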

Could you please assign this issue to me? Thanks!

Apr 10 '25 08:04 achalcipher

@achalcipher Go ahead and solve the problem directly, make sure the tests pass, and open a PR.

Aug 08 '25 04:08 wll8