Text inside images in a PDF is not recognized
I was trying to extract text from a PDF, but only the regular text was extracted. Text that appears inside images in the PDF is not extracted at all.
Hey @yashkassa, markitdown currently doesn't support extracting text that's embedded within images inside PDFs, as it doesn't include OCR capabilities.
If your PDF contains scanned pages or images with text, you'll need to use an OCR tool alongside markitdown.
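As a rough illustration of that workaround, here is a minimal sketch that tries markitdown first and falls back to OCR when almost no text comes back. It assumes the third-party packages `pdf2image` (which needs Poppler) and `pytesseract` (which needs the Tesseract binary) are installed. The helper names `looks_scanned` and `convert_with_ocr_fallback` and the 20-character threshold are made up for this example and are not part of markitdown.

```python
def looks_scanned(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): almost no extracted text suggests a scanned PDF."""
    # Ignore whitespace so pages of blank lines do not count as text.
    return len("".join(text.split())) < min_chars


def convert_with_ocr_fallback(pdf_path: str) -> str:
    """Convert with markitdown, then OCR every page if nothing useful was extracted."""
    # Third-party imports stay local so looks_scanned() works without them.
    from markitdown import MarkItDown
    from pdf2image import convert_from_path  # renders pages to PIL images
    import pytesseract

    text = MarkItDown().convert(pdf_path).text_content
    if not looks_scanned(text):
        return text

    # Render each page at 300 DPI and run Tesseract on the rendered image.
    pages = convert_from_path(pdf_path, dpi=300)
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)
```

Note that this only helps with fully scanned PDFs. A document that mixes normal text with text inside images would pass the heuristic and skip OCR entirely.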
Hi, I’d like to work on this issue and add OCR support to handle text embedded within images in PDFs. I plan to use Tesseract or a similar tool to extract such text and integrate it with the existing PDF processing logic.
I’ve already started exploring the codebase and located where the PDF parsing happens. I’ll extract image data from the PDF, run OCR, and merge the output with the current text extraction flow.
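The steps above (extract image data, run OCR, merge with the existing text) could look roughly like the sketch below. It uses PyMuPDF (`fitz`), Pillow, and `pytesseract` purely for illustration and does not reflect markitdown's actual PDF converter. The function names `merge_page_text`, `tesseract_ocr`, and `extract_text_with_ocr` are hypothetical.

```python
import io
from typing import Callable, Iterable


def merge_page_text(native_text: str, ocr_texts: Iterable[str]) -> str:
    """Append non-empty OCR results after the page's natively extracted text."""
    parts = [native_text.strip()]
    parts.extend(t.strip() for t in ocr_texts)
    return "\n\n".join(p for p in parts if p)


def tesseract_ocr(image_bytes: bytes) -> str:
    """Run Tesseract on raw image bytes (needs Pillow, pytesseract, and Tesseract)."""
    from PIL import Image
    import pytesseract

    # Some embedded image formats may not be readable by Pillow; not handled here.
    return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))


def extract_text_with_ocr(
    pdf_path: str, ocr: Callable[[bytes], str] = tesseract_ocr
) -> str:
    """Extract native text per page, OCR each embedded image, and merge both."""
    import fitz  # PyMuPDF, chosen for this sketch only

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            native = page.get_text()
            ocr_texts = []
            for img in page.get_images(full=True):
                xref = img[0]  # first tuple entry is the image's xref number
                image_bytes = doc.extract_image(xref)["image"]
                ocr_texts.append(ocr(image_bytes))
            pages.append(merge_page_text(native, ocr_texts))
    return "\n\n".join(p for p in pages if p)
```

Passing the OCR step in as a callable keeps the merge logic testable without Tesseract installed and would make it easy to swap in a different OCR engine later.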
Could you please assign this issue to me? Thanks!
@achalcipher Go ahead and solve the problem directly, make sure the tests pass, and open a PR.