ChatGPT OCR results are generated in different languages

Open • tanreinama opened this issue 1 year ago • 1 comment

Japanese text in an image is translated into English in the output instead of being kept in Japanese.

from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("example.jpg")  # image containing Japanese text
print(result.text_content)  # the output comes back in English

In some cases, the entire document will be in English, while in other cases only part of the document (only the title) will be in English.

Depending on the requirements of your RAG pipeline, this may not be desirable, so it would be better to be able to specify the output language, or to keep the output in the original language found in the image.

Feb 16 '25 05:02 tanreinama

Hi, I was looking into this issue and couldn't find anything related to OCR in the code. Based on my understanding, the library processes the image by passing it to the provided LLM along with a prompt. If no custom prompt is given, it defaults to:

"Write a detailed caption for this image."

Since the prompt is in English, the LLM likely assumes the response should also be in English. This might explain why captions are always generated in English, even if the image contains text in another language.

A way to address this could be to allow users to specify a preferred language, or to check whether the LLM itself supports automatic language detection and leverage that if possible.
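As a rough sketch of the first option, the default prompt could be extended with a language instruction and passed in as the custom prompt. The helper names `build_caption_prompt` and `convert_image` are hypothetical, and passing the prompt through an `llm_prompt` keyword on `convert` is an assumption about how the custom prompt is supplied, so it should be checked against the installed version.

```python
from typing import Any, Optional

# Default prompt quoted in this thread.
DEFAULT_PROMPT = "Write a detailed caption for this image."


def build_caption_prompt(output_language: Optional[str] = None) -> str:
    """Build a caption prompt that pins the output language.

    Hypothetical helper: with a language name, ask for that language;
    without one, ask the model to keep the language found in the image.
    """
    if output_language:
        return f"{DEFAULT_PROMPT} Respond in {output_language}."
    return (
        f"{DEFAULT_PROMPT} Transcribe any visible text in its original "
        "language and write the caption in that same language. "
        "Do not translate."
    )


def convert_image(md: Any, path: str, output_language: Optional[str] = None) -> Any:
    """Convert an image with a language-aware prompt.

    Assumption: `md` is a MarkItDown instance whose `convert` accepts an
    `llm_prompt` keyword as the custom prompt.
    """
    return md.convert(path, llm_prompt=build_caption_prompt(output_language))
```

With the `md` object from the original report, `convert_image(md, "example.jpg", "Japanese")` would request a Japanese caption, while omitting the language asks the model to keep whatever language the image uses. Whether the model follows the instruction reliably is untested here.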

I’d love to work on this issue and implement a fix! Let me know if this approach makes sense or if you have any suggestions.

Feb 18 '25 05:02 Si-ddhartha