Docling support for right-to-left languages

I tried a PDF file in Arabic with a few English words, but the Arabic text comes out reversed (both the order of words in a sentence and the order of letters within each word). I used pipeline_options.ocr_options.lang = ["ar"] in the latest version. Does Docling support right-to-left languages and mixed-direction documents?

Nov 06 '24 06:11 abedkhooli

@abedkhooli Yes, we will need to look into this carefully. Could you provide us with a simple, programmatic example? I believe I will have to update the docling-parse backend for that.

Nov 06 '24 09:11 PeterStaar-IBM

We can use this as an example of a scanned PDF: https://github.com/ocrmypdf/OCRmyPDF/issues/1157#issuecomment-1762851062

Nov 06 '24 09:11 cau-git

The test PDF file in ocrmypdf/OCRmyPDF/issues/1157 works in the current version of ocrmypdf (using the tesseract-ocr-ara package): words in a sentence and letters in a word are in the right order. Here's the command used:

    ocrmypdf --sidecar output.txt --force-ocr -l ara+eng default.pdf result.pdf

Here default.pdf is the test file. I put together a Colab notebook on how to use Tesseract in Docling. It works in general, but the quality is not great. There may be more tweaks that give better results.

Nov 06 '24 12:11 abedkhooli

@abedkhooli Thanks for the info! Let us look into it carefully!

Nov 06 '24 13:11 PeterStaar-IBM

test_doc.pdf

Using the same code, this PDF (live text) fails in Docling (reversed words and letters), but OCRmyPDF gets it right.
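For anyone trying to confirm the same symptom, here is a minimal standard-library sketch of what "reversed words and letters" looks like and a naive way to undo it. It assumes the extracted line is in visual order (every character reversed relative to reading order, with Latin runs left intact). The sample words and the names rtl_ratio and to_logical_order are made up for illustration. This is not the Unicode Bidirectional Algorithm and not what docling-parse does internally.

```python
import re
import unicodedata

# Sample words used only for illustration (not taken from the test PDFs).
HELLO = "مرحبا"
WORLD = "عالم"

# The sentence in logical (reading) order: Arabic, an English word, Arabic.
LOGICAL_SAMPLE = f"{HELLO} PDF {WORLD}"

# The same sentence in visual order, as a parser that ignores text direction
# might emit it: Arabic words swapped and letter-reversed, "PDF" untouched.
VISUAL_SAMPLE = f"{WORLD[::-1]} PDF {HELLO[::-1]}"

# Runs of ASCII letters/digits, optionally joined by spaces or simple punctuation.
_LTR_RUN = re.compile(r"[A-Za-z0-9]+(?:[ ._-]+[A-Za-z0-9]+)*")


def rtl_ratio(text: str) -> float:
    """Fraction of alphabetic characters with a strong RTL bidi class (R or AL)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    rtl = sum(unicodedata.bidirectional(ch) in ("R", "AL") for ch in letters)
    return rtl / len(letters)


def to_logical_order(visual_line: str) -> str:
    """Naive repair: reverse the whole line, then re-reverse the Latin runs."""
    reversed_line = visual_line[::-1]
    return _LTR_RUN.sub(lambda m: m.group(0)[::-1], reversed_line)


if __name__ == "__main__":
    print("mostly RTL:", rtl_ratio(VISUAL_SAMPLE) > 0.5)
    print("repaired matches logical order:",
          to_logical_order(VISUAL_SAMPLE) == LOGICAL_SAMPLE)
```

Reversing the whole line restores both word order and letter order for the Arabic parts, which matches the symptom described above. The extra pass over Latin runs is why mixed documents need more care than a plain string reversal.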

Nov 06 '24 14:11 abedkhooli

Solved here: https://github.com/DS4SD/docling-parse/pull/73

Dec 10 '24 15:12 PeterStaar-IBM

I still have a problem with a Persian-language PDF file. Please advise: how was this problem solved in the newer version? Regards

Dec 14 '24 18:12 hamedf62