Support for right to left languages

Open abedkhooli opened this issue 1 year ago • 5 comments

I tried a PDF file in Arabic with a few English words, but the extracted Arabic text is reversed (both the order of words in a sentence and the order of letters within each word). I set pipeline_options.ocr_options.lang = ["ar"] in the latest version. Does Docling support right-to-left languages and mixed-direction documents?
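To make the symptom easier to report, here is a minimal diagnostic sketch using only the Python standard library. It is not part of Docling and not the eventual fix; the helper names (is_rtl_char, rtl_ratio, looks_fully_reversed) are hypothetical. It relies on the fact that reversing a whole string flips both the word order and the letter order, which matches the behaviour described above. For mixed Arabic/English text a plain reversal would also flip the English words, so treat the check as a rough heuristic only.

```python
import unicodedata


def is_rtl_char(ch: str) -> bool:
    """True for characters with a strong right-to-left bidi class.

    "R" covers scripts such as Hebrew, "AL" covers Arabic-script letters
    (Arabic, Persian, Urdu).
    """
    return unicodedata.bidirectional(ch) in ("R", "AL")


def rtl_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are right-to-left."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_rtl_char(ch) for ch in letters) / len(letters)


def looks_fully_reversed(extracted: str, expected: str) -> bool:
    """True if `extracted` is exactly `expected` read backwards.

    A full string reversal swaps the order of words AND the order of
    letters inside each word, i.e. the symptom reported in this issue.
    """
    return extracted != expected and extracted[::-1] == expected


# Hypothetical sample: the known-good text and a simulated bad extraction.
expected = "مرحبا بالعالم"
extracted = expected[::-1]

print(rtl_ratio(expected))
print(looks_fully_reversed(extracted, expected))
```

Comparing a short known sentence against the extracted output this way gives a yes/no answer that can be attached to a bug report.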

abedkhooli avatar Nov 06 '24 06:11 abedkhooli

@abedkhooli Yes, we will need to look into this carefully. Could you provide us with a simple, programmatic example? I believe I will have to update the docling-parse backend for that.

PeterStaar-IBM avatar Nov 06 '24 09:11 PeterStaar-IBM

We can use this as an example for scanned PDF: https://github.com/ocrmypdf/OCRmyPDF/issues/1157#issuecomment-1762851062

cau-git avatar Nov 06 '24 09:11 cau-git

The test PDF file in ocrmypdf/OCRmyPDF/issues/1157 works in the current version of ocrmypdf (using the tesseract-ocr-ara package): words in a sentence and letters in a word are in the right order. Here's the command used, where default.pdf is the test file:

ocrmypdf --sidecar output.txt --force-ocr -l ara+eng default.pdf result.pdf

I put together a Colab notebook on how to use Tesseract in Docling. It works in general, but the quality is not great. There may be more tweaks that give better results.
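For a programmatic baseline, the ocrmypdf command quoted in this comment can be driven from Python. This is a minimal sketch that only uses the flags from that command (--sidecar, --force-ocr, -l); the helper names build_ocrmypdf_command and run_ocr are hypothetical, and running it assumes ocrmypdf and the Arabic Tesseract language data are installed.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, langs):
    """Assemble the argument list for the ocrmypdf call quoted above."""
    return [
        "ocrmypdf",
        "--sidecar", sidecar_txt,   # write the recognized text to a .txt file
        "--force-ocr",              # rasterize and OCR even if text exists
        "-l", "+".join(langs),      # e.g. ["ara", "eng"] -> "ara+eng"
        input_pdf,
        output_pdf,
    ]


def run_ocr(cmd):
    """Run the command, failing early if the executable is missing."""
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(f"{cmd[0]} is not installed or not on PATH")
    subprocess.run(cmd, check=True)


cmd = build_ocrmypdf_command("default.pdf", "result.pdf", "output.txt", ["ara", "eng"])
print(" ".join(cmd))
# Call run_ocr(cmd) to execute it; the sidecar file then holds the
# reference text to compare against Docling's output.
```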

abedkhooli avatar Nov 06 '24 12:11 abedkhooli

@abedkhooli Thanks for the info! Let us look into it carefully!

PeterStaar-IBM avatar Nov 06 '24 13:11 PeterStaar-IBM

test_doc.pdf

Using the same code, this PDF (live text) fails in Docling (reversed words and letters), but OCRmyPDF gets it right.

abedkhooli avatar Nov 06 '24 14:11 abedkhooli

Solved here: https://github.com/DS4SD/docling-parse/pull/73

PeterStaar-IBM avatar Dec 10 '24 15:12 PeterStaar-IBM

I still have a problem with a Persian-language PDF file. Please advise: how is this solved in the newer version? Regards

hamedf62 avatar Dec 14 '24 18:12 hamedf62