Support for right-to-left languages
I tried a PDF file in Arabic with a few English words, but the Arabic text comes out reversed (both the order of words in a sentence and the order of letters in each word).
I set pipeline_options.ocr_options.lang = ["ar"] in the latest version.
Does Docling support right-to-left languages and mixed-direction documents?
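For anyone who wants to confirm the same symptom on their own file, here is a minimal sketch using only the Python standard library. It assumes you already have the extracted text as a string and know what one line should read. The helper names (is_rtl_char, looks_fully_reversed) and the sample string are hypothetical and not part of Docling.

```python
import unicodedata


def is_rtl_char(ch: str) -> bool:
    """True for strongly right-to-left characters.

    Unicode bidi class "AL" covers Arabic letters (also used for Persian),
    and "R" covers other right-to-left scripts such as Hebrew.
    """
    return unicodedata.bidirectional(ch) in ("R", "AL")


def looks_fully_reversed(expected: str, extracted: str) -> bool:
    """True when `extracted` is `expected` read backwards.

    Reversing the whole string flips both the word order and the letter
    order inside each word, which is the symptom described above.
    """
    return extracted != expected and extracted == expected[::-1]


# Hypothetical sample: one line as it should read, in logical order.
expected = "مرحبا بالعالم"
# Simulates what a parser emitting characters in visual order would return.
extracted = expected[::-1]

print(looks_fully_reversed(expected, extracted))
print(any(is_rtl_char(ch) for ch in extracted))
```

This only detects a whole-line reversal of a purely right-to-left line. Mixed Arabic/English lines need proper bidirectional handling and will not match this simple check.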
@abedkhooli Yes, we will need to look into this carefully. Could you provide us with a simple, programmatic example? I believe I will have to update the docling-parse backend for that.
We can use this as an example of a scanned PDF: https://github.com/ocrmypdf/OCRmyPDF/issues/1157#issuecomment-1762851062
The test pdf file in ocrmypdf/OCRmyPDF/issues/1157 works in the current version of ocrmypdf (using the tesseract-ocr-ara package). Words in a sentence and letters in a word are in the right order. Here's the command used:
ocrmypdf --sidecar output.txt --force-ocr -l ara+eng default.pdf result.pdf
default.pdf is the test file.
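If you want to run the same comparison from a script, the command above can be wrapped as follows. This is only a sketch: it assumes ocrmypdf is installed and on PATH with the Arabic and English Tesseract language data available, and the helper names build_ocrmypdf_cmd and run_ocr are hypothetical. The flags are exactly the ones from the command above.

```python
import shutil
import subprocess
from typing import List


def build_ocrmypdf_cmd(src: str, dst: str, sidecar: str, langs: List[str]) -> List[str]:
    """Build the argument list for the ocrmypdf call shown above.

    Tesseract language codes are joined with "+", e.g. "ara+eng".
    """
    return [
        "ocrmypdf",
        "--sidecar", sidecar,   # write the recognized text to this file
        "--force-ocr",          # OCR every page, even if it already has text
        "-l", "+".join(langs),
        src,
        dst,
    ]


def run_ocr(src: str, dst: str, sidecar: str, langs: List[str]) -> None:
    """Run ocrmypdf; raises if the tool is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    subprocess.run(build_ocrmypdf_cmd(src, dst, sidecar, langs), check=True)


cmd = build_ocrmypdf_cmd("default.pdf", "result.pdf", "output.txt", ["ara", "eng"])
print(" ".join(cmd))
```

Calling run_ocr("default.pdf", "result.pdf", "output.txt", ["ara", "eng"]) runs the conversion. The sidecar file output.txt can then be compared against the text Docling returns for the same PDF.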
I put together a Colab notebook on how to use Tesseract in Docling. It works in general, but the quality is not great. There may be more tweaks that give better results.
@abedkhooli Thanks for the info! Let us look into it carefully!
test_doc.pdf: using the same code, this PDF (live text) fails in Docling (reversed words and letters), but OCRmyPDF gets it right.
Solved here: https://github.com/DS4SD/docling-parse/pull/73
I still have a problem with a Persian-language PDF file. Please advise: how was this problem solved in the newer version? Regards