EasyOCR does not extract text properly
Bug
When using EasyOCR as part of the docling pipeline, no text is detected.
Steps to reproduce
File to reproduce: I simply printed the following link to PDF using Microsoft Print to PDF. By design, the resulting PDF is not machine-readable.
import io
import base64
FILE = "file.pdf"
with open(FILE, 'rb') as f:
    encoded_string = base64.b64encode(f.read())
base64_file = encoded_string.decode('utf-8')
file_bytes = base64.b64decode(base64_file )
file_bytes = io.BytesIO(file_bytes)
from docling.datamodel.base_models import InputFormat
from docling.datamodel.base_models import DocumentStream
from docling.document_converter import PdfFormatOption, DocumentConverter
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode, EasyOcrOptions
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(
    lang=["en"],
    use_gpu=False,
    download_enabled=False,
    model_storage_directory=".EasyOCR/model/",
)
pipeline_options.artifacts_path = "."
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV2DocumentBackend,
        )
    }
)
source = DocumentStream(name="xxxx.pdf", stream=file_bytes)
result = doc_converter.convert(source)
result.document.export_to_text()
'<missing-text>\n\n<missing-text>\n\n<missing-text>\n\n<missing-text>'
When using plain easyocr, the OCR is working as expected:
import easyocr
reader = easyocr.Reader(["en"], gpu=False, download_enabled=False, model_storage_directory=".EasyOCR/model") # this needs to run only once to load the model into memory
result = reader.readtext("sample.png")
result
[([[141, 27], [227, 27], [227, 63], [141, 63]], 'Code', 0.999991774559021),
([[308, 27], [468, 27], [468, 65], [308, 65]],
'Markdown',
0.9905806747834137),
([[573, 29], [685, 29], [685, 65], [573, 65]], 'Run AlI', 0.6046530428334653),
([[765, 29], [875, 29], [875, 65], [765, 65]], 'Restart', 0.9854204139007381),
([[906, 38], [946, 38], [946, 62], [906, 62]], 'Ex', 0.32902462879268796),
([[953, 29], [1207, 29], [1207, 71], [953, 71]],
'Clear AIl Outputs',
[...]
Docling version
2.4.0
Python version
3.11
just a note to say I'm experiencing the same issue
I was able to narrow down the behavior. If I set force_full_page_ocr=True EasyOCR is extracting the text correctly (i.e., performing OCR). However, since this is a PDF that is not machine-readable at all (i.e., only consisting of pictures), I was expecting that docling figures this out automatically and applies OCR to the pages by default (since pipeline_options.ocr_options.bitmap_area_threshold = 0.05 is the default).
I find it quite counterintuitive that I had to force the OCR in this scenario. Any ideas how to fix this behaviour?
I dug into the issue a little bit and I am wondering if the code is consistent.
In pipeline_options.py Link you set the following default:
bitmap_area_threshold: float = (
    0.05  # percentage of the area for a bitmap to processed with OCR
)
In base_ocr_model.py Link you set:
BITMAP_COVERAGE_TRESHOLD = 0.75
You then proceed as follows (Link):
# return full-page rectangle if sufficiently covered with bitmaps
if self.options.force_full_page_ocr or coverage > max(
    BITMAP_COVERAGE_TRESHOLD, self.options.bitmap_area_threshold
):
Maybe I am not interpreting this correctly, but by using max you always require coverage > 0.75, regardless of whether the user specifies a smaller value via bitmap_area_threshold in the OCR options (because BITMAP_COVERAGE_TRESHOLD is hard-coded). Shouldn't it be min, so that the user can actually use the bitmap_area_threshold option to control when OCR is performed and when it is not?
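To make the question concrete, here is a small sketch of how the coverage required for full-page OCR behaves under max versus min. The constant mirrors the value quoted above; the helper name full_page_threshold is made up for illustration and is not part of docling.

```python
# Value quoted from base_ocr_model.py (identifier spelled as in the source).
BITMAP_COVERAGE_TRESHOLD = 0.75


def full_page_threshold(bitmap_area_threshold: float, combine=max) -> float:
    """Hypothetical helper: coverage above which full-page OCR would trigger."""
    return combine(BITMAP_COVERAGE_TRESHOLD, bitmap_area_threshold)


# Compare both ways of combining the hard-coded and user-supplied values.
for user_value in (0.05, 0.5, 0.9):
    print(
        f"bitmap_area_threshold={user_value}: "
        f"max -> {full_page_threshold(user_value, max)}, "
        f"min -> {full_page_threshold(user_value, min)}"
    )
```

With max, any user value below 0.75 leaves the full-page condition unchanged; only values above 0.75 make a difference.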
Thanks for the find; force_full_page_ocr=True solved my problem.
Since all the PDFs I need to deal with are in English or Chinese, I use a simple check to detect whether docling extracted the text correctly.
The 140 most common Chinese characters account for about 50% of the characters in most Chinese articles, so I combine the text of the chunks until I have 1000 characters and then count how often those characters occur.
If the text is wrong, I add the force_full_page_ocr option and let docling re-process the document.
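A rough sketch of such a check might look like the following. It covers only the Chinese case. The character set is a short illustrative subset (the actual list of about 140 characters is not given in this thread), and the function names and the 0.1 cutoff are assumptions that would need tuning on real documents.

```python
# Illustrative subset of very common Chinese characters (assumption: the
# commenter's real list of ~140 characters is not shown in the thread).
COMMON_CHARS = set("的一是不了人我在有他这中大来上国个到说们为子和你地出道也时年得就那要下以生会自")


def common_char_ratio(text: str, sample_size: int = 1000) -> float:
    """Share of the first `sample_size` non-whitespace characters found in COMMON_CHARS."""
    sample = [ch for ch in text if not ch.isspace()][:sample_size]
    if not sample:
        return 0.0
    return sum(ch in COMMON_CHARS for ch in sample) / len(sample)


def needs_forced_ocr(text: str, min_ratio: float = 0.1) -> bool:
    """Assumed rule: too few common characters suggests the extraction failed."""
    return common_char_ratio(text) < min_ratio
```

A document whose export consists only of `<missing-text>` placeholders scores 0.0 and would be flagged for a second pass with force_full_page_ocr enabled. Correctly extracted English text would also be flagged by this sketch, so an English document needs a separate check.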
It appears that this issue is addressed in multiple places and can be closed.
- Using pipeline_options.ocr_options.force_full_page_ocr = True (or --force-ocr on the CLI) in case you have a PDF file that comes out badly.
- Using the OCR language selection (--ocr-lang) for EasyOcr or other engines.
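For reference, the first workaround could be wired up roughly as follows. This is a minimal sketch that reuses the classes from the reproduction script above; the function name build_forced_ocr_converter is made up for illustration, and it assumes a docling version in which force_full_page_ocr is available on the OCR options, as reported in this thread.

```python
def build_forced_ocr_converter():
    """Build a DocumentConverter that runs OCR on every page in full."""
    # Imports are local so that defining this function does not require docling.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # force_full_page_ocr makes the pipeline rely on OCR text for the whole page.
    pipeline_options.ocr_options = EasyOcrOptions(lang=["en"], force_full_page_ocr=True)

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


# Usage (requires docling and the EasyOCR models to be available):
# result = build_forced_ocr_converter().convert("file.pdf")
# print(result.document.export_to_text())
```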
@cau-git In my view the underlying issue is not resolved. 🤔
In the case at hand, we have a file that is not machine-readable at all (every page should be an image). Therefore, I would expect that setting pipeline_options.ocr_options.bitmap_area_threshold = 0.05 should be enough to trigger the OCR step. Why would I need force_full_page_ocr in addition?
@simonschoe You will need force_full_page_ocr if you want to ensure only text cells from the OCR engine are processed. That is the case for example if your PDF does not contain a bitmap (but garbled text is encoded), or if your PDF contains bitmaps but there is also a pre-OCRed text layer on top, which will otherwise be preferred.
@cau-git Could you briefly confirm that we are understanding the behavior of the OCR settings correctly? :)
- perform ocr on bitmap page regions which on aggregate exceed 'bitmap_area_threshold' (i.e., 5% of page pixel space by default)
- perform ocr on full page if 'force_full_page_ocr' is true or bitmap page regions exceed 0.75 on aggregate (see: https://github.com/docling-project/docling/blob/v2.23.0/docling/models/base_ocr_model.py#L29)
@simonschoe My understanding is that 0.75 is merely a reasonable minimum coverage to trigger full page OCR. Consider the case where bitmap_area_threshold = 0.9, then full page OCR should not be triggered on less than 0.9 coverage, hence the max calculation. When coverage is not big enough to justify a full page OCR, then it gets checked against bitmap_area_threshold in the following branch to decide on doing regular OCR.
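The branching described here can be modeled roughly as below. This is a simplified sketch based only on the snippet and explanation in this thread, not the actual docling code; the function name choose_ocr_mode and its string return values are hypothetical.

```python
# Value quoted earlier in the thread (identifier spelled as in the source).
BITMAP_COVERAGE_TRESHOLD = 0.75


def choose_ocr_mode(
    coverage: float,
    bitmap_area_threshold: float = 0.05,
    force_full_page_ocr: bool = False,
) -> str:
    """Hypothetical model of how bitmap coverage maps to an OCR decision."""
    # Full-page OCR: forced, or coverage exceeds both thresholds.
    if force_full_page_ocr or coverage > max(
        BITMAP_COVERAGE_TRESHOLD, bitmap_area_threshold
    ):
        return "full_page"
    # Regular OCR on the individual bitmap regions.
    if coverage > bitmap_area_threshold:
        return "bitmap_regions"
    # Coverage too low: no OCR at all.
    return "no_ocr"


for coverage in (0.03, 0.5, 0.8):
    print(coverage, choose_ocr_mode(coverage))
```

Under this model, the default threshold of 0.05 governs whether bitmap regions are OCRed at all, while full-page OCR additionally requires more than 0.75 coverage unless it is forced.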