EasyOCR does not extract text properly

Open simonschoe opened this issue 1 year ago • 4 comments

Bug

When using EasyOCR as part of the docling pipeline, no text is detected.

Steps to reproduce

File to reproduce: I simply printed the following link to PDF using Microsoft Print to PDF. This way, the PDF is by design not machine-readable.

import io
import base64

FILE = "file.pdf"
with open(FILE, "rb") as f:
    encoded_string = base64.b64encode(f.read())
base64_file = encoded_string.decode("utf-8")
file_bytes = base64.b64decode(base64_file)
file_bytes = io.BytesIO(file_bytes)

from docling.datamodel.base_models import InputFormat
from docling.datamodel.base_models import DocumentStream
from docling.document_converter import PdfFormatOption, DocumentConverter
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode, EasyOcrOptions
from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend

pipeline_options = PdfPipelineOptions()

pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(
    lang=["en"],
    use_gpu=False,
    download_enabled=False,
    model_storage_directory=".EasyOCR/model/",
)
pipeline_options.artifacts_path = "."

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=DoclingParseV2DocumentBackend,
        )
    }
)

source = DocumentStream(name="xxxx.pdf", stream=file_bytes)
result = doc_converter.convert(source)
result.document.export_to_text()
'<missing-text>\n\n<missing-text>\n\n<missing-text>\n\n<missing-text>'

When using plain easyocr, the OCR is working as expected:

import easyocr

reader = easyocr.Reader(["en"], gpu=False, download_enabled=False, model_storage_directory=".EasyOCR/model") # this needs to run only once to load the model into memory
result = reader.readtext("sample.png")
result
[([[141, 27], [227, 27], [227, 63], [141, 63]], 'Code', 0.999991774559021),
 ([[308, 27], [468, 27], [468, 65], [308, 65]],
  'Markdown',
  0.9905806747834137),
 ([[573, 29], [685, 29], [685, 65], [573, 65]], 'Run AlI', 0.6046530428334653),
 ([[765, 29], [875, 29], [875, 65], [765, 65]], 'Restart', 0.9854204139007381),
 ([[906, 38], [946, 38], [946, 62], [906, 62]], 'Ex', 0.32902462879268796),
 ([[953, 29], [1207, 29], [1207, 71], [953, 71]],
  'Clear AIl Outputs',
[...]

Docling version

2.4.0

Python version

3.11

simonschoe avatar Nov 11 '24 09:11 simonschoe

just a note to say I'm experiencing the same issue

JTCorrin avatar Nov 11 '24 12:11 JTCorrin

I was able to narrow down the behavior. If I set force_full_page_ocr=True EasyOCR is extracting the text correctly (i.e., performing OCR). However, since this is a PDF that is not machine-readable at all (i.e., only consisting of pictures), I was expecting that docling figures this out automatically and applies OCR to the pages by default (since pipeline_options.ocr_options.bitmap_area_threshold = 0.05 is the default).

I find it quite counterintuitive that I had to force the OCR in this scenario. Any ideas how to fix this behaviour?

simonschoe avatar Nov 19 '24 15:11 simonschoe

I dug into the issue a little bit and I am wondering if the code is consistent.

In pipeline_options.py you set the following default:

bitmap_area_threshold: float = (
        0.05  # percentage of the area for a bitmap to processed with OCR
    )

In base_ocr_model.py you set:

BITMAP_COVERAGE_TRESHOLD = 0.75

You then proceed:

# return full-page rectangle if sufficiently covered with bitmaps
        if self.options.force_full_page_ocr or coverage > max(
            BITMAP_COVERAGE_TRESHOLD, self.options.bitmap_area_threshold
        ):

Maybe I am not interpreting this correctly, but by using max you always require coverage > 0.75, regardless of whether the user specifies a smaller value via bitmap_area_threshold in the OCR options (because BITMAP_COVERAGE_TRESHOLD is hard-coded). Shouldn't it be min, so that the user can actually use the bitmap_area_threshold option to control when OCR is performed and when it is not?

simonschoe avatar Nov 21 '24 20:11 simonschoe

Thanks for the find; force_full_page_ocr=True solved my problem.

Since all the PDFs I need to handle are in English or Chinese, I use a simple check to detect whether docling extracted correct text.

The 140 most common Chinese characters make up about 50% of the characters in most Chinese articles, so I concatenate the chunk text up to 1000 characters and then count how often those characters occur.

If the text is wrong, I just add the force_full_page_ocr option and let docling re-process the document.
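
The check described above could look roughly like the sketch below. It is not the code used in this comment: `COMMON_CHARS` is a tiny illustrative sample rather than the real list of the 140 most frequent characters, and `SAMPLE_SIZE`, `MIN_RATIO`, `common_char_ratio`, and `needs_forced_ocr` are hypothetical names and values chosen for the example.

```python
# Tiny illustrative sample of frequent Chinese characters (assumption);
# a real check would use the full top-140 list.
COMMON_CHARS = set("的一是不了人我在有他这中大来上国个到说们")
SAMPLE_SIZE = 1000  # number of characters to inspect, as in the comment above
MIN_RATIO = 0.2     # assumed cutoff; tune it on your own documents


def common_char_ratio(text: str) -> float:
    """Share of frequent characters among the first SAMPLE_SIZE non-space chars."""
    sample = [ch for ch in text if not ch.isspace()][:SAMPLE_SIZE]
    if not sample:
        return 0.0
    return sum(ch in COMMON_CHARS for ch in sample) / len(sample)


def needs_forced_ocr(text: str) -> bool:
    """True if the extracted text looks wrong and should be re-run with forced OCR."""
    return common_char_ratio(text) < MIN_RATIO


print(needs_forced_ocr("我们是在中国的人"))
print(needs_forced_ocr("<missing-text>\n\n<missing-text>"))
```

This sketch covers only Chinese text; English documents would need their own character or word list. A document flagged by `needs_forced_ocr` would then be converted a second time with force_full_page_ocr enabled.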

bash99 avatar Dec 06 '24 03:12 bash99

It appears that this issue is addressed in multiple places and can be closed.

  1. Using pipeline_options.ocr_options.force_full_page_ocr = True (or --force-ocr on the CLI) in case you have a PDF file that comes out badly.
  2. Using the OCR language selection (--ocr-lang) for EasyOcr or other engines.

cau-git avatar Dec 09 '24 15:12 cau-git

@cau-git In my view the underlying issue is not resolved. 🤔

In the case at hand, we have a file that is not machine-readable at all (every page should be an image). Therefore, I would expect that setting pipeline_options.ocr_options.bitmap_area_threshold = 0.05 should be enough to trigger the OCR step. Why would I need force_full_page_ocr in addition?

simonschoe avatar Dec 13 '24 12:12 simonschoe

@simonschoe You will need force_full_page_ocr if you want to ensure only text cells from the OCR engine are processed. That is the case for example if your PDF does not contain a bitmap (but garbled text is encoded), or if your PDF contains bitmaps but there is also a pre-OCRed text layer on top, which will otherwise be preferred.

cau-git avatar Dec 13 '24 13:12 cau-git

@cau-git Could you briefly confirm that we are understanding the behavior of the OCR settings correctly? :)

  1. Perform OCR on bitmap page regions that in aggregate exceed 'bitmap_area_threshold' (i.e., 5% of the page area by default).
  2. Perform OCR on the full page if 'force_full_page_ocr' is true or the bitmap page regions in aggregate exceed 0.75 (see: https://github.com/docling-project/docling/blob/v2.23.0/docling/models/base_ocr_model.py#L29).

simonschoe avatar Mar 27 '25 08:03 simonschoe

@simonschoe My understanding is that 0.75 is merely a reasonable minimum coverage to trigger full page OCR. Consider the case where bitmap_area_threshold = 0.9, then full page OCR should not be triggered on less than 0.9 coverage, hence the max calculation. When coverage is not big enough to justify a full page OCR, then it gets checked against bitmap_area_threshold in the following branch to decide on doing regular OCR.
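
A minimal sketch of the branching discussed in this thread, written as a standalone function rather than docling's actual code. The function name `choose_ocr_mode`, its string return values, and the `combine` parameter are hypothetical and exist only to compare max with min; the 0.75 constant and the 0.05 default are the values quoted earlier in this thread.

```python
from typing import Callable

# Value quoted in this thread from base_ocr_model.py (identifier spelling kept).
BITMAP_COVERAGE_TRESHOLD = 0.75


def choose_ocr_mode(
    coverage: float,
    bitmap_area_threshold: float = 0.05,
    force_full_page_ocr: bool = False,
    combine: Callable[[float, float], float] = max,
) -> str:
    """Return "full_page", "regions", or "none" for a page.

    Hypothetical simplification: `coverage` is the fraction of the page area
    covered by bitmaps; `combine` is max (as quoted) or min (as proposed).
    """
    full_page_cutoff = combine(BITMAP_COVERAGE_TRESHOLD, bitmap_area_threshold)
    if force_full_page_ocr or coverage > full_page_cutoff:
        return "full_page"
    if coverage > bitmap_area_threshold:
        return "regions"
    return "none"


# Default threshold: a half-covered page gets region OCR, not full-page OCR.
print(choose_ocr_mode(0.5))
# Strict user threshold of 0.9 with 0.8 coverage: max skips OCR entirely...
print(choose_ocr_mode(0.8, bitmap_area_threshold=0.9))
# ...whereas min would run full-page OCR below the requested threshold.
print(choose_ocr_mode(0.8, bitmap_area_threshold=0.9, combine=min))
```

In this simplified model, switching to min would also make the "regions" branch unreachable whenever bitmap_area_threshold is at most 0.75, because the full-page cutoff would then equal the region threshold.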

drohmf avatar May 24 '25 13:05 drohmf