pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 656

Open guglie opened this issue 1 year ago • 0 comments

Bug

Converting one PDF fails with the error below; the same options work on other PDFs. It seems related to the pandas.read_csv() call on Tesseract's TSV output.

Encountered an error during conversion of document b137be2685712845d8afee55fe6327d2901290f9a852a25b3f7b19010df64e10:
Traceback (most recent call last):

  File ".../docling/pipeline/base_pipeline.py", line 149, in _build_document
    for p in pipeline_pages:  # Must exhaust!
             ^^^^^^^^^^^^^^

  File ".../docling/pipeline/base_pipeline.py", line 116, in _apply_on_pages
    yield from page_batch

  File ".../docling/models/page_assemble_model.py", line 59, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File ".../docling/models/table_structure_model.py", line 93, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File ".../docling/models/layout_model.py", line 281, in __call__
    for page in page_batch:
                ^^^^^^^^^^

  File ".../docling/models/tesseract_ocr_cli_model.py", line 140, in __call__
    df = self._run_tesseract(fname)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../docling/models/tesseract_ocr_cli_model.py", line 98, in _run_tesseract
    df = pd.read_csv(io.StringIO(decoded_data), sep="\t")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File ".../pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory

  File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows

  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows

  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status

  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error

pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 656
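
One plausible cause, not confirmed against the failing PDF: the OCR text column contains a double quote that is never closed, so the default read_csv() quoting treats the rest of the TSV as one quoted string. The sketch below reproduces that message on a hypothetical three-line TSV (the sample data and the helper names parse_default and parse_unquoted are made up for illustration, they are not docling code) and shows that passing quoting=csv.QUOTE_NONE makes pandas treat the quote as ordinary text.

```python
import csv
import io

import pandas as pd

# Hypothetical Tesseract-style TSV (assumption, not taken from the failing PDF):
# the text field of the last row opens a double quote that is never closed.
tsv = (
    "level\tconf\ttext\n"
    "5\t96\tHello\n"
    '5\t91\t"unterminated\n'
)


def parse_default(data: str) -> pd.DataFrame:
    # Same call shape as in the traceback: '"' is treated as a quote character.
    return pd.read_csv(io.StringIO(data), sep="\t")


def parse_unquoted(data: str) -> pd.DataFrame:
    # QUOTE_NONE disables quote handling, so '"' is kept as plain text.
    return pd.read_csv(io.StringIO(data), sep="\t", quoting=csv.QUOTE_NONE)


try:
    parse_default(tsv)
    default_error = None
except pd.errors.ParserError as exc:
    default_error = str(exc)

df = parse_unquoted(tsv)
print(default_error)
print(df)
```

If this is indeed the cause, the same quoting argument on the read_csv() call in _run_tesseract would likely avoid the error, since OCR output is free text with no quoting convention.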

Steps to reproduce

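from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Run the Tesseract CLI engine on every page, not only on bitmap regions.
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = ocr_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        )
    }
)

# my_pdf_path is the path to the PDF that triggers the error.
conv_res = converter.convert(Path(my_pdf_path))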

Docling version

Docling version: 2.5.2
Docling Core version: 2.4.0
Docling IBM Models version: 2.0.3
Docling Parse version: 2.0.4

Python version

Python 3.12.7

Nov 29 '24 16:11 guglie