docling
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 656
Bug
Trying to convert a PDF, I get the following error; the same options work on other PDFs.
It seems related to the pandas.read_csv() call on the TSV output of Tesseract.
Encountered an error during conversion of document b137be2685712845d8afee55fe6327d2901290f9a852a25b3f7b19010df64e10:
Traceback (most recent call last):
  File ".../docling/pipeline/base_pipeline.py", line 149, in _build_document
    for p in pipeline_pages: # Must exhaust!
             ^^^^^^^^^^^^^^
  File ".../docling/pipeline/base_pipeline.py", line 116, in _apply_on_pages
    yield from page_batch
  File ".../docling/models/page_assemble_model.py", line 59, in __call__
    for page in page_batch:
                ^^^^^^^^^^
  File ".../docling/models/table_structure_model.py", line 93, in __call__
    for page in page_batch:
                ^^^^^^^^^^
  File ".../docling/models/layout_model.py", line 281, in __call__
    for page in page_batch:
                ^^^^^^^^^^
  File ".../docling/models/tesseract_ocr_cli_model.py", line 140, in __call__
    df = self._run_tesseract(fname)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../docling/models/tesseract_ocr_cli_model.py", line 98, in _run_tesseract
    df = pd.read_csv(io.StringIO(decoded_data), sep="\t")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File ".../pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read( # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 656
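A possible explanation, not confirmed against the failing PDF: one of the words recognized by Tesseract begins with a double quote that is never closed, so pandas treats the rest of the TSV as a single quoted field and reaches end of file while still inside it. The sketch below reproduces that behaviour on a made-up, shortened TSV string (the column set and contents are hypothetical, not real Tesseract output) and shows that passing quoting=csv.QUOTE_NONE to pd.read_csv avoids the error. The helper names parse_default and parse_no_quoting are illustrative only and are not part of docling.

```python
import csv
import io

import pandas as pd

# Hypothetical, shortened stand-in for Tesseract TSV output: the word on the
# last row starts with a double quote that is never closed.
tsv = (
    "level\tconf\ttext\n"
    "5\t96\tHello\n"
    "5\t91\t\"world\n"
)


def parse_default(data: str) -> pd.DataFrame:
    # Same call shape as in the traceback: default quoting treats '"' as a quote character.
    return pd.read_csv(io.StringIO(data), sep="\t")


def parse_no_quoting(data: str) -> pd.DataFrame:
    # QUOTE_NONE tells the parser to treat '"' as an ordinary character.
    return pd.read_csv(io.StringIO(data), sep="\t", quoting=csv.QUOTE_NONE)


try:
    parse_default(tsv)
    default_failed = False
except pd.errors.ParserError:
    default_failed = True

df = parse_no_quoting(tsv)
print(default_failed)
print(df["text"].tolist())
```

If this is indeed the cause, disabling quote handling looks like a reasonable workaround, since the TSV cells are delimited by tabs alone and a quote in OCR text is just another character.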
Steps to reproduce
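from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Run the Tesseract CLI OCR engine on the full page.
ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = ocr_options
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        )
    }
)
# my_pdf_path is the path to the PDF that triggers the error.
conv_res = converter.convert(Path(my_pdf_path))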
Docling version
Docling version: 2.5.2
Docling Core version: 2.4.0
Docling IBM Models version: 2.0.3
Docling Parse version: 2.0.4
Python version
Python 3.12.7