Docling crashes on the attached docx

Open pmarcis opened this issue 1 year ago • 1 comments

Bug

Docling crashes on the attached docx file.

test_doc-small.docx

The file features examples of:

  1. standard heading styles;
  2. custom heading styles (styles that have exact outline level specified);
  3. tables;
  4. formulas;
  5. enumerations with numbers (single-level);
  6. enumerations with numbers (multi-level);
  7. enumerations with bullet-points;
  8. headers;
  9. footers;
  10. footnotes.

The file was created using Word from Microsoft Office Professional Plus 2019.
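Item 2 in the list above (styles with an explicit outline level) can be checked without Word. The following is a minimal sketch, using only the Python standard library, that lists the styles in a .docx whose definition sets w:outlineLvl in word/styles.xml. The function name list_outline_styles is hypothetical and not part of Docling. Whether these styles are what triggers the crash is not established in this report.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/styles.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def list_outline_styles(docx_file):
    """Return (style name, outline level) pairs for styles that set w:outlineLvl.

    docx_file may be a path or a binary file-like object.
    Hypothetical helper for inspection only; not a Docling API.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/styles.xml"))

    found = []
    for style in root.iter(f"{W}style"):
        name_el = style.find(f"{W}name")
        lvl_el = style.find(f"{W}pPr/{W}outlineLvl")
        if name_el is None or lvl_el is None:
            continue  # style has no name or no explicit outline level
        # w:outlineLvl is stored 0-based in the file
        found.append((name_el.get(f"{W}val"), int(lvl_el.get(f"{W}val"))))
    return found

# Example call (file name taken from this report):
#   for name, level in list_outline_styles("test_doc-small.docx"):
#       print(name, level)
```

Comparing the output for a file that converts with one that fails might help narrow down which style is involved.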

Steps to reproduce

Create a Python script, convert_with_docling.py:

#!/usr/bin/env python3

import sys
from docling.document_converter import DocumentConverter

def main():
    if len(sys.argv) != 3:
        print(f"Usage: {sys.argv[0]} <source> <output_file>")
        sys.exit(1)
    source = sys.argv[1]
    output_file = sys.argv[2]
    converter = DocumentConverter()
    result = converter.convert(source)
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(result.document.export_to_markdown())

if __name__ == "__main__":
    main()

Run:

python convert_with_docling.py test_doc-small.docx output.md

I get the following exception:

Traceback (most recent call last):
  File "./pandoc_test/convert_with_docling.py", line 18, in <module>
    main()
  File "./pandoc_test/convert_with_docling.py", line 13, in main
    result = converter.convert(source)
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/pydantic/_internal/_validate_call.py", line 38, in wrapper_function
    return wrapper(*args, **kwargs)
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/pydantic/_internal/_validate_call.py", line 111, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/document_converter.py", line 203, in convert
    return next(all_res)
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/document_converter.py", line 226, in convert_all
    for conv_res in conv_res_iter:
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/document_converter.py", line 261, in _convert
    for item in map(
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/document_converter.py", line 302, in _process_document
    conv_res = self._execute_pipeline(in_doc, raises_on_error=raises_on_error)
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/document_converter.py", line 325, in _execute_pipeline
    conv_res = pipeline.execute(in_doc, raises_on_error=raises_on_error)
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/pipeline/base_pipeline.py", line 53, in execute
    raise e
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/pipeline/base_pipeline.py", line 45, in execute
    conv_res = self._build_document(conv_res)
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/pipeline/simple_pipeline.py", line 41, in _build_document
    conv_res.document = conv_res.input._backend.convert()
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 119, in convert
    doc = self.walk_linear(self.docx_obj.element.body, self.docx_obj, doc)
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 195, in walk_linear
    self.handle_text_elements(element, docx_obj, doc)
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 324, in handle_text_elements
    self.add_header(doc, p_level, text)
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling/backend/msword_backend.py", line 371, in add_header
    self.parents[curr_level] = doc.add_heading(
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/docling_core/types/doc/document.py", line 1746, in add_heading
    section_header_item = SectionHeaderItem(
  File "~/anaconda3/envs/docling/lib/python3.10/site-packages/pydantic/main.py", line 214, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for SectionHeaderItem
level
  Input should be less than or equal to 100 [type=less_than_equal, input_value=111, input_type=int]
    For further information visit https://errors.pydantic.dev/2.10/v/less_than_equal
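
The last lines of the traceback show that SectionHeaderItem rejects level=111 because the field must be less than or equal to 100. A guard of the following kind would keep such a value inside the accepted range. This is only a sketch: clamp_heading_level is a hypothetical helper, the lower bound of 1 is an assumption not stated in the error message, and this is not presented as how Docling does or should fix the issue.

```python
MAX_LEVEL = 100  # upper bound reported in the ValidationError above
MIN_LEVEL = 1    # assumed lower bound; not confirmed by the error message


def clamp_heading_level(level: int) -> int:
    """Force a heading level into the range [MIN_LEVEL, MAX_LEVEL].

    Hypothetical helper for illustration; not a Docling API.
    """
    return max(MIN_LEVEL, min(level, MAX_LEVEL))
```

Clamping avoids the exception but hides the fact that 111 is almost certainly not a real heading depth, so finding where the value comes from is still the better fix.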

Docling version

2.18.0

Python version

Python 3.10.4

pmarcis avatar Feb 05 '25 12:02 pmarcis

Met the same problem.

Docling version 2.23.0

Python version Python 3.12.2

1 validation error for SectionHeaderItem
level
  Input should be less than or equal to 100 [type=less_than_equal, input_value=300, input_type=int]
    For further information visit https://errors.pydantic.dev/2.10/v/less_than_equal
Failed to load file static/upload/SectionHeader.docx with error: 1 validation error for SectionHeaderItem
level
  Input should be less than or equal to 100 [type=less_than_equal, input_value=300, input_type=int]
    For further information visit https://errors.pydantic.dev/2.10/v/less_than_equal. Skipping...

wzdavid avatar Feb 21 '25 02:02 wzdavid

Same for me: I encountered the same error when converting one of my docx files. The document uses the same template as other docx files that convert and save as Markdown properly.

conv_result.document.save_as_markdown(md_file, image_mode=ImageRefMode.REFERENCED)

Error during conversion: 1 validation error for SectionHeaderItem
level
  Input should be less than or equal to 100 [type=less_than_equal, input_value=311, input_type=int]
    For further information visit https://errors.pydantic.dev/2.11/v/less_than_equal

Docling version: 2.27.0

Python version: 3.12.5

BireleyX avatar Apr 24 '25 06:04 BireleyX