
Markdown document is converted incorrectly

Open kime541200 opened this issue 1 year ago • 3 comments

Bug

Converting a Markdown document produces incorrect output. ...

Steps to reproduce

The original content of the Markdown document looks something like this:

# ABCDEFG
- abc:
	- abc123:
		- abc1234:
			- abc12345:
				- a.
				- b.
		- abcd1234:
			- abcd12345:
				- a.
				- b.
- def:
	- def1234:
		- def12345。
- ghijkl

Here's the conversion process:

$ docling --from md --to md -vv /data/doc/test2.md
DEBUG:docling.backend.md_backend:MD INIT!!!
DEBUG:docling.backend.md_backend:# ABCDEFG

- abc:
  - abc123:
    - abc1234:
      - abc12345:
        - a.
        - b.
      - abcd1234:
        - abcd12345:
          - a.
          - b.
- def:
  - def1234:
    - def12345.
- ghijkl
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document test2.md
DEBUG:docling.backend.md_backend:converting Markdown...
DEBUG:docling.backend.md_backend:Some other element: <Document children=[<Heading children=[<RawText children='ABCDEFG'>]>,
 <BlankLine children=[]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc123:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abc12345:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='abcd1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='abcd12345:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>]>]>]>]>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='def:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='def1234:'>]>,
 <List children=[<ListItem children=[<Paragraph children=[<RawText children='def12345.'>]>]>]>]>]>]>,
 <ListItem children=[<Paragraph children=[<RawText children='ghijkl'>]>]>]>]>
DEBUG:docling.backend.md_backend: - Heading level 1, content: ABCDEFG
DEBUG:docling.backend.md_backend:Some other element: <BlankLine children=[]>
DEBUG:docling.backend.md_backend: - List unordered
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
INFO:docling.document_converter:Finished converting document test2.md in 2.19 sec.
INFO:docling.cli.main:writing Markdown output to test2.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 2.19 seconds.

And here's the final result I got:

$ cat test2.md
# ABCDEFG

- abc:
- def:
- ghijkl
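
To see exactly which items are missing from a result like this, one option is to compare the list-item text of the input with that of the output. The helper below is only a rough, standard-library sketch: the names `list_items` and `dropped_items` are hypothetical and not part of docling, and the matching is a simple line-based regex rather than a real Markdown parse. The sample strings are a shortened version of the document above.

```python
import re

# Matches a Markdown bullet or numbered list marker at the start of a line,
# after any indentation (tabs or spaces), and captures the item text.
_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+(.*\S)\s*$")


def list_items(markdown: str) -> list[str]:
    """Return the text of every list item, ignoring indentation and marker."""
    items = []
    for line in markdown.splitlines():
        match = _ITEM.match(line)
        if match:
            items.append(match.group(1))
    return items


def dropped_items(source: str, exported: str) -> list[str]:
    """Return source list items whose text does not appear in the export."""
    remaining = list_items(exported)
    dropped = []
    for item in list_items(source):
        if item in remaining:
            remaining.remove(item)  # consume one match so duplicates are counted
        else:
            dropped.append(item)
    return dropped


# Shortened version of the input and output shown in this report.
source = "# ABCDEFG\n- abc:\n\t- abc123:\n\t\t- abc1234:\n- def:\n\t- def1234:\n- ghijkl\n"
exported = "# ABCDEFG\n\n- abc:\n- def:\n- ghijkl\n"

print(dropped_items(source, exported))
# → ['abc123:', 'abc1234:', 'def1234:']
```

Every nested item is reported as dropped, while the three top-level items survive.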

I also tried using the Python library to convert this document, but I got the same output.

In the final result, a lot of content is missing. Did I do anything wrong?

PS: I know that inputting and outputting Markdown might be unnecessary, but in my application scenario, I'm not sure in what format users will provide their content. I need to be able to convert various content formats into Markdown.

Docling version

$ docling --version
Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0

Python version

$ python --version
Python 3.11.10

kime541200 avatar Dec 18 '24 13:12 kime541200

A similar issue occurs with inline code delimited by backticks (`).

Converting the following Markdown file and exporting it back to Markdown with DocumentConverter().convert("file.md").document.export_to_markdown() results in docling cutting off each list item's text from the first backtick onward.

Input:

# Contributing

1. Pull the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

Exported Markdown:

 # Contributing

- Pull the repository
- Create your feature branch (
- Commit your changes (
- Push to the branch (
- Open a Pull Request

Heremeus avatar Dec 19 '24 14:12 Heremeus

Did a little debugging and it seems this stems from the md_backend. The handling for marko.block.ListItem only considers the first child, ignoring any other children of the ListItem.

https://github.com/DS4SD/docling/blob/fc645ea531ddc67959640b428007851d641c923e/docling/backend/md_backend.py#L212

In my example above, element.children[0] is a Paragraph containing multiple RawText and CodeSpan children. The code reads element.children[0].children[0], so it uses only the first RawText child and ignores the rest of the Paragraph.
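
To illustrate the difference, here is a minimal sketch that does not use marko or docling at all. `Node` is a hypothetical stand-in for the parser's element classes, modelled on the debug dump earlier in this thread (containers hold a list of children, `RawText` holds a string; that `CodeSpan` also holds a plain string is an assumption). `first_child_text` mimics the behaviour described above, while `inline_text` and `walk_list` show one possible way to visit every child. This is not the fix that was actually shipped.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    """Hypothetical stand-in for a parsed Markdown element."""
    kind: str                     # e.g. "List", "ListItem", "Paragraph", "RawText", "CodeSpan"
    children: Union[list, str]    # containers hold nodes, leaves hold text


def first_child_text(item: Node) -> str:
    """Mimic the reported behaviour: follow only the first child at each level."""
    return item.children[0].children[0].children


def inline_text(node: Node) -> str:
    """Concatenate the text of every inline child, restoring backticks for code spans."""
    if isinstance(node.children, str):
        return f"`{node.children}`" if node.kind == "CodeSpan" else node.children
    return "".join(inline_text(child) for child in node.children)


def walk_list(lst: Node, depth: int = 0) -> list[str]:
    """Render every item of a list, recursing into nested lists."""
    lines = []
    for item in lst.children:
        for child in item.children:
            if child.kind == "List":
                lines.extend(walk_list(child, depth + 1))
            else:
                lines.append("  " * depth + "- " + inline_text(child))
    return lines


# A list item with inline code and a nested list, built by hand.
item = Node("ListItem", [
    Node("Paragraph", [
        Node("RawText", "Push to the branch ("),
        Node("CodeSpan", "git push origin feature/AmazingFeature"),
        Node("RawText", ")"),
    ]),
    Node("List", [
        Node("ListItem", [Node("Paragraph", [Node("RawText", "nested item")])]),
    ]),
])
top = Node("List", [item])

print(first_child_text(item))     # → Push to the branch (
print("\n".join(walk_list(top)))
# → - Push to the branch (`git push origin feature/AmazingFeature`)
# →   - nested item
```

Reading only the first child reproduces both symptoms in this thread: the text is truncated at the code span, and the nested list is never visited.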

Heremeus avatar Dec 19 '24 15:12 Heremeus

@kime541200 @Heremeus Thanks for these findings, I will look into this issue!

maxmnemonic avatar Dec 19 '24 16:12 maxmnemonic

@kime541200 a fix for the reported issue has been released with v2.19.0.

@Heremeus I moved the issue you reported in the comment section over to a new issue (#913) to be addressed separately.

vagenas avatar Feb 07 '25 14:02 vagenas