Markdown document is converted incorrectly
Bug
Converting a Markdown document with nested lists produces incorrect output.
Steps to reproduce
The original Markdown document looks something like this:
# ABCDEFG
- abc:
  - abc123:
    - abc1234:
      - abc12345:
        - a.
        - b.
      - abcd1234:
        - abcd12345:
          - a.
          - b.
- def:
  - def1234:
    - def12345.
- ghijkl
Here's the conversion process:
$ docling --from md --to md -vv /data/doc/test2.md
DEBUG:docling.backend.md_backend:MD INIT!!!
DEBUG:docling.backend.md_backend:# ABCDEFG
- abc:
  - abc123:
    - abc1234:
      - abc12345:
        - a.
        - b.
      - abcd1234:
        - abcd12345:
          - a.
          - b.
- def:
  - def1234:
    - def12345.
- ghijkl
INFO:docling.document_converter:Going to convert document batch...
INFO:docling.pipeline.base_pipeline:Processing document test2.md
DEBUG:docling.backend.md_backend:converting Markdown...
DEBUG:docling.backend.md_backend:Some other element: <Document children=[<Heading children=[<RawText children='ABCDEFG'>]>,
<BlankLine children=[]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc123:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc1234:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abc12345:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='abcd1234:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='abcd12345:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='a.'>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='b.'>]>]>]>]>]>]>]>]>]>]>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='def:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='def1234:'>]>,
<List children=[<ListItem children=[<Paragraph children=[<RawText children='def12345.'>]>]>]>]>]>]>,
<ListItem children=[<Paragraph children=[<RawText children='ghijkl'>]>]>]>]>
DEBUG:docling.backend.md_backend: - Heading level 1, content: ABCDEFG
DEBUG:docling.backend.md_backend:Some other element: <BlankLine children=[]>
DEBUG:docling.backend.md_backend: - List unordered
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
DEBUG:docling.backend.md_backend: - List item
INFO:docling.document_converter:Finished converting document test2.md in 2.19 sec.
INFO:docling.cli.main:writing Markdown output to test2.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 2.19 seconds.
And here's the final result I got:
$ cat test2.md
# ABCDEFG
- abc:
- def:
- ghijkl
I also tried converting this document with the Python library, but I got the same output.
Much of the content is missing from the final result. Did I do anything wrong?
PS: I know that inputting and outputting Markdown might be unnecessary, but in my application scenario, I'm not sure in what format users will provide their content. I need to be able to convert various content formats into Markdown.
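For anyone triaging this, a quick way to quantify the loss in the nested-list case above is to count list items before and after conversion. The following is only a rough sketch: `count_list_items` is a hypothetical helper (not part of docling), the regex is a heuristic rather than a real Markdown parser, and the two-space indentation of the sample is an assumption based on the nesting shown in the debug output.

```python
import re

# Heuristic: a list item is a line whose first non-blank token is an
# unordered marker (-, *, +) or an ordered marker (1. or 1)).
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+\S")


def count_list_items(markdown_text: str) -> int:
    """Count lines that look like Markdown list items (hypothetical helper)."""
    return sum(1 for line in markdown_text.splitlines() if LIST_ITEM.match(line))


# Sample input; the two-space indentation is an assumption.
source_md = """\
# ABCDEFG
- abc:
  - abc123:
    - abc1234:
      - abc12345:
        - a.
        - b.
      - abcd1234:
        - abcd12345:
          - a.
          - b.
- def:
  - def1234:
    - def12345.
- ghijkl
"""

# Output reported in this issue.
converted_md = """\
# ABCDEFG
- abc:
- def:
- ghijkl
"""

print(count_list_items(source_md), count_list_items(converted_md))
```

If the two counts differ, some list items were dropped during the round trip.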
Docling version
$ docling --version
Docling version: 2.14.0
Docling Core version: 2.12.1
Docling IBM Models version: 3.1.0
Docling Parse version: 3.0.0
Python version
$ python --version
Python 3.11.10
A similar issue occurs with inline code delimited by backticks (`).
Converting the following Markdown file and exporting it back to Markdown using DocumentConverter().convert("file.md").document.export_to_markdown() results in docling cutting off each list item's text at the first `
Input:
# Contributing
1. Pull the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
Exported Markdown:
# Contributing
- Pull the repository
- Create your feature branch (
- Commit your changes (
- Push to the branch (
- Open a Pull Request
I did a little debugging, and it seems this stems from md_backend. The handling for marko.block.ListItem only considers the first child, ignoring any other children of the ListItem.
https://github.com/DS4SD/docling/blob/fc645ea531ddc67959640b428007851d641c923e/docling/backend/md_backend.py#L212
In my example above, element.children[0] is a Paragraph containing multiple RawText and CodeSpan children. The code reads element.children[0].children[0], which is only the first RawText child, and ignores the rest of the Paragraph.
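To make the difference concrete, here is a minimal sketch of the two access patterns. The `Node` class is a hypothetical stand-in for the parsed tree, not marko's actual classes, and `full_text` is just one possible way to walk all children, not the fix that docling shipped.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    """Hypothetical stand-in for a parsed Markdown node (not marko's classes)."""
    kind: str  # e.g. "ListItem", "Paragraph", "RawText", "CodeSpan"
    children: Union[str, List["Node"]] = field(default_factory=list)


def first_child_text(item: Node) -> str:
    """Mimic the reported behaviour: read only children[0].children[0]."""
    return item.children[0].children[0].children


def full_text(node: Node) -> str:
    """Collect text from every descendant, re-adding backticks around code spans."""
    if isinstance(node.children, str):
        return f"`{node.children}`" if node.kind == "CodeSpan" else node.children
    return "".join(full_text(child) for child in node.children)


# Tree for: Create your feature branch (`git checkout -b feature/AmazingFeature`)
item = Node("ListItem", [
    Node("Paragraph", [
        Node("RawText", "Create your feature branch ("),
        Node("CodeSpan", "git checkout -b feature/AmazingFeature"),
        Node("RawText", ")"),
    ]),
])

print(first_child_text(item))  # text stops where the first RawText ends
print(full_text(item))         # whole paragraph, including the code span
```

The first call reproduces the truncated text seen in the exported Markdown, while the recursive walk keeps the code span and the closing parenthesis.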
@kime541200 @Heremeus Thanks for these findings, I will look into this issue!
@kime541200 a fix for the reported issue has been released with v2.19.0.
@Heremeus I moved the issue you reported in the comment section over to a new issue (#913) to be addressed separately.