markitdown
Titles and subtitles not recognized on docx documents
I was trying to convert this file using the Python API:
and I got this output:
Default paragraph style
title
subtitle
# heading 1
## heading 2
### heading 3
#### heading 4
block quotation
preformated text
body text
normal
Is there a way to preserve at least the title and subtitle information?
Document: test.docx
This is related to an issue with mammoth not converting headings correctly. https://github.com/mwilliamson/python-mammoth/issues/153
This could be solved with a custom plugin that calls mammoth's convert_to_markdown method directly (instead of converting to HTML first, which is what this package does). Alternatively, wait until there’s a fix in mammoth.
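For reference, calling mammoth directly might look roughly like the sketch below. It only shows the convert_to_markdown call itself, not the plugin wiring. The helper name docx_to_markdown is made up for illustration, and I have not checked whether this route actually recovers titles and subtitles with the current mammoth release.

```python
def docx_to_markdown(path: str) -> str:
    """Convert a .docx file to Markdown with mammoth (hypothetical helper)."""
    # Imported inside the function so this module still loads when
    # mammoth is not installed; the call below requires it.
    import mammoth

    # mammoth expects a binary file object, not a path.
    with open(path, "rb") as docx_file:
        result = mammoth.convert_to_markdown(docx_file)

    # result.messages holds any warnings raised during conversion.
    for message in result.messages:
        print(f"mammoth {message.type}: {message.message}")

    return result.value
```

Usage would be something like print(docx_to_markdown("test.docx")).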
I fixed this by providing a style map to the convert() function.
from markitdown import MarkItDown
style_map = """
p[style-name='Title'] => h1:fresh
p[style-name='Subtitle'] => h2:fresh
"""
md = MarkItDown()
result = md.convert("test.docx", style_map=style_map)
print(result.text_content)
This gives me the desired output:
Default paragraph style
# title
## subtitle
# heading 1
## heading 2
### heading 3
#### heading 4
block quotation
preformated text
body text
normal
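If you have more custom paragraph styles to map, the style map string can be generated instead of written by hand. A minimal sketch, assuming a made-up helper called build_style_map (it is not part of markitdown or mammoth) that uses the same rule syntax as the style map above:

```python
def build_style_map(levels: dict[str, int]) -> str:
    """Build a mammoth style map from {paragraph style name: heading level}.

    Hypothetical helper. Style names containing a single quote are not
    escaped here and would need extra handling.
    """
    rules = []
    for style_name, level in levels.items():
        # HTML only defines h1 through h6.
        if not 1 <= level <= 6:
            raise ValueError(f"heading level out of range for {style_name!r}: {level}")
        rules.append(f"p[style-name='{style_name}'] => h{level}:fresh")
    return "\n".join(rules)


style_map = build_style_map({"Title": 1, "Subtitle": 2})
print(style_map)
```

The resulting string can then be passed as style_map to md.convert(), the same way as the hand-written one.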