Titles and subtitles not recognized on docx documents

Open Vrobin0101 opened this issue 1 year ago • 2 comments

I was trying to convert this file using the Python API:

Image

and I got this output:

Default paragraph style

title

subtitle

# heading 1

## heading 2

### heading 3

#### heading 4

block quotation

preformated text

body text

normal

Is there a way to preserve at least the title and subtitle information?

Document: test.docx

Vrobin0101 avatar Feb 12 '25 09:02 Vrobin0101

This is related to an issue with mammoth not converting headings correctly. https://github.com/mwilliamson/python-mammoth/issues/153

This could be solved with a custom plugin that calls mammoth's convert_to_markdown method (instead of converting to HTML, which is what this package does). Otherwise, wait until there's a fix in mammoth.
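For illustration, here is a minimal sketch of calling mammoth directly, which is the step such a plugin would wrap. The helper name docx_to_markdown and its optional style_map argument are hypothetical and not part of markitdown. Whether this alone restores titles and subtitles is not confirmed here, and the markitdown plugin wiring is left out.

```python
def docx_to_markdown(path, style_map=None):
    """Convert a .docx file to Markdown text with mammoth (hypothetical helper)."""
    # Imported lazily; mammoth is a third-party package (pip install mammoth).
    import mammoth

    with open(path, "rb") as docx_file:
        # convert_to_markdown returns a result object; .value holds the text
        # and .messages holds any conversion warnings.
        result = mammoth.convert_to_markdown(docx_file, style_map=style_map)
    return result.value
```

Calling docx_to_markdown("test.docx") would return the Markdown as a string. Passing a style map is optional in this sketch.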

adamdavidconn avatar Feb 23 '25 12:02 adamdavidconn

I fixed this by providing a style map to the convert() function.

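from markitdown import MarkItDown

# Map the Word "Title" and "Subtitle" paragraph styles to headings;
# without these rules they come out as plain paragraphs.
style_map = """
p[style-name='Title'] => h1:fresh
p[style-name='Subtitle'] => h2:fresh
"""

md = MarkItDown()
# The style map is passed through convert() as a keyword argument.
result = md.convert("test.docx", style_map=style_map)
print(result.text_content)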

which gives me the desired output:

Default paragraph style

# title

## subtitle

# heading 1

## heading 2

### heading 3

#### heading 4

block quotation

preformated text

body text

normal
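If more paragraph styles need mapping, the style map string can be generated instead of written by hand. The following is a minimal sketch under that assumption: build_style_map and convert_docx are hypothetical helper names, not part of markitdown, and the sketch assumes style names contain no single quotes.

```python
def build_style_map(heading_styles):
    """Build a style map string with one rule per Word paragraph style.

    heading_styles maps a style name (e.g. "Title") to a heading level 1-6.
    """
    rules = []
    for style_name, level in heading_styles.items():
        if not 1 <= level <= 6:
            raise ValueError(f"heading level must be 1-6, got {level}")
        rules.append(f"p[style-name='{style_name}'] => h{level}:fresh")
    return "\n".join(rules)


def convert_docx(path, heading_styles):
    """Convert a .docx file with the generated style map (hypothetical helper)."""
    # Imported lazily so build_style_map works without markitdown installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(path, style_map=build_style_map(heading_styles))
    return result.text_content


if __name__ == "__main__":
    print(build_style_map({"Title": 1, "Subtitle": 2}))
    # p[style-name='Title'] => h1:fresh
    # p[style-name='Subtitle'] => h2:fresh
```

With this, convert_docx("test.docx", {"Title": 1, "Subtitle": 2}) would be equivalent to the call shown above.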

RichardAffolter avatar Mar 19 '25 15:03 RichardAffolter