
Nested tables in DOCX are lost when converting to Markdown

Open Wuhall opened this issue 9 months ago • 1 comments

Description:
When converting a DOCX file containing nested tables to Markdown using markitdown, the inner table content is discarded in the output. This occurs consistently with specific document structures.

Steps to Reproduce:

  1. Environment:
    • Device: MacBook Pro with M3 chip

    • Installation:

    pip install -e 'packages/markitdown[all]'
    
  2. Test File:
    • [Attach a minimal DOCX file with nested tables (e.g., outer table → inner table → text)].

  3. Command:

    markitdown path-to-file.docx > document.md
    
  4. Observed Result:
    • Outer table structure is preserved, but inner table content is missing in document.md.

  5. Expected Result:
    • Both outer and inner tables should be rendered in Markdown (e.g., as nested HTML tables or flattened Markdown).
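For step 2, a test file can be generated rather than built by hand. The sketch below writes a bare-bones DOCX package (three parts only) whose body is one table containing a second table in its single cell. The helper names (`paragraph`, `table`, `build_document_xml`, `write_nested_table_docx`) and the output file name are made up for illustration. The package omits styles and section properties, so it is an assumption that every DOCX reader accepts it.

```python
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def paragraph(text):
    """A single paragraph holding one run of plain text."""
    return f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>"


def table(cell_content):
    """A one-row, one-cell table wrapping the given block-level XML."""
    return (
        "<w:tbl><w:tblPr/><w:tblGrid><w:gridCol/></w:tblGrid>"
        f"<w:tr><w:tc>{cell_content}</w:tc></w:tr></w:tbl>"
    )


def build_document_xml():
    """Outer table -> inner table -> text, as described in the issue."""
    inner = table(paragraph("inner cell text"))
    # A table cell has to end with a paragraph, so an empty one
    # follows the nested table.
    outer = table(paragraph("outer cell text") + inner + paragraph(""))
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{outer}</w:body></w:document>'
    )


def write_nested_table_docx(path):
    """Write the three parts of the package into a zip archive at path."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        zf.writestr("word/document.xml", build_document_xml())
```

Calling `write_nested_table_docx("nested-tables.docx")` and then running the command from step 3 on that file should show whether "inner cell text" reaches document.md.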


Wuhall • May 14 '25 01:05

I've reviewed the conversion pipeline and identified an issue with nested table handling:

Current Behavior:

  1. DOCX → HTML conversion works correctly (preserves nested tables)
  2. HTML → Markdown conversion using markdownify fails to properly handle nested table structures

Problem:

  • markdownify flattens nested tables into single-level Markdown tables
  • This causes:
    • Loss of table hierarchy
    • Misaligned columns
    • Broken formatting in complex documents
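To tell whether a particular document hits this problem, it can help to check the intermediate HTML for tables nested inside other tables before it reaches markdownify. A minimal sketch using only the standard library follows. `TableDepthParser` and `max_table_depth` are hypothetical helpers, not part of markitdown or markdownify, and the sketch assumes the HTML has properly closed `<table>` tags.

```python
from html.parser import HTMLParser


class TableDepthParser(HTMLParser):
    """Tracks how deeply <table> elements are nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # tables currently open
        self.max_depth = 0   # deepest nesting seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        # Ignore stray closing tags so the counter never goes negative.
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def max_table_depth(html):
    """Return 0 for no tables, 1 for flat tables, 2 or more for nested ones."""
    parser = TableDepthParser()
    parser.feed(html)
    parser.close()
    return parser.max_depth
```

A result of 2 or more means the HTML contains at least one nested table, which is the case the flattening described above affects.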

I have submitted a PR to markdownify. If you encounter a similar problem, you can add the following code to markdownify/__init__.py:

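def process_tag(self, node, parent_tags=None):
    # Handle nested tables. parent_tags defaults to None, so check it
    # before the membership test to avoid a TypeError.
    if node.name == 'table' and parent_tags and 'table' in parent_tags:
        # This table is nested within another table: return its HTML
        # representation instead of converting it to Markdown.
        return str(node)
    # ... rest of the existing process_tag body, unchanged ...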

Wuhall • May 14 '25 08:05