Table extraction from PDF is advertised but completely absent

Open riccardomalavolti opened this issue 4 months ago • 3 comments

Version 0.1.3

docker run --rm -i markitdown:latest < ~/example.pdf > output.md

where example.pdf is a native PDF (not a scanned document).

markitdown extracts the text, but there is no sign of tables: the output is just the extracted text separated by newlines.
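One way to check the "no sign of tables" observation on a given output file is to scan it for a Markdown pipe table. The sketch below is only an illustration: `has_markdown_table` is a hypothetical helper, not part of markitdown, and it assumes tables would be emitted in the GitHub-flavored form (a header row followed by a `---|---` delimiter row) with at least two columns.

```python
import re

# A GFM delimiter row with at least two columns,
# e.g. "| --- | :---: |" or "---|---".
_DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?\s*$")


def has_markdown_table(markdown: str) -> bool:
    """Return True if the text contains a header row followed by a delimiter row.

    Hypothetical helper for illustration; single-column tables are not detected.
    """
    lines = markdown.splitlines()
    # Look at each pair of consecutive lines: (candidate header, candidate delimiter).
    for header, delimiter in zip(lines, lines[1:]):
        if "|" in header and _DELIMITER_ROW.match(delimiter):
            return True
    return False
```

Reading `output.md` and passing its contents to `has_markdown_table` should return `False` for output like the one described here, where table cells come out as plain lines of text.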

riccardomalavolti, Sep 17 '25 06:09

Looking at the source code, you can see that PDF conversion still uses pdfminer, and you can see the effect in the PDF-to-Markdown output. Don't set your expectations too high. These days, OCR models are what is used to convert text, tables, and formulas.

bjfk2006, Sep 18 '25 11:09

If you're still looking to extract tables from PDFs accurately, check out this library

emcf, Oct 01 '25 14:10

Had to install and try this to find out the truth. :(

boldandbusted, Nov 20 '25 15:11