pdf2docx: Is there any way to improve the layout restoration?

Jun 07 '22 07:06 liuxunfei

Hi liuxunfei, it seems that no PDF or docx file was uploaded.

Jun 07 '22 07:06 dothinking

Hi dothinking, on Windows, I used the pdf2docx convert command to convert the above PDF to docx. The pictures, tables, and paragraphs in the docx are out of order, and some paragraphs in the source PDF are turned into tables in the docx. Is this a problem with parsing the PDF data, or with writing the data back during layout restoration when the Word file is finally generated? Is there room for optimization?

Jun 08 '22 02:06 liuxunfei

Many thanks for providing a good case.

Is this a problem with parsing the PDF data, or with writing the data back during layout restoration when the Word file is finally generated?

It's a problem with layout analysis. Currently, a very simple layout analysis algorithm is applied. It focuses on converting the floating layout of the PDF into the flowing layout of docx, aiming to produce a docx that looks similar. That is why tables are so commonly used for layout control, including around content that was a plain paragraph in the PDF.
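To illustrate why side-by-side content tends to end up in a table, here is a minimal sketch of the general idea. It is not the actual pdf2docx algorithm; the `Block` class, `group_rows`, and `to_flow` are hypothetical names made up for this example, and the coordinates are invented. Blocks whose vertical extents overlap are treated as one row; a row with a single block becomes a paragraph, and a row with several blocks becomes a table row.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with a bounding box (hypothetical, PDF-like coordinates)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_rows(blocks):
    """Group blocks whose vertical extents overlap into the same row."""
    rows = []
    for b in sorted(blocks, key=lambda b: b.y0):
        if rows and b.y0 < rows[-1]["y1"]:
            # Starts before the current row ends: same row.
            rows[-1]["blocks"].append(b)
            rows[-1]["y1"] = max(rows[-1]["y1"], b.y1)
        else:
            rows.append({"y1": b.y1, "blocks": [b]})
    # Inside each row, order blocks from left to right.
    return [sorted(r["blocks"], key=lambda b: b.x0) for r in rows]


def to_flow(blocks):
    """Turn positioned blocks into a top-to-bottom flow of items."""
    flow = []
    for row in group_rows(blocks):
        if len(row) == 1:
            flow.append(("paragraph", row[0].text))
        else:
            # Side-by-side blocks cannot simply flow, so keep them in a table row.
            flow.append(("table_row", [b.text for b in row]))
    return flow


page = [
    Block("Title", 50, 40, 550, 70),
    Block("Left column text", 50, 100, 290, 300),
    Block("Right column text", 310, 100, 550, 300),
    Block("Footer note", 50, 320, 550, 340),
]
flow = to_flow(page)
for kind, content in flow:
    print(kind, content)
```

In this toy page, the two columns of ordinary text are emitted as one table row, which is the same kind of effect reported above.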

Is there room for optimization?

Machine learning is now a powerful technique for layout analysis, but I'm not yet willing to use it because it would make installation and setup harder (e.g., requiring TensorFlow or PyTorch), especially for beginners. I'm now trying a traditional computer vision approach with python-opencv, but it might take more time before a release.
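As a rough idea of what a traditional, non-ML approach can look like, here is a small sketch of a projection profile, a classic way to find column gaps in a binarized page. It is only an illustration and not the method under development; the function names and the toy page are assumptions, and plain Python lists stand in for an image array so the sketch has no dependencies.

```python
def column_profile(page):
    """Count ink pixels (1s) in every pixel column of a binary page."""
    return [sum(col) for col in zip(*page)]


def find_gaps(profile, min_width):
    """Return (start, end) ranges, end exclusive, of empty runs >= min_width."""
    gaps = []
    start = None
    # A non-zero sentinel at the end closes a run that reaches the right edge.
    for x, count in enumerate(profile + [1]):
        if count == 0:
            if start is None:
                start = x
        elif start is not None:
            if x - start >= min_width:
                gaps.append((start, x))
            start = None
    return gaps


def split_columns(page, min_gap=2):
    """Return (start, end) x-ranges of text columns separated by wide gaps."""
    profile = column_profile(page)
    width = len(profile)
    columns = []
    left = 0
    for g0, g1 in find_gaps(profile, min_gap):
        if g0 > left:
            columns.append((left, g0))
        left = g1
    if left < width:
        columns.append((left, width))
    return columns


# Toy binarized page: 1 = ink, 0 = background.
rows = [
    "1110001111",
    "1100001101",
    "1110000111",
]
page = [[int(c) for c in row] for row in rows]
print(column_profile(page))
print(split_columns(page))
```

The same profile can be taken along rows to split a page horizontally, and applying the two cuts alternately gives a simple recursive page segmentation.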

Jun 13 '22 11:06 dothinking