pdf2docx

Is there any way to improve the layout restoration?

Open liuxunfei opened this issue 3 years ago • 3 comments

1804.10371.pdf 1804.10371.docx

liuxunfei • Jun 07 '22 07:06

Hi liuxunfei, it seems that no PDF or docx was uploaded.

dothinking • Jun 07 '22 07:06

src.pdf dst.docx

Hi dothinking, on Windows I used the pdf2docx convert command to convert the PDF above into docx. The pictures, tables, and paragraphs in the docx are out of order, and some paragraphs in the source PDF were turned into tables in the docx. Is this a problem with PDF data parsing, or with data backfilling during layout restoration when the Word document is finally generated? Is there room for optimization?

liuxunfei • Jun 08 '22 02:06

Many thanks for providing a good case.

Is this the problem of PDF data parsing or the problem of data backfilling during layout restoration when word is finally generated?

It's a layout analysis problem. Currently, a very simple layout analysis algorithm is applied. It focuses on converting the floating layout of the PDF into the flowing layout of docx, aiming to produce a docx with a similar look. That is why you see tables so commonly used for layout control.
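
To illustrate why a flowing-layout target ends up with layout tables, here is a minimal sketch. It is not pdf2docx's actual algorithm. It assumes each text block has been reduced to a bounding box, and the `Block` type and `group_into_rows` helper are hypothetical names introduced only for this example. Blocks whose vertical ranges overlap land in the same row, and any row holding more than one block would need a table (or columns) to be reproduced side by side in docx.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """Hypothetical text block: a label plus its bounding box on the page."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_rows(blocks: List[Block]) -> List[List[Block]]:
    """Group blocks whose vertical ranges overlap into the same row."""
    rows: List[List[Block]] = []
    row_bottom = float("-inf")  # largest y1 seen in the current row
    for block in sorted(blocks, key=lambda b: (b.y0, b.x0)):
        if rows and block.y0 < row_bottom:
            rows[-1].append(block)  # overlaps the current row vertically
            row_bottom = max(row_bottom, block.y1)
        else:
            rows.append([block])  # starts a new row
            row_bottom = block.y1
    # Order the cells of each row from left to right.
    return [sorted(row, key=lambda b: b.x0) for row in rows]


# Toy page: a title, two side-by-side columns, and a footer.
blocks = [
    Block("Title", 50, 40, 550, 70),
    Block("Left column", 50, 100, 290, 400),
    Block("Right column", 310, 100, 550, 380),
    Block("Footer", 50, 420, 550, 440),
]
rows = group_into_rows(blocks)
layout = [[b.text for b in row] for row in rows]
```

In this toy page the two columns share a row, so a converter following this approach would emit them as a one-row, two-cell table. The same grouping applied to loosely aligned paragraphs can turn ordinary text into a table.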

Is there room for optimization?

Machine learning is now a powerful technique for layout analysis, but I'm not yet willing to use it because it would make installation and setup harder (e.g., tensorflow or pytorch), especially for beginners. I'm now trying a traditional computer vision method with python-opencv, but it may need more time before a release.
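
As a rough illustration of the kind of traditional, non-ML technique mentioned above, the sketch below uses a vertical projection profile to find a column gutter. This is an assumption about what such an approach could look like, not the method under development, and it uses plain Python lists instead of python-opencv to stay dependency-free. The `find_vertical_gaps` function and the toy `page` mask are hypothetical.

```python
from typing import List, Tuple


def find_vertical_gaps(mask: List[List[int]], min_width: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) pixel-column ranges, end exclusive, with no ink.

    `mask` is a binary page image: 1 = ink, 0 = background.
    """
    if not mask:
        return []
    width = len(mask[0])
    # Vertical projection profile: ink count per pixel column.
    profile = [sum(row[x] for row in mask) for x in range(width)]

    gaps = []
    start = None
    for x, count in enumerate(profile + [1]):  # sentinel closes a trailing gap
        if count == 0 and start is None:
            start = x  # a run of empty columns begins
        elif count != 0 and start is not None:
            if x - start >= min_width:
                gaps.append((start, x))
            start = None
    return gaps


# Toy 3x10 page mask: two ink regions separated by an empty gutter.
page = [
    [1, 1, 1, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 1],
]
gaps = find_vertical_gaps(page)
```

Here the only fully empty run of pixel columns is the gutter between the two ink regions. On a real page, the empty margins would be reported as gaps too, so a caller would need to filter them out before splitting the page into columns.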

dothinking • Jun 13 '22 11:06