bug/Two Column PDF partition results in incorrect text.

Open pfcharles opened this issue 1 year ago • 5 comments

Describe the bug
When running partition on a two-column PDF, text extraction puts characters in the wrong position.

To Reproduce
two_col.pdf

Code snippet that reproduces the issue:
elements = partition("two_col.pdf", strategy="fast")

text attribute of elements[2] = '1. Exchange of Information. The parties agree to exchange Confidential Information for the purpose of (the evaluating a potential business "Purpose") in accordance with this Agreement.'
text attribute of elements[3] = 'relationship'

Actual text from the PDF = '1.Exchange of Information. The parties agree to exchange Confidential Information for the purpose of evaluating a potential business relationship (the "Purpose") in accordance with this Agreement.'

two_col.json

Expected behavior
Extracted text matches the actual text.
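A minimal sketch of how the mismatch could be checked programmatically, using only the Python standard library and the strings quoted in this report. The helper names `tokenize` and `token_diff` are hypothetical and not part of unstructured. Splitting into words and punctuation means the spacing difference after the list number ("1.Exchange" vs "1. Exchange") is ignored, so only the misplaced words are reported.

```python
import difflib
import re

# Strings copied from the report above (elements[2], elements[3], and the PDF text).
extracted_elements = [
    '1. Exchange of Information. The parties agree to exchange Confidential '
    'Information for the purpose of (the evaluating a potential business '
    '"Purpose") in accordance with this Agreement.',
    'relationship',
]
expected_text = (
    '1.Exchange of Information. The parties agree to exchange Confidential '
    'Information for the purpose of evaluating a potential business '
    'relationship (the "Purpose") in accordance with this Agreement.'
)


def tokenize(text):
    """Split text into word and punctuation tokens, discarding whitespace."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_diff(extracted, expected):
    """Return the non-matching opcodes between two token lists."""
    matcher = difflib.SequenceMatcher(a=extracted, b=expected, autojunk=False)
    return [
        (tag, extracted[i1:i2], expected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


got = tokenize(" ".join(extracted_elements))
want = tokenize(expected_text)

# Show where the extracted token order departs from the expected order.
for tag, got_part, want_part in token_diff(got, want):
    print(tag, got_part, want_part)

# True when no token is missing or added and only the order is wrong.
print("same tokens, different order:", sorted(got) == sorted(want) and got != want)
```

For the strings in this report the two token lists contain the same tokens in a different order, which fits the description of text being placed in the wrong position rather than being lost.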

Screenshots image

Environment Info (output of python scripts/collect_env.py)
OS version: macOS-14.5-arm64-arm-64bit
Python version: 3.9.6
unstructured version: 0.14.9
unstructured-inference version: 0.7.36
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41 magic file from /usr/share/file/magic
LibreOffice version: ==> libreoffice: 24.2.4

pfcharles avatar Jun 28 '24 23:06 pfcharles

Looks like this is a problem with the underlying use of the pdfminer library. The data returned by the get_text() method of the pdfminer.layout.LTTextBoxHorizontal object in pdf.py is wrong.

pfcharles avatar Jun 29 '24 01:06 pfcharles

two_col_not_justified.pdf

This appears to be related to the document being text-justified, which leaves larger spaces between words. The issue appears to lie in the implementation of find_neighbors in the pdfminer layout code. To some extent this can be controlled by the LAParams initialized in init_pdfminer. Other libraries such as PyPDF and PDFBox (Java) handle the file with no issue or special configuration.
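A hedged sketch of how different layout settings could be tried with pdfminer.six directly, outside of unstructured. It uses `pdfminer.high_level.extract_text` and `pdfminer.layout.LAParams`, which are part of pdfminer.six. The choice of `char_margin` and the values other than the default are assumptions for illustration only: `char_margin` is picked because it governs how far apart two characters may be and still be joined into the same line, which seems relevant to justified text. Whether any of these values fixes this document is untested, and a large value may merge the two columns. The helper names `candidate_laparams` and `extract_with` are hypothetical.

```python
import sys


def candidate_laparams():
    """Hypothetical settings to try; these are guesses, not recommendations."""
    return [
        {"char_margin": 2.0},  # pdfminer.six default
        {"char_margin": 4.0},  # assumption: tolerate wider gaps in justified text
        {"char_margin": 6.0},  # assumption: may start merging the two columns
    ]


def extract_with(path, **overrides):
    """Extract text from a PDF using LAParams with the given overrides."""
    # Imported here so candidate_laparams() works without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(path, laparams=LAParams(**overrides))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python try_laparams.py two_col.pdf
    for params in candidate_laparams():
        text = extract_with(sys.argv[1], **params)
        # Report whether the phrase from the report comes out in the right order.
        ok = "potential business relationship" in " ".join(text.split())
        print(params, "phrase intact:", ok)
```

Collapsing whitespace before the substring check keeps a line break inside the phrase from being counted as a failure.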

pfcharles avatar Jun 29 '24 17:06 pfcharles

Running this with pdfminer's pdf2txt.py does not scramble this paragraph, so it must be something in unstructured's use of pdfminer.

pfcharles avatar Jul 03 '24 17:07 pfcharles

@pfcharles Have you been able to solve or circumvent this issue?

EDIT: found this here:

"Currently, "hi_res" has difficulty ordering elements for documents with multiple columns. If you have a document with multiple columns that does not have extractable text, we recommend using the "ocr_only" strategy."

My doc does have extractable text, but the result is better.

hmf avatar Dec 05 '24 06:12 hmf

No, I have not looked into this further.

pfcharles avatar Dec 05 '24 16:12 pfcharles