pfcharles

Results 4 comments of pfcharles

Looks like this is a problem with the underlying use of the pdfminer library. The data returned by the pdfminer.layout.LTTextBoxHorizontal object's get_text() method in pdf.py is wrong.

[two_col_not_justified.pdf](https://github.com/user-attachments/files/16041719/two_col_not_justified.pdf) This appears to be related to the document being text-justified, which leaves larger spaces between words. The issue appears to be related to the implementation of find_neighbors in...
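To make the hypothesis concrete, here is a toy sketch of how wider word gaps could break a gap-based grouping step. This is not pdfminer's find_neighbors code. The function group_by_gap, the max_gap threshold, and the word coordinates are all made up for illustration, on the assumption that some grouping decision depends on horizontal distance between words.

```python
from typing import List, Tuple

# Hypothetical word record: (text, x0, x1), horizontal extent in points.
Word = Tuple[str, float, float]


def group_by_gap(words: List[Word], max_gap: float) -> List[List[str]]:
    """Group the words of one line, starting a new group when the gap
    to the previous word exceeds max_gap (an assumed threshold)."""
    groups: List[List[str]] = []
    prev_x1 = None
    for text, x0, x1 in sorted(words, key=lambda w: w[1]):
        if prev_x1 is None or x0 - prev_x1 > max_gap:
            groups.append([])
        groups[-1].append(text)
        prev_x1 = x1
    return groups


# Ragged-right line: gaps of 3 pt between words.
ragged = [("the", 0, 15), ("quick", 18, 45), ("brown", 48, 78), ("fox", 81, 96)]
# Justified line: the same words stretched so the gaps grow to 12 pt.
justified = [("the", 0, 15), ("quick", 27, 54), ("brown", 66, 96), ("fox", 108, 123)]

print(group_by_gap(ragged, max_gap=8.0))     # stays together as one group
print(group_by_gap(justified, max_gap=8.0))  # every word lands in its own group
```

If something like this happens in the real layout analysis, fragments of a justified line could end up attached to the wrong text box, which would match the scrambled output.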

Running this with pdfminer's pdf2txt.py does not scramble this paragraph, so it must be something in unstructured's use of pdfminer.
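A rough way to check this kind of difference is to compare the two text outputs word by word. The sketch below uses only the standard library and assumes you already have both outputs as strings (one from pdf2txt.py, one from unstructured). compare_extractions is a hypothetical helper, not part of either library.

```python
import difflib
from collections import Counter


def compare_extractions(reference: str, candidate: str) -> dict:
    """Report whether two extractions contain the same words and how
    closely their word order matches (1.0 means identical order)."""
    ref_words = reference.split()
    cand_words = candidate.split()
    matcher = difflib.SequenceMatcher(None, ref_words, cand_words, autojunk=False)
    return {
        "same_words": Counter(ref_words) == Counter(cand_words),
        "order_ratio": matcher.ratio(),
    }


# Made-up strings standing in for the two tools' outputs.
reference_text = "the quick brown fox jumps over the lazy dog"
scrambled_text = "jumps over the lazy dog the quick brown fox"

print(compare_extractions(reference_text, reference_text))
print(compare_extractions(reference_text, scrambled_text))
```

A result with same_words true and order_ratio below 1.0 suggests the text was extracted intact but reordered, which points at layout grouping rather than at character decoding.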

No, I have not looked into this further.