The position box obtained through the get_text() method is inaccurate

Open 1339503169 opened this issue 1 year ago • 5 comments

Description of the bug

I ran into a problem while processing a file that is a readable PDF: the positions returned by PyMuPDF's get_text("words") method deviate significantly from the actual positions of the words on the page.

How to reproduce the bug

[Image: 8989fa66-9bff-4f0c-9f05-37c8a393207e pdf_0]

The coordinates are multiplied by two because I rendered the image at twice the page size.

import cv2
import fitz  # PyMuPDF

doc = fitz.open("data/8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf")
page = doc.load_page(0)
words = page.get_text("words")

# Page image that was rendered beforehand at 2x scale
image = cv2.imread("data/8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf_0.png")

for word in words:
    x0, y0, x1, y1, text, block_no, line_no, word_no = word
    # Scale PDF points to image pixels first, then truncate to int
    x0, y0, x1, y1 = [int(i * 2) for i in (x0, y0, x1, y1)]
    cv2.rectangle(image, (x0, y0), (x1, y1), (255, 0, 0), 2)

cv2.imshow("demo", image)
cv2.waitKey(0)

8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf
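For reference, the point-to-pixel conversion used in the loop above can be pulled out into a small helper. This is only a sketch: to_pixels and its zoom parameter are names made up here, not part of PyMuPDF or OpenCV, and it assumes the PNG was rendered at exactly twice the page size. It scales before rounding and rounds outward so the box never shrinks. That changes the result by a pixel or so at most and does not explain the large deviation reported here.

```python
import math


def to_pixels(bbox, zoom=2.0):
    """Convert a PDF-space box (x0, y0, x1, y1) to integer pixel coordinates.

    Hypothetical helper: `zoom` is the factor the page image was rendered at
    (assumed to be 2 in this report).
    """
    x0, y0, x1, y1 = bbox
    # Scale first, then round outward: floor the top-left, ceil the bottom-right.
    return (
        math.floor(x0 * zoom),
        math.floor(y0 * zoom),
        math.ceil(x1 * zoom),
        math.ceil(y1 * zoom),
    )


# to_pixels((10.6, 20.4, 30.5, 40.2)) returns (21, 40, 61, 81),
# whereas int(i) * 2 on the same values gives (20, 40, 60, 80).
```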

Why does this happen, and how can I obtain the correct position information?

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.8

1339503169, Jun 20 '24 09:06

This seems to be a problem with the fonts embedded in this file. I am currently investigating. The first finding is that MuPDF itself computes the coordinates in the same way.

JorjMcKie, Jun 20 '24 09:06

Solution: Call pymupdf.TOOLS.set_small_glyph_heights(True) right after the import, that is, before any search or extraction (use fitz.TOOLS if you import the package as fitz). This forces PyMuPDF to recompute the character bboxes. If you mark the words based on these, you will get correct results:

pymupdf.TOOLS.set_small_glyph_heights(True)
words = page.get_text("words")

Result: [image]
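To see what the setting changes on a given file, one option is to extract the words twice and compare the box heights. The following is a rough sketch, not something posted in this thread: word_boxes and mean_height are hypothetical helper names, and the import assumes a PyMuPDF version that provides the pymupdf module name (with older versions, import fitz instead).

```python
def word_boxes(pdf_path, page_no=0, small_glyph_heights=False):
    """Return (text, (x0, y0, x1, y1)) pairs for the words on one page."""
    import pymupdf  # imported lazily so the pure helper below works without it

    # Global switch: affects every later text extraction and search.
    pymupdf.TOOLS.set_small_glyph_heights(small_glyph_heights)
    with pymupdf.open(pdf_path) as doc:
        page = doc.load_page(page_no)
        # Each "words" item is (x0, y0, x1, y1, text, block_no, line_no, word_no).
        return [(w[4], tuple(w[:4])) for w in page.get_text("words")]


def mean_height(boxes):
    """Average box height (y1 - y0) in PDF points; 0.0 for an empty list."""
    if not boxes:
        return 0.0
    return sum(y1 - y0 for _, (_, y0, _, y1) in boxes) / len(boxes)
```

Usage would be along the lines of mean_height(word_boxes(path, 0, False)) versus mean_height(word_boxes(path, 0, True)); a large difference between the two suggests that the font metrics in the file are what distorts the default boxes.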

JorjMcKie, Jun 20 '24 09:06

MuPDF issue reference: https://bugs.ghostscript.com/show_bug.cgi?id=707833

JorjMcKie, Jun 20 '24 09:06

Thank you, this does work, but I still have questions. Is this a universal solution? Will it affect other files that are correct, and can I use this setting as the default configuration for extraction?

1339503169, Jun 21 '24 01:06

> Thank you, this does work, but I still have questions. Is this a universal solution? Will it affect other files that are correct, and can I use this setting as the default configuration for extraction?

Universal? No - but it is always available. Whether or not the fonts in your PDF are broken is currently under investigation by the MuPDF team - see the link above.
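Because the switch is global, one way to keep it from touching other files is to turn it on only around the extraction that needs it and restore the previous value afterwards. A minimal sketch under assumptions: the small_glyph_heights context manager and its tools parameter are invented here for illustration (tools only exists so the logic can be exercised without PyMuPDF installed). It relies on set_small_glyph_heights() returning the current setting when called without an argument.

```python
from contextlib import contextmanager


@contextmanager
def small_glyph_heights(enabled=True, tools=None):
    """Temporarily set the global small-glyph-heights flag, then restore it."""
    if tools is None:
        import pymupdf  # assumed module name; use `fitz` on older versions

        tools = pymupdf.TOOLS
    previous = tools.set_small_glyph_heights()  # no argument: query only
    tools.set_small_glyph_heights(enabled)
    try:
        yield
    finally:
        # Put the flag back even if the extraction raised an exception.
        tools.set_small_glyph_heights(previous)


# Example use (page is a pymupdf.Page):
#     with small_glyph_heights(True):
#         words = page.get_text("words")
```

Whether to enable the flag for every document instead is a judgment call; scoping it like this simply keeps the default behavior for files whose fonts are fine.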

JorjMcKie, Jun 21 '24 07:06

Our MuPDF team has diligently investigated the case. As expected, the result is that the fonts in your PDF are badly constructed. This causes all conventional bounding box computations to produce wildly wrong results.

Therefore, the workaround I recommended is the only solution you have. The global parameter that enables it actually ignores some of the font information and computes the bounding boxes completely by itself. If you take a close look, you will see that the precision of this workaround is limited too: characters that extend below the baseline (like "g") are not completely included in the bounding box.
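If the clipped descenders matter, for example when cropping word images, a crude option is to extend each box downward by a fraction of its height. This is a heuristic added for illustration and not something PyMuPDF does or recommends: pad_for_descenders and the default fraction of 0.25 are assumptions and would need tuning per font.

```python
def pad_for_descenders(bbox, fraction=0.25):
    """Extend a box (x0, y0, x1, y1) downward by `fraction` of its height.

    PyMuPDF page coordinates grow downward, so the bottom edge is y1.
    Hypothetical helper; the fraction is a guess, not derived from font data.
    """
    x0, y0, x1, y1 = bbox
    height = y1 - y0
    return (x0, y0, x1, y1 + height * fraction)


# pad_for_descenders((0, 10, 5, 20)) returns (0, 10, 5, 22.5)
```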

I recommend that you send your complaints to the PDF creator.

JorjMcKie, Jul 18 '24 14:07