PyMuPDF: The position box obtained through the get_text() method is inaccurate

Description of the bug

I ran into a problem while processing a PDF file that contains readable (extractable) text. The word positions returned by PyMuPDF's get_text("words") method deviate significantly from the actual positions of the words on the page.

How to reproduce the bug

The coordinates are multiplied by two because I scaled the image by a factor of two when producing it.

import fitz  # PyMuPDF
import cv2

doc = fitz.open("data/8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf")
page = doc.load_page(0)
words = page.get_text("words")

image = cv2.imread('data/8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf_0.png')

# Draw a rectangle around every extracted word on the 2x image.
for word in words:
    x0, y0, x1, y1, text, block_no, line_no, word_no = word
    x0, y0, x1, y1 = [int(i) * 2 for i in [x0, y0, x1, y1]]
    cv2.rectangle(image, (x0, y0), (x1, y1), (255, 0, 0), 2)

cv2.imshow('demo', image)
cv2.waitKey(0)

Attached file: 8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf
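A side note on the scaling step in the script above: it converts each coordinate to an integer before doubling it, which can move a box edge by up to two pixels. That is a small effect and separate from the large deviation reported here. A minimal sketch of a more careful conversion, assuming the image was rendered at a uniform zoom factor of 2 (the helper name scale_box and the sample values are hypothetical, not part of PyMuPDF):

```python
import math


def scale_box(box, zoom=2.0):
    """Map an (x0, y0, x1, y1) box from PDF points to image pixels.

    Scales first, then rounds outward (floor for the top-left corner,
    ceil for the bottom-right corner) so the pixel box never shrinks.
    """
    x0, y0, x1, y1 = box
    return (
        math.floor(x0 * zoom),
        math.floor(y0 * zoom),
        math.ceil(x1 * zoom),
        math.ceil(y1 * zoom),
    )


# A tuple shaped like one item of page.get_text("words"); values are made up.
word = (10.4, 20.6, 30.2, 40.9, "example", 0, 0, 0)
print(scale_box(word[:4]))  # prints (20, 41, 61, 82)
```

The returned tuple can be split into the two corner points that cv2.rectangle expects.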

Why does this happen, and how can I obtain the correct location information?

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.8

Jun 20 '24 09:06 1339503169

This seems to be a problem with the fonts embedded in this file. Currently investigating. The first finding is that MuPDF itself computes the coordinates in the same way.

Jun 20 '24 09:06 JorjMcKie

Solution: Use pymupdf.TOOLS.set_small_glyph_heights(True) right after the import / before any search or extraction. This will force PyMuPDF to recompute the character bboxes. When marking the words based on this, you will get correct results:

pymupdf.TOOLS.set_small_glyph_heights(True)
words = page.get_text("words")

Jun 20 '24 09:06 JorjMcKie

MuPDF issue reference: https://bugs.ghostscript.com/show_bug.cgi?id=707833

Jun 20 '24 09:06 JorjMcKie

Thank you, this does work, but I still have questions. Is this a universal solution? Will it affect other files that are already extracted correctly, and can I use this setting as the default configuration for extraction?

Jun 21 '24 01:06 1339503169

> Thank you, this does work, but I still have questions. Is this a universal solution? Will it affect other files that are already extracted correctly, and can I use this setting as the default configuration for extraction?

Universal? No - but it is always available. Whether or not the fonts in your PDF are broken is currently under investigation by the MuPDF team - see above link.
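Because the setting is global rather than per document, one way to limit its effect to the files that need it is to switch it on temporarily and restore the previous value afterwards. The following is only a sketch: the context manager small_glyph_heights is a hypothetical helper, not part of PyMuPDF, and it assumes that calling set_small_glyph_heights() without an argument returns the current setting, as the PyMuPDF documentation describes. The tools object is passed in as a parameter and is expected to be pymupdf.TOOLS.

```python
from contextlib import contextmanager


@contextmanager
def small_glyph_heights(tools, enabled=True):
    """Temporarily set the global small-glyph-heights flag.

    `tools` is expected to be pymupdf.TOOLS (assumption: calling
    set_small_glyph_heights() with no argument only queries the value).
    """
    previous = tools.set_small_glyph_heights()  # query the current setting
    tools.set_small_glyph_heights(enabled)
    try:
        yield
    finally:
        # Restore whatever was active before, even if extraction fails.
        tools.set_small_glyph_heights(previous)
```

With PyMuPDF imported, the extraction for a problematic file would then be wrapped as `with small_glyph_heights(pymupdf.TOOLS): words = page.get_text("words")`, leaving all other files on the default computation.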

Jun 21 '24 07:06 JorjMcKie

Our MuPDF team has diligently investigated the case. As expected, the result is that the fonts in your PDF are badly constructed. This causes all conventional bounding box computations to produce wildly wrong results.

Therefore the workaround I recommended is the only solution you have. The global parameter that enables it ignores some of the font information and computes the bounding boxes entirely by itself. If you take a close look, you will see that the precision of this workaround is limited too: characters that extend below the baseline (like "g") are not completely included in the bounding box.

I recommend that you send your complaints to the PDF creator.

Jul 18 '24 14:07 JorjMcKie
