The position box obtained through the get_text() method is inaccurate
Description of the bug
I ran into this while processing a PDF that has a readable (extractable) text layer. The word positions returned by PyMuPDF's page.get_text("words") deviate significantly from where the words actually appear on the page.
How to reproduce the bug
The coordinates are multiplied by two because the image was rendered at twice the page size.
import fitz  # PyMuPDF
import cv2
doc = fitz.open("data/8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf")
page = doc.load_page(0)
words = page.get_text("words")
image = cv2.imread('data/8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf_0.png')
for word in words:
    x0, y0, x1, y1, text, block_no, line_no, word_no = word
    # Scale PDF points to pixels of the 2x image; scale first, then truncate
    x0, y0, x1, y1 = [int(i * 2) for i in [x0, y0, x1, y1]]
    cv2.rectangle(image, (x0, y0), (x1, y1), (255, 0, 0), 2)
cv2.imshow('demo', image)
cv2.waitKey(0)
Attachment: 8989fa66-9bff-4f0c-9f05-37c8a393207e.pdf
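The coordinate scaling in the script above can be isolated in a small helper. Scaling before converting to integers keeps the fractional part of each coordinate, which matters little next to the deviation reported here but keeps the boxes as tight as the data allows. This is only a sketch: scale_box and its zoom parameter are hypothetical names, not part of PyMuPDF or OpenCV.

```python
def scale_box(box, zoom=2.0):
    """Scale an (x0, y0, x1, y1) box from PDF points to image pixels.

    The multiplication happens before rounding, so no fractional
    part is discarded ahead of the scaling.
    """
    return tuple(int(round(v * zoom)) for v in box)


# Hypothetical word box in PDF points, for an image rendered at 2x.
print(scale_box((10.6, 20.4, 30.9, 40.5)))  # (21, 41, 62, 81)
```

The returned tuple can be split into the two corner points that cv2.rectangle expects.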
Why does this happen, and how can I obtain the correct location information?
PyMuPDF version
1.24.5
Operating system
Windows
Python version
3.8
This seems to be a problem with the fonts embedded in this file. I am currently investigating. A first finding is that MuPDF itself computes the same coordinates.
Solution:
Use pymupdf.TOOLS.set_small_glyph_heights(True) right after the import / before any search or extraction.
This will force PyMuPDF to recompute the character bboxes. When marking the words based on this, you will get correct results:
pymupdf.TOOLS.set_small_glyph_heights(True)
words = page.get_text("words")
Result:
MuPDF issue reference: https://bugs.ghostscript.com/show_bug.cgi?id=707833
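To see what the setting changes on a given page, the word boxes can be extracted once with it off and once with it on, and their average heights compared. The following is a rough sketch, not an official recipe: mean_box_height and compare_heights are hypothetical helper names, and the sketch assumes the setting is picked up by each new get_text("words") call.

```python
def mean_box_height(words):
    """Average height (y1 - y0) of word tuples as returned by get_text("words")."""
    heights = [w[3] - w[1] for w in words]
    return sum(heights) / len(heights) if heights else 0.0


def compare_heights(page):
    """Return (default_height, small_height) for a PyMuPDF page.

    Leaves the setting switched on afterwards, as recommended above.
    """
    import pymupdf  # imported lazily so mean_box_height has no dependency

    pymupdf.TOOLS.set_small_glyph_heights(False)
    default_height = mean_box_height(page.get_text("words"))
    pymupdf.TOOLS.set_small_glyph_heights(True)
    small_height = mean_box_height(page.get_text("words"))
    return default_height, small_height
```

For a file with badly built fonts like the one in this issue, the two averages would be expected to differ noticeably; for well-formed fonts the difference should be much smaller.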
Thank you, this does work, but I still have questions. Is this a universal solution? Will it affect other files that are extracted correctly today, and can I use this setting as the default configuration for extraction?
Universal? No - but it is always available. Whether or not the fonts in your PDF are broken is currently under investigation by the MuPDF team - see the link above.
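If you would rather not make this a global default, one option is to switch the setting on only around the extraction calls that need it and restore the previous value afterwards. Below is a minimal sketch: the context manager small_glyph_heights is a hypothetical helper, and it assumes that calling set_small_glyph_heights() without an argument returns the current setting without changing it.

```python
from contextlib import contextmanager


@contextmanager
def small_glyph_heights(tools, enabled=True):
    """Temporarily set the small-glyph-heights flag on a TOOLS-like object."""
    previous = tools.set_small_glyph_heights()  # assumed: no argument = query only
    tools.set_small_glyph_heights(enabled)
    try:
        yield
    finally:
        # Restore whatever was active before, even if extraction raised.
        tools.set_small_glyph_heights(previous)


# Intended usage (requires PyMuPDF and an open page):
#     import pymupdf
#     with small_glyph_heights(pymupdf.TOOLS):
#         words = page.get_text("words")
```

The tools object is passed in rather than imported inside the helper, so the same code can be exercised with a stand-in object in tests.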
Our MuPDF team has investigated the case thoroughly. As expected, the result is that the fonts in your PDF are badly constructed. This causes all conventional bounding box computations to produce wildly wrong results.
Therefore the workaround I recommended is the only solution you have. The global setting that enables it ignores some of the font information and computes the bounding boxes on its own. If you take a close look, you will see that the precision of this workaround is limited too: characters that extend below the baseline (like "g") are not completely included in the bounding box.
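If the clipped descenders matter for your use case (for example when cropping word images), one crude option is to extend each box downwards by a fraction of its height. This is only a heuristic sketch: pad_for_descenders is a hypothetical helper, and the default fraction of 0.25 is an arbitrary guess rather than a value derived from font metrics, so boxes may end up too tall or overlap the next line.

```python
def pad_for_descenders(box, fraction=0.25):
    """Extend an (x0, y0, x1, y1) box downwards by a fraction of its height.

    PyMuPDF page coordinates grow downwards, so the bottom edge is y1.
    """
    x0, y0, x1, y1 = box
    return (x0, y0, x1, y1 + (y1 - y0) * fraction)


# Hypothetical word box, 10 points high.
print(pad_for_descenders((0, 10, 50, 20)))  # (0, 10, 50, 22.5)
```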
I recommend that you send your complaints to the PDF creator.