PyMuPDF: The text information obtained by get_text() is partially missing

Description of the bug

While processing a PDF file, I found that the string returned by the get_text() method is missing some of the text present in the original PDF.

The bounding-box coordinates are multiplied by 2 because I rendered the page image at double scale.
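For readers following along, here is a minimal sketch of that scaling step, assuming the image was rendered at a uniform zoom factor of 2. The helper name bbox_to_pixels is hypothetical and not part of PyMuPDF. It scales before rounding, so fractional PDF coordinates are not truncated first.

```python
def bbox_to_pixels(bbox, zoom=2.0):
    """Map a PDF bbox (x0, y0, x1, y1) in points to integer pixel coordinates.

    Assumes the page image was rendered with the same zoom on both axes.
    """
    x0, y0, x1, y1 = bbox
    # Scale first, then round, to avoid losing the fractional part early.
    return tuple(int(round(v * zoom)) for v in (x0, y0, x1, y1))


if __name__ == "__main__":
    print(bbox_to_pixels((10.6, 20.4, 100.5, 50.2)))
```

Truncating each coordinate to an integer before multiplying can shift a rectangle edge by a pixel or more at 2x zoom.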

How to reproduce the bug

import fitz  # PyMuPDF
import cv2

file_path = 'data/mscbookin.pdf'
png_path = 'data/mscbookin.pdf_0.png'  # page 0 rendered at 2x scale

pdf = fitz.open(file_path)
page = pdf.load_page(0)
image = cv2.imread(png_path)

# Extract all text blocks on the page, without clipping
blocks = page.get_text(option='dict', clip=fitz.INFINITE_IRECT())['blocks']

for item in blocks:
    # Scale PDF points to image pixels first, then round to integers
    x1, y1, x2, y2 = [int(round(i * 2)) for i in item['bbox']]
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)

cv2.imshow('Image with Rectangle', image)
cv2.waitKey(0)
cv2.destroyAllWindows()

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.8

Jun 26 '24 03:06 1339503169

There is a difference in the behavior of the base library. I am going to transfer this report to MuPDF's issue tracker and report the tracking number here.

Jun 26 '24 08:06 JorjMcKie

Test outputs: mutool-12311.txt mutool-12404.txt

MuPDF issue number: https://bugs.ghostscript.com/show_bug.cgi?id=707843

Jun 26 '24 08:06 JorjMcKie

The MuPDF team has investigated this case diligently. It turned out that the PDF contains explicit instructions ("ActualText") that cause some of the visible text to be ignored during extraction.

Recent versions of MuPDF, like Adobe Acrobat, honor these ActualText instructions, so both skip the text pieces that the instructions exclude. Other viewers ignore ActualText and may therefore extract text that should have been skipped.

We are honoring an explicit choice by the file author, who says "ignore these pieces of text during extraction".

There is no way to switch between these alternatives.
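To check whether a given file is affected, one rough heuristic is to look for the /ActualText key in a page's content stream. The sketch below is an assumption-laden illustration, not an official PyMuPDF feature: the function names are hypothetical, and the check misses ActualText entries that are stored in a resource property dictionary rather than inline in the stream.

```python
import re

# Matches the /ActualText key as it appears inline in a content stream.
ACTUALTEXT_RE = re.compile(rb"/ActualText\b")


def count_actualtext_markers(content_stream: bytes) -> int:
    """Count inline /ActualText keys in raw content stream bytes."""
    return len(ACTUALTEXT_RE.findall(content_stream))


def pages_with_actualtext(pdf_path):
    """Return 0-based page numbers whose content streams mention /ActualText.

    Heuristic only: property lists referenced by name are not inspected.
    """
    import fitz  # PyMuPDF, imported here so the helper above works without it

    with fitz.open(pdf_path) as doc:
        return [
            page.number
            for page in doc
            if count_actualtext_markers(page.read_contents()) > 0
        ]
```

Calling pages_with_actualtext('data/mscbookin.pdf') would list candidate pages; a non-empty result suggests that missing text may be explained by ActualText handling rather than by an extraction defect.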

Jul 18 '24 13:07 JorjMcKie

