The text information obtained by get_text() is partially missing
Description of the bug
While processing the file, I found that the string returned by get_text() is missing some of the text that is visible in the original PDF.
The coordinates are multiplied by 2 because I rendered the page image at 2x scale.
How to reproduce the bug
import cv2
import fitz  # PyMuPDF

file_path = 'data/mscbookin.pdf'
png_path = 'data/mscbookin.pdf_0.png'  # page 0 rendered at 2x scale

pdf = fitz.open(file_path)
page = pdf.load_page(0)
image = cv2.imread(png_path)

# Extract the text blocks and draw each block's bounding box on the image.
blocks = page.get_text(option='dict', clip=fitz.INFINITE_IRECT())['blocks']
for item in blocks:
    # Scale PDF coordinates by 2 to match the 2x rendered image.
    x1, y1, x2, y2 = [int(i * 2) for i in item['bbox']]
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)

cv2.imshow('Image with Rectangle', image)
cv2.waitKey(0)
cv2.destroyAllWindows()
PyMuPDF version
1.24.5
Operating system
Windows
Python version
3.8
The behavior of the base library (MuPDF) has changed here. I am going to transfer this report to MuPDF's issue tracker and report the tracking number here.
Test outputs: mutool-12311.txt mutool-12404.txt
MuPDF issue number: https://bugs.ghostscript.com/show_bug.cgi?id=707843
The MuPDF team has investigated this case diligently. It turned out that the PDF contains explicit instructions ("ActualText") that cause some of the visible text to be skipped during extraction.
Recent versions of MuPDF, like Adobe Acrobat, honor these ActualText instructions, so both omit the text pieces that the instructions exclude. Other viewers ignore ActualText and may therefore extract text that should have been skipped.
We are honoring an explicit choice by the file author who says "ignore these text particles as part of the extraction".
There is no way to switch between these alternatives.
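To check whether a given page is affected, one option is to look for ActualText entries in the page's content stream. The following is only a minimal sketch: the helper name find_actualtext is hypothetical, the sample bytes are made up for illustration, and the pattern only covers property lists written inline in the stream with simple (non-nested) literal or hex strings. ActualText stored in a named property list under the page resources, or inside a Form XObject, would not be found this way.

```python
import re

# Matches an /ActualText key followed by a literal string "(...)" or a
# hex string "<...>". Nested parentheses inside literal strings are not
# handled; this is a heuristic scan, not a PDF parser.
ACTUALTEXT_RE = re.compile(
    rb"/ActualText\s*(\((?:\\.|[^\\)])*\)|<[0-9A-Fa-f\s]*>)"
)


def find_actualtext(content: bytes) -> list:
    """Return the raw ActualText operands found in content-stream bytes."""
    return [m.group(1) for m in ACTUALTEXT_RE.finditer(content)]


# Made-up content stream: the visible glyphs are "H", "xx", "i", but the
# marked-content sequence declares "Hi" as the text to extract.
sample = (
    b"/Span <</ActualText (Hi)>> BDC "
    b"BT /F1 12 Tf (H) Tj (xx) Tj (i) Tj ET EMC"
)
print(find_actualtext(sample))
```

With PyMuPDF, the bytes for a real page can presumably be obtained via page.read_contents(), which returns the page's decompressed content streams, and passed to the helper. A non-empty result suggests that get_text() output for that page may differ from the visible text.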