The text information obtained by get_text() is partially missing

Open 1339503169 opened this issue 1 year ago • 2 comments

Description of the bug

Attachments: mscbookin.pdf, mscbookin.pdf_0 (image)

While processing this file, I found that the string returned by get_text() is missing some of the text that is visible in the original PDF.

The coordinates are multiplied by 2 because I rendered the image at double scale.
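The scaling step can be sketched on its own, assuming a uniform zoom factor of 2 between PDF points and image pixels. The helper name scale_bbox is hypothetical and not part of PyMuPDF. It multiplies before rounding, because truncating each coordinate first (as int(i) * 2 does) can shift a rectangle by a pixel or two.

```python
def scale_bbox(bbox, zoom=2.0):
    """Scale a (x0, y0, x1, y1) bbox from PDF points to image pixels.

    Hypothetical helper: assumes the image was rendered with the same
    zoom factor on both axes and with no rotation or offset.
    """
    # Multiply first, then round, so fractional points are not lost.
    return tuple(round(v * zoom) for v in bbox)


bbox = (10.4, 20.6, 30.5, 40.0)
scaled = scale_bbox(bbox)                    # multiply, then round
truncated = tuple(int(v) * 2 for v in bbox)  # truncate, then multiply
print(scaled, truncated)
```

The two results differ whenever a coordinate has a fractional part of 0.25 or more, which is common for text block boundaries.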

How to reproduce the bug

import fitz  # PyMuPDF
import cv2

file_path = 'data/mscbookin.pdf'
png_path = 'data/mscbookin.pdf_0.png'  # page 0 rendered at double scale

pdf = fitz.open(file_path)
page = pdf.load_page(0)
image = cv2.imread(png_path)

# Extract every text block on the page as a dictionary
blocks = page.get_text(option='dict', clip=fitz.INFINITE_IRECT())['blocks']

# Draw each block's bounding box, scaled by 2 to match the image
for item in blocks:
    # Scale before rounding so fractional coordinates are not truncated
    x1, y1, x2, y2 = [round(v * 2) for v in item['bbox']]
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)

cv2.imshow('Image with Rectangle', image)
cv2.waitKey(0)
cv2.destroyAllWindows()

PyMuPDF version

1.24.5

Operating system

Windows

Python version

3.8

1339503169 • Jun 26 '24 03:06

There is a difference in the behavior of the base library. I am going to transfer this report to MuPDF's issue tracker and report the tracking number here.

JorjMcKie • Jun 26 '24 08:06

Test outputs: mutool-12311.txt mutool-12404.txt

MuPDF issue number: https://bugs.ghostscript.com/show_bug.cgi?id=707843

JorjMcKie • Jun 26 '24 08:06

The MuPDF team has investigated this case diligently. It turned out that the PDF contains explicit instructions ("ActualText") that cause some of the visible text to be ignored during extraction.

Recent versions of MuPDF, like Adobe Acrobat, honor these ActualText instructions, so both skip the text pieces that the instructions exclude. Other viewers ignore ActualText and may therefore extract text that is meant to be ignored.

We are honoring an explicit choice by the file author, who in effect says "ignore these text particles during extraction".

There is no way to switch between these alternatives.

JorjMcKie • Jul 18 '24 13:07
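For readers who want to check whether a page carries the ActualText instructions mentioned above, here is a rough sketch. It only counts occurrences of the /ActualText key in a page's content stream. The helper names count_actualtext and page_actualtext_count are hypothetical. The check is crude: it does not look inside Form XObjects, the structure tree, or property lists referenced by name, so a count of zero does not prove that ActualText is absent.

```python
def count_actualtext(content: bytes) -> int:
    """Count raw occurrences of the /ActualText key in a content stream."""
    return content.count(b"/ActualText")


def page_actualtext_count(pdf_path: str, page_number: int = 0) -> int:
    """Hypothetical convenience wrapper around PyMuPDF.

    Reads the combined content stream of one page and counts
    /ActualText keys in it.
    """
    import fitz  # PyMuPDF, imported lazily so the counter works without it

    with fitz.open(pdf_path) as doc:
        return count_actualtext(doc[page_number].read_contents())


# A marked-content span that replaces the shown text "abc" with "x"
sample = b"/Span <</ActualText (x)>> BDC BT (abc) Tj ET EMC"
print(count_actualtext(sample))
```

With the file from this report, the call would be page_actualtext_count('data/mscbookin.pdf'). A nonzero result suggests, but does not confirm, that ActualText is the cause of the missing text.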