PDF text cannot be extracted after upgrading to a newer version

Open QiusongYang opened this issue 7 months ago • 8 comments

Description of the bug

With 1.24.10, PDF text can be extracted normally. After upgrading to version 1.25, it can no longer be extracted. My initial diagnosis is that the problem lies in the page.get_text("dict") function.

How to reproduce the bug

Unable to provide PDF data

PyMuPDF version

1.26.0

Operating system

Linux

Python version

3.10

QiusongYang avatar Jun 03 '25 06:06 QiusongYang

You did not attach a reproducing file.

JorjMcKie avatar Jun 03 '25 06:06 JorjMcKie

I cannot provide the file, as it contains a client's private data.

QiusongYang avatar Jun 04 '25 06:06 QiusongYang

You can use my e-mail to forward the file. Also please provide a code snippet that demonstrates what you've tried.

JorjMcKie avatar Jun 04 '25 06:06 JorjMcKie

The data has been sent to you privately via email.

import pymupdf

doc = pymupdf.open(file_path)  # file_path: path to the affected PDF
for page in doc:
    blocks = page.get_text("dict")["blocks"]
    # blocks contain no PDF text

QiusongYang avatar Jun 04 '25 07:06 QiusongYang

Certain font properties can be overridden by specifications made in the PDF. Among those properties is the so-called "font bbox": a rectangle inside which the glyphs (roughly, the "images" shown for each character) of the font will fit. For the past few versions, our base library has accepted such property overrides and disregards the corresponding original information in the font's binary file.

In your example, this leads to problems because some fonts' bboxes are empty rectangles, e.g. Rect(0, 0, 0, 0). While you can still extract plain text, any extraction method that delivers text positions, like "dict", will disregard text that has an empty bbox. So you will see nothing returned in such cases.

You can use text extraction flags that tell the base library to disregard problematic property overrides and instead recompute the text boundary boxes. Try the following code snippet and you will see extracted text again:

import pymupdf

doc = pymupdf.open("test.pdf")
page = doc[0]
blocks = page.get_text("dict", flags=pymupdf.TEXT_ACCURATE_BBOXES)["blocks"]
for b in blocks:
    for l in b["lines"]:
        print("".join([s["text"] for s in l["spans"]]))
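
A side note on the snippet above: passing flags= replaces the default flag set of the "dict" method rather than adding to it. The following is a minimal sketch, assuming pymupdf.TEXTFLAGS_DICT is the default for "dict" extraction in your installed version, that ORs the extra flag into the defaults and skips image blocks, which carry no "lines" key. The helper names lines_from_blocks and extract_lines are hypothetical, not part of PyMuPDF.

```python
def lines_from_blocks(blocks):
    """Join the span texts of each line found in get_text("dict") blocks."""
    lines = []
    for block in blocks:
        # Image blocks (type 1) have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            lines.append("".join(span["text"] for span in line["spans"]))
    return lines


def extract_lines(page):
    """Extract text lines, keeping the default "dict" flags plus accurate bboxes."""
    import pymupdf  # imported here so lines_from_blocks() works without PyMuPDF

    # Assumption: TEXTFLAGS_DICT is what "dict" uses when flags is not given.
    flags = pymupdf.TEXTFLAGS_DICT | pymupdf.TEXT_ACCURATE_BBOXES
    blocks = page.get_text("dict", flags=flags)["blocks"]
    return lines_from_blocks(blocks)
```

For example, extract_lines(pymupdf.open("test.pdf")[0]) should return a list of strings, one per text line on the first page.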

JorjMcKie avatar Jun 04 '25 09:06 JorjMcKie

This parameter solves the problem, but is it safe to use universally? Once I change it, it will apply to the parsing of all PDFs.

QiusongYang avatar Jun 05 '25 02:06 QiusongYang

General use of this option is no problem. But you are right: it shouldn't even be necessary. If you provide me with a non-sensitive PDF, I will forward it to the MuPDF team so the problem can be resolved at the root. Regarding the example PDF you sent to my e-mail: could I delete the problematic pages and use the rest?
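
If you would rather not change extraction for every PDF, one option is to apply the flag only when it is needed. Below is a minimal sketch, assuming the symptom is exactly the one described in this thread (plain text is present, but "dict" returns no text spans at all). The names has_text_spans and get_dict_blocks are hypothetical helpers, not PyMuPDF API.

```python
def has_text_spans(blocks):
    """Return True if any block contains a span with non-whitespace text."""
    return any(
        span["text"].strip()
        for block in blocks
        for line in block.get("lines", [])  # image blocks have no "lines"
        for span in line["spans"]
    )


def get_dict_blocks(page):
    """Return "dict" blocks, retrying with recomputed bboxes only if text is missing."""
    import pymupdf  # local import keeps has_text_spans() usable without PyMuPDF

    blocks = page.get_text("dict")["blocks"]
    # Plain text exists but "dict" lost it: the symptom described in this issue.
    if not has_text_spans(blocks) and page.get_text("text").strip():
        flags = pymupdf.TEXTFLAGS_DICT | pymupdf.TEXT_ACCURATE_BBOXES
        blocks = page.get_text("dict", flags=flags)["blocks"]
    return blocks
```

This check only catches pages where all positioned text is lost. If just some fonts on a page have empty bboxes, part of the text would still go missing, so setting the flag unconditionally is the simpler choice.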

JorjMcKie avatar Jun 05 '25 12:06 JorjMcKie

Please help by deleting the names and contact information.

QiusongYang avatar Jun 06 '25 04:06 QiusongYang