PDF text cannot be extracted after upgrading to a newer version
Description of the bug
With 1.24.10, PDF text is extracted normally. After upgrading to version 1.25, it is no longer extracted. My initial investigation points to the page.get_text("dict") function.
How to reproduce the bug
Unable to provide PDF data
PyMuPDF version
1.26.0
Operating system
Linux
Python version
3.10
You did not attach a reproducing file.
I cannot provide data as it is the client's private data
You can use my e-mail to forward the file. Also please provide a code snippet that demonstrates what you've tried.
The data has been sent to you privately via e-mail.
import pymupdf

doc = pymupdf.open(file_path)  # file_path: path to the affected PDF
for page in doc:
    blocks = page.get_text("dict")["blocks"]
    # blocks do not include any of the PDF's text
Certain font properties can be overridden by specifications made in the PDF. One of these properties is the so-called "font bbox": a rectangle inside which the glyphs (roughly, the "images" shown for each character) of the font will fit.
For the past few versions, our base library has accepted such property overrides and disregards the corresponding original information in the font's binary file.
In your example, this leads to problems because some fonts' bboxes are empty rectangles, e.g. Rect(0, 0, 0, 0).
While you can still extract plain text, any extraction method that delivers text positions, like "dict", will disregard text that has an empty bbox. So you will see nothing returned in such cases.
You can use text extraction flags that tell the base library to disregard problematic property overrides and instead recompute the text bounding boxes.
Try the following code snippet and you will see extracted text again:
import pymupdf

doc = pymupdf.open("test.pdf")
page = doc[0]
# Recompute text bounding boxes instead of trusting the font bbox overrides
blocks = page.get_text("dict", flags=pymupdf.TEXT_ACCURATE_BBOXES)["blocks"]
for b in blocks:
    for l in b["lines"]:
        print("".join([s["text"] for s in l["spans"]]))
This parameter solves the problem, but is it safe to use universally? Once I change it, it will apply to the parsing of all PDFs.
General use of this option is no problem. But you are right: it shouldn't even be necessary. If you provide me with a non-sensitive PDF, I will forward it to the MuPDF team so they can resolve this problem at the root. Regarding the example PDF you sent to my e-mail: could I delete the problematic pages and use the rest?
Please delete the names and contact information.