Capture printed errors when the input PDF file is corrupted
Is your feature request related to a problem? Please describe.
I'd like to implement functionality that processes multiple PDF files one by one.
Some of the PDF files are "valid" while others are corrupted, in which case they should be ignored.
In my case, reading corrupted PDF files makes pymupdf print a list of errors instead of raising them.
For instance, the following code
import pymupdf
doc = pymupdf.open("/path/to/a/corrupted/file.pdf")
p0 = doc[0]
p0.get_pixmap(dpi=100)
gives
MuPDF error: library error: zlib error: incorrect header check
MuPDF error: format error: cmsOpenProfileFromMem failed
MuPDF error: library error: zlib error: incorrect header check
MuPDF error: syntax error: syntax error in content stream
MuPDF error: syntax error: syntax error in content stream
Describe the solution you'd like
Arguably, raising the errors makes handling them easier. So I would like the following:
- instead of printing the error and continuing to subsequent code, raise an exception and stop there
Perhaps via a configurable keyword argument:
doc = pymupdf.open("/path/to/a/corrupted/file.pdf")
p0 = doc[0]
p0.get_pixmap(dpi=100, raise_error=True)
Not all errors require raising an exception. On the contrary, MuPDF strives to keep processing by falling back to whatever repair mechanisms are available.
You can suppress the display of these messages by setting a global parameter via pymupdf.TOOLS.mupdf_display_errors(False).
In any case, you can extract the error and warning messages (all collected in the same pymupdf string variable) via pymupdf.TOOLS.mupdf_warnings(reset=True). Each call with reset=True empties that variable.
In your case, you can already do this today:
import pymupdf
pymupdf.TOOLS.mupdf_display_errors(False)
# then, at any desired spot (e.g. pixmap creation) do this:
pix = page.get_pixmap()
msg = pymupdf.TOOLS.mupdf_warnings(reset=True)
if "error" in msg:
    raise RuntimeError(msg)
I am very much against an implementation like the one you proposed. How many dozens or hundreds of methods would we have to change? A focused approach like the one indicated above serves the same purpose.
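For the batch use case described in the request, the pattern above could be packaged into a small context manager. This is only a sketch: `raise_on_mupdf_error` and `render_valid_pdfs` are hypothetical names, not part of PyMuPDF, and the code assumes that calling `mupdf_display_errors()` without an argument returns the current setting. Like the snippet above, it treats any collected message containing "error" as fatal, which may be stricter than needed for files that MuPDF can repair.

```python
from contextlib import contextmanager


@contextmanager
def raise_on_mupdf_error(tools):
    """Turn messages collected by MuPDF into a RuntimeError.

    `tools` is expected to be `pymupdf.TOOLS`; it is passed in as an
    argument so the helper can also be tried with a stand-in object.
    """
    previous = tools.mupdf_display_errors()  # assumed: returns current setting
    tools.mupdf_display_errors(False)        # stop printing messages
    tools.mupdf_warnings(reset=True)         # discard messages from earlier calls
    try:
        yield
        msg = tools.mupdf_warnings(reset=True)
        if "error" in msg:
            raise RuntimeError(msg)
    finally:
        tools.mupdf_display_errors(previous)  # restore the earlier setting


def render_valid_pdfs(paths, dpi=100):
    """Render page 0 of each file; skip files that produce MuPDF errors."""
    import pymupdf  # imported lazily so the helper above stays dependency-free

    pixmaps, skipped = {}, {}
    for path in paths:
        try:
            with raise_on_mupdf_error(pymupdf.TOOLS):
                doc = pymupdf.open(path)
                pixmaps[path] = doc[0].get_pixmap(dpi=dpi)
        except Exception as exc:  # also covers files that fail to open at all
            skipped[path] = str(exc)
    return pixmaps, skipped
```

Calling `render_valid_pdfs(["/path/to/a.pdf", "/path/to/b.pdf"])` would then return the rendered pixmaps together with a mapping from each skipped file to its collected messages, so corrupted inputs can be logged and ignored without changing any PyMuPDF method signatures.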
Thanks for the insight! Closing the issue.