Capture printed errors when the input PDF file is corrupted

Open han-xiao-upright opened this issue 1 year ago • 1 comments

Is your feature request related to a problem? Please describe.

I'd like to implement functionality that processes multiple PDF files one by one.

Some of the PDF files are valid, while others are corrupted; the corrupted ones should be ignored.

In my case, reading a corrupted PDF file makes pymupdf print a list of errors instead of raising them.

For instance, the following code

import pymupdf

doc = pymupdf.open("/path/to/a/corrupted/file.pdf")
p0 = doc[0]
p0.get_pixmap(dpi=100)

gives

MuPDF error: library error: zlib error: incorrect header check

MuPDF error: format error: cmsOpenProfileFromMem failed

MuPDF error: library error: zlib error: incorrect header check

MuPDF error: syntax error: syntax error in content stream

MuPDF error: syntax error: syntax error in content stream

Describe the solution you'd like

Arguably, raising the errors makes them easier to handle. So I would like the following:

  • instead of printing the error and continuing to the subsequent code, raise an exception and stop there

Perhaps via a configurable keyword argument:

doc = pymupdf.open("/path/to/a/corrupted/file.pdf")
p0 = doc[0]
p0.get_pixmap(dpi=100, raise_error=True)

han-xiao-upright avatar Jul 02 '24 12:07 han-xiao-upright

Not all errors require raising an exception. On the contrary, MuPDF strives to keep processing by falling back on whatever repair mechanisms are available. You can suppress the display of these messages by setting a global parameter via pymupdf.TOOLS.mupdf_display_errors(False). In any case, you can retrieve the error and warning messages (all collected in the same pymupdf string variable) via pymupdf.TOOLS.mupdf_warnings(reset=True). Each call with reset=True empties that variable. In your case, you can already do this today:

import pymupdf
pymupdf.TOOLS.mupdf_display_errors(False)

# then, at any desired spot (e.g. pixmap creation) do this:
pix = page.get_pixmap()
msg = pymupdf.TOOLS.mupdf_warnings(reset=True)
if "error" in msg:
    raise RuntimeError(msg)

I am very much against an implementation along the lines you suggested. How many dozens or hundreds of methods would we have to change? A focused implementation like the one above serves the same purpose.
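For the batch use case in the original request, the snippet above could be wrapped in a small helper. What follows is only a sketch, not part of PyMuPDF: the names raise_on_mupdf_error, MuPDFMessageError and render_first_pages are hypothetical, and the tools and open_document arguments are assumed to be pymupdf.TOOLS and pymupdf.open. They are passed in explicitly so the helper can also be exercised with stand-in objects. The sketch additionally clears the message buffer before the guarded code runs, on the assumption that messages left over from earlier calls should not count against the current file.

```python
from contextlib import contextmanager


class MuPDFMessageError(RuntimeError):
    """Hypothetical exception type: MuPDF logged an error message."""


@contextmanager
def raise_on_mupdf_error(tools):
    """Raise if MuPDF logged an error while the with-block was running.

    `tools` is assumed to be `pymupdf.TOOLS` (or an object offering the same
    `mupdf_display_errors` and `mupdf_warnings` methods).
    """
    tools.mupdf_display_errors(False)  # stop printing the messages
    tools.mupdf_warnings(reset=True)   # discard messages from earlier calls
    yield
    msg = tools.mupdf_warnings(reset=True)
    if "error" in msg:
        raise MuPDFMessageError(msg)


def render_first_pages(paths, open_document, tools, dpi=100):
    """Render page 0 of every file, skipping files that report errors.

    Returns (pixmaps, skipped): `pixmaps` maps path -> pixmap and
    `skipped` maps path -> the message explaining why it was skipped.
    """
    pixmaps, skipped = {}, {}
    for path in paths:
        try:
            with raise_on_mupdf_error(tools):
                doc = open_document(path)
                pix = doc[0].get_pixmap(dpi=dpi)
        except Exception as exc:  # also covers exceptions raised directly
            skipped[path] = str(exc)
        else:
            pixmaps[path] = pix
    return pixmaps, skipped
```

With PyMuPDF installed, the intended call would be render_first_pages(paths, pymupdf.open, pymupdf.TOOLS). The broad except is a deliberate simplification for a "skip anything that fails" loop; narrow it if some failures should stop the run.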

JorjMcKie avatar Jul 02 '24 13:07 JorjMcKie

Thanks for the insight! Closing the issue.

han-xiao-upright avatar Jul 03 '24 06:07 han-xiao-upright