PyMuPDF Regression in custom image insertion since 1.19.3 (maybe insert

Describe the bug (mandatory)

Code that used to work in 1.19.2 seems not to work after this version, erroring in different ways in the newer versions. The code fails on this line: https://github.com/internetarchive/archive-pdf-tools/blob/master/internetarchivepdf/pdfhacks.py#L361

To Reproduce (mandatory)

Install archive-pdf-tools==1.4.13, and test with this command:

recode_pdf --dpi 300 -J pillow -I /tmp/in.png --hocr-file /tmp/in.hocr -o out.pdf

And then install a newer PyMuPDF than 1.19.2 and try the same command again, you will likely see something like this:

mupdf: invalid key in dict
Traceback (most recent call last):
  File "/home/merlijn/archive/archive-pdf-tools/venv/bin/recode_pdf", line 306, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/internetarchivepdf/recode.py", line 713, in recode
    write_basic_ua(outdoc, language=lang_if_any)
  File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/internetarchivepdf/pdfhacks.py", line 363, in write_basic_ua
    to_pdf.update_object(page.xref, page_data)
  File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/fitz/fitz.py", line 4912, in update_object
    return _fitz.Document_update_object(self, xref, text, page)
RuntimeError: invalid key in dict
corrupted double-linked list
Aborted

Your configuration (mandatory)

Gentoo Linux, amd64
Python 3.8, Linux amd64
PyMuPDF version 1.19.2, 1.19.3, 1.19.4, 1.19.6, from wheel

Additional context (optional)

I think the problem seems to be in how the xref_object(page.xref) call returns the page xref, in version 1.19.2, it looks like this:

<<
  /Type /Page
  /Parent 2 0 R
  /MediaBox [ 0 0 593.28 836.16 ]
  /Contents [ 10 0 R 13 0 R 16 0 R ]
  /Resources <<
    /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]
    /Font <<
      /f-0-0 3 0 R
    >>
    /XObject <<
      /fzImg0 12 0 R
      /fzImg1 14 0 R
    >>
  >>
>>

But when it breaks in newer versions, it looks like this:

<<
  /Type /Page
  /Parent 2 0 R
  /MediaBox [ 0 0 593.28 836.16 ]
  /Contents [ 10 0 R 13 0 R 16 0 R ]
  /Resources <<
    /ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]
    /Font <<
      /f-0-0 3 0 R
    >>
    /XObject <<
      56 16 0 R
    >>
  >>
>>

So /fzImg0 is missing. I am not sure if this is caused by another bug which causes my code to act up, or if this is where the problem occurs. In any case, this is what seems to cause the crash.

The file pdfhacks.py contains quite a few hacks, including the fast_insert_image that we discussed in the past, so it's also possible that some other intricate change between 1.19.2 and 1.19.3 broke some of those parts, and everything else is just the problem cascading onwards.

May 03 '22 22:05 MerlijnWajer

testdata.tar.gz

You could use this data for testing

May 03 '22 22:05 MerlijnWajer

Hm... it looks like when I remove the call to write_basic_ua I get this error when saving:

recode_pdf --dpi 300 -J pillow -I /tmp/in.png --hocr-file /tmp/in.hocr -o out.pdf
evince out.pdfmupdf warnings: "'fzImg0' is no image dict (0 0 R)\n'fzImg0' is no form dict (0 0 R)"
Processed 1 pages at 6.81 seconds/page

May 03 '22 23:05 MerlijnWajer

So it looks like indeed some other code breaks...

May 03 '22 23:05 MerlijnWajer

Ok, I've confirmed that the fast_insert_image has indeed broken since 1.19.2, let me see if I can figure out where..

May 03 '22 23:05 MerlijnWajer

It looks like the contents generated by fast_insert_image is the same, so maybe something changed for Document.insert_image(rect, xref=foo) ?

May 03 '22 23:05 MerlijnWajer

There was an intermittent error in version >= 1.19.0 which caused segment faults at the end of the scripts - but only if fitz Documents had not been explicitely closed and Page or other objects were still alive.

In those cases, Python's final garbage collection deleted those objects, but not always in the necessary sequence: first Page, then Document. Sometimes the document was deleted, and then the page, which however needs the owning document for its deletion -> boom! For a reason I yet do not understand, Python ignores some of the code in my Document.__del__() which is there to prevent such things from happening.

In the current 1.19.7, I am using atexit now to close any open documents. This has solved all of my test cases. What should also work is explicitely closing every document, always. Or always use documents as a context manager (which closes at its __exit__()). In case of unhandled exceptions this may however not always work either.

May 04 '22 08:05 JorjMcKie

Maybe this is of some help: https://peps.python.org/pep-0442/

My code above worked on versions before 1.19.3 I think, so could it be an unrelated change? I did see an added pdf_drop_obj (and self.thisown = False) call in the diff between 1.19.2 and 1.19.3, as well as some other insert_image changes.

May 04 '22 09:05 MerlijnWajer

I actually believe this error is not related to errors at the end of the scripts, I'm pretty sure it's related to how the fast_insert_image works (which you contributed and I changed a bit), as there img ref names seem to disappear.

May 04 '22 09:05 MerlijnWajer

I am a little short of time these days, so cannot really track down the error shortterm.

But there was indeed a change: I realized that MuPDF image insertion functions automatically find out whether the image has a transparency mask. And if so, do the required creation of associated PDF mask objects. This includes compression with the adequate compression filters DCTDecode, CCITTFaxDecode, FlateDecode, etc. per image type.

So I removed my previous code that handled image masks via pixmaps.

I don't know if this sheds some light on the error. May also be that you do no longer need that fast insertion hack - probably worth finding out.

May 04 '22 12:05 JorjMcKie

Would you be able to test again with PyMuPDF-1.20.1? It was released a few hours ago.

The 1.20.x releases have a fix for a potential use-after-free issue.

Even if this doesn't fix this bug, it would be good to see exactly what goes wrong with the current code.

Jun 27 '22 19:06 julian-smith-artifex-com

Looks like the problem I reported originally is still there (page to change pageCount to page_count in a few places):

Traceback (most recent call last):
  File "/home/merlijn/archive/archive-pdf-tools/venv/bin/recode_pdf", line 294, in <module>
    res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
  File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/internetarchivepdf/recode.py", line 726, in recode
    write_basic_ua(outdoc, language=lang_if_any)
  File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/internetarchivepdf/pdfhacks.py", line 361, in write_basic_ua
    to_pdf.update_object(page.xref, page_data)
  File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/fitz/fitz.py", line 4878, in update_object
    return _fitz.Document_update_object(self, xref, text, page)
RuntimeError: invalid key in dict
corrupted double-linked list
Aborted

I haven't tried to dive into how fast_insert_image changed somehow.

Jun 28 '22 18:06 MerlijnWajer

Apologies for the delay in replying.

I've reproduced the problem with pymupdf-1.20.2 and also with current pymupdf master branch + mupdf master branch.

I think PyMuPDF may have an incorrect call of pdf_drop_obj() leading to unsafe use of freed memory. For me, with this call removed, the recode_pdf command succeeds. So hopefully we'll be able to fix this in the next release.

Oct 14 '22 15:10 julian-smith-artifex-com

Thank you for confirming, please let me know if you'd like me to test a potential fix. :)

Oct 14 '22 15:10 MerlijnWajer

The diff is actually very simple, so you could try it if you like:

diff --git a/fitz/fitz.i b/fitz/fitz.i
index db69d0c..abed9ef 100644
--- a/fitz/fitz.i
+++ b/fitz/fitz.i
@@ -6783,7 +6783,6 @@ if not sanitize and not self.is_wrapped:
                     h = pdf_to_int(gctx,
                         pdf_dict_geta(gctx, ref,
                         PDF_NAME(Height), PDF_NAME(H)));
-                    pdf_drop_obj(gctx, ref);
                     if ((w + h) == 0) {
                         RAISEPY(gctx, MSG_IS_NO_IMAGE, PyExc_ValueError);
                     }

This is a diff relative to current master, so might not work with older PyMuPDF code.

Oct 14 '22 21:10 julian-smith-artifex-com

Fixed in 1.21.0

Nov 08 '22 16:11 julian-smith-artifex-com

Thanks, with the new release I have been able to confirm that the problem is indeed fixed. Much appreciated, I can bring the PDF stack of archive.org to the new MuPDF now! :)

Nov 08 '22 17:11 MerlijnWajer

That's great, and thanks for letting us know.

Nov 08 '22 17:11 julian-smith-artifex-com

Regression in custom image insertion since 1.19.3 (maybe insert_image bug?)

Describe the bug (mandatory)

To Reproduce (mandatory)

Your configuration (mandatory)

Additional context (optional)