Regression in custom image insertion since 1.19.3 (maybe insert_image bug?)
Describe the bug (mandatory)
Code that used to work in 1.19.2 seems not to work after this version, erroring in different ways in the newer versions. The code fails on this line: https://github.com/internetarchive/archive-pdf-tools/blob/master/internetarchivepdf/pdfhacks.py#L361
To Reproduce (mandatory)
Install archive-pdf-tools==1.4.13, and test with this command:
recode_pdf --dpi 300 -J pillow -I /tmp/in.png --hocr-file /tmp/in.hocr -o out.pdf
And then install a newer PyMuPDF than 1.19.2 and try the same command again, you will likely see something like this:
mupdf: invalid key in dict
Traceback (most recent call last):
File "/home/merlijn/archive/archive-pdf-tools/venv/bin/recode_pdf", line 306, in <module>
res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/internetarchivepdf/recode.py", line 713, in recode
write_basic_ua(outdoc, language=lang_if_any)
File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/internetarchivepdf/pdfhacks.py", line 363, in write_basic_ua
to_pdf.update_object(page.xref, page_data)
File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/fitz/fitz.py", line 4912, in update_object
return _fitz.Document_update_object(self, xref, text, page)
RuntimeError: invalid key in dict
corrupted double-linked list
Aborted
Your configuration (mandatory)
- Gentoo Linux, amd64
- Python 3.8, Linux amd64
- PyMuPDF version 1.19.2, 1.19.3, 1.19.4, 1.19.6, from wheel
Additional context (optional)
I think the problem seems to be in how the xref_object(page.xref) call returns the page xref, in version 1.19.2, it looks like this:
<<
/Type /Page
/Parent 2 0 R
/MediaBox [ 0 0 593.28 836.16 ]
/Contents [ 10 0 R 13 0 R 16 0 R ]
/Resources <<
/ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]
/Font <<
/f-0-0 3 0 R
>>
/XObject <<
/fzImg0 12 0 R
/fzImg1 14 0 R
>>
>>
>>
But when it breaks in newer versions, it looks like this:
<<
/Type /Page
/Parent 2 0 R
/MediaBox [ 0 0 593.28 836.16 ]
/Contents [ 10 0 R 13 0 R 16 0 R ]
/Resources <<
/ProcSet [ /PDF /Text /ImageB /ImageI /ImageC ]
/Font <<
/f-0-0 3 0 R
>>
/XObject <<
56 16 0 R
>>
>>
>>
So /fzImg0 is missing. I am not sure if this is caused by another bug which causes my code to act up, or if this is where the problem occurs. In any case, this is what seems to cause the crash.
The file pdfhacks.py contains quite a few hacks, including the fast_insert_image that we discussed in the past, so it's also possible that some other intricate change between 1.19.2 and 1.19.3 broke some of those parts, and everything else is just the problem cascading onwards.
Hm... it looks like when I remove the call to write_basic_ua I get this error when saving:
recode_pdf --dpi 300 -J pillow -I /tmp/in.png --hocr-file /tmp/in.hocr -o out.pdf
evince out.pdfmupdf warnings: "'fzImg0' is no image dict (0 0 R)\n'fzImg0' is no form dict (0 0 R)"
Processed 1 pages at 6.81 seconds/page
So it looks like indeed some other code breaks...
Ok, I've confirmed that the fast_insert_image has indeed broken since 1.19.2, let me see if I can figure out where..
It looks like the contents generated by fast_insert_image is the same, so maybe something changed for Document.insert_image(rect, xref=foo) ?
There was an intermittent error in version >= 1.19.0 which caused segment faults at the end of the scripts - but only if fitz Documents had not been explicitely closed and Page or other objects were still alive.
In those cases, Python's final garbage collection deleted those objects, but not always in the necessary sequence: first Page, then Document. Sometimes the document was deleted, and then the page, which however needs the owning document for its deletion -> boom!
For a reason I yet do not understand, Python ignores some of the code in my Document.__del__() which is there to prevent such things from happening.
In the current 1.19.7, I am using atexit now to close any open documents. This has solved all of my test cases.
What should also work is explicitely closing every document, always. Or always use documents as a context manager (which closes at its __exit__()). In case of unhandled exceptions this may however not always work either.
Maybe this is of some help: https://peps.python.org/pep-0442/
My code above worked on versions before 1.19.3 I think, so could it be an unrelated change? I did see an added pdf_drop_obj (and self.thisown = False) call in the diff between 1.19.2 and 1.19.3, as well as some other insert_image changes.
I actually believe this error is not related to errors at the end of the scripts, I'm pretty sure it's related to how the fast_insert_image works (which you contributed and I changed a bit), as there img ref names seem to disappear.
I am a little short of time these days, so cannot really track down the error shortterm.
But there was indeed a change: I realized that MuPDF image insertion functions automatically find out whether the image has a transparency mask. And if so, do the required creation of associated PDF mask objects. This includes compression with the adequate compression filters DCTDecode, CCITTFaxDecode, FlateDecode, etc. per image type.
So I removed my previous code that handled image masks via pixmaps.
I don't know if this sheds some light on the error. May also be that you do no longer need that fast insertion hack - probably worth finding out.
Would you be able to test again with PyMuPDF-1.20.1? It was released a few hours ago.
The 1.20.x releases have a fix for a potential use-after-free issue.
Even if this doesn't fix this bug, it would be good to see exactly what goes wrong with the current code.
Looks like the problem I reported originally is still there (page to change pageCount to page_count in a few places):
Traceback (most recent call last):
File "/home/merlijn/archive/archive-pdf-tools/venv/bin/recode_pdf", line 294, in <module>
res = recode(args.from_pdf, args.from_imagestack, args.dpi, args.hocr_file,
File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/internetarchivepdf/recode.py", line 726, in recode
write_basic_ua(outdoc, language=lang_if_any)
File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/internetarchivepdf/pdfhacks.py", line 361, in write_basic_ua
to_pdf.update_object(page.xref, page_data)
File "/home/merlijn/archive/archive-pdf-tools/venv/lib/python3.8/site-packages/fitz/fitz.py", line 4878, in update_object
return _fitz.Document_update_object(self, xref, text, page)
RuntimeError: invalid key in dict
corrupted double-linked list
Aborted
I haven't tried to dive into how fast_insert_image changed somehow.
Apologies for the delay in replying.
I've reproduced the problem with pymupdf-1.20.2 and also with current pymupdf master branch + mupdf master branch.
I think PyMuPDF may have an incorrect call of pdf_drop_obj() leading to unsafe use of freed memory. For me, with this call removed, the recode_pdf command succeeds. So hopefully we'll be able to fix this in the next release.
Thank you for confirming, please let me know if you'd like me to test a potential fix. :)
The diff is actually very simple, so you could try it if you like:
diff --git a/fitz/fitz.i b/fitz/fitz.i
index db69d0c..abed9ef 100644
--- a/fitz/fitz.i
+++ b/fitz/fitz.i
@@ -6783,7 +6783,6 @@ if not sanitize and not self.is_wrapped:
h = pdf_to_int(gctx,
pdf_dict_geta(gctx, ref,
PDF_NAME(Height), PDF_NAME(H)));
- pdf_drop_obj(gctx, ref);
if ((w + h) == 0) {
RAISEPY(gctx, MSG_IS_NO_IMAGE, PyExc_ValueError);
}
This is a diff relative to current master, so might not work with older PyMuPDF code.
Fixed in 1.21.0
Thanks, with the new release I have been able to confirm that the problem is indeed fixed. Much appreciated, I can bring the PDF stack of archive.org to the new MuPDF now! :)
That's great, and thanks for letting us know.