
Got "malloc(): unaligned tcache chunk detected Aborted (core dumped)" while using add_redact_annot/apply_redactions

Open JiahuanChen opened this issue 1 year ago • 5 comments

Description of the bug

I was trying to remove all text from PDF files. My python script looks like the following:

for page in document:
    info = json.loads(page.get_text('json', flags=fitz.TEXTFLAGS_TEXT))
    for block_ind, block in enumerate(info['blocks']):
        for line_ind, line in enumerate(block['lines']):
            for span_ind, span in enumerate(line['spans']):
                # print(span)
                page.add_redact_annot(fitz.Rect(*span['bbox']))
    page.apply_redactions()

This code works well, but notice the commented-out # print(span) line. If I uncomment it and print the spans, I get malloc(): unaligned tcache chunk detected Aborted (core dumped).

This is really strange to me.

Do I need to upload the PDF files or any other information? The files contain personal information, so to be honest I would prefer not to upload them.

How to reproduce the bug

Simply commenting or uncommenting the print line reproduces the bug.

PyMuPDF version

1.24.9

Operating system

Linux

Python version

3.10

JiahuanChen avatar Aug 08 '24 03:08 JiahuanChen

You can send me the file via mail, so it won't be exposed here. Is this the only file showing the problem? I am also a little confused: why do you extract all the text at all if you want to remove it anyway? You can simply add one redaction annotation covering the full page. But you should pass options to apply_redactions that prevent removal of images and graphics. You don't do that currently, although your text might overlap such objects... Anyway, we cannot follow up on the problem without a file at hand.
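
The full-page approach suggested above might look like the following minimal sketch. The function names strip_text_from_page and strip_text_from_file are hypothetical. It assumes that fill=False suppresses the default white fill of the redaction rectangle, and that passing 0 for images and graphics tells apply_redactions to leave those objects untouched; both are worth checking against the documentation of your installed version.

```python
def strip_text_from_page(page):
    """Remove all text on a page with a single full-page redaction.

    `page` is expected to be a PyMuPDF Page object.
    """
    # One annotation covering the whole page instead of one per text span.
    # fill=False (assumption) keeps the redacted area transparent, so the
    # page is not painted over with a white rectangle.
    page.add_redact_annot(page.rect, fill=False)
    # images=0 and graphics=0 (assumption) leave images and vector
    # graphics in place, so only text is removed.
    page.apply_redactions(images=0, graphics=0)


def strip_text_from_file(src, dst):
    """Hypothetical helper: strip text from every page of `src`, save to `dst`."""
    import pymupdf  # imported lazily; only this helper needs the library

    doc = pymupdf.open(src)
    for page in doc:
        strip_text_from_page(page)
    doc.ez_save(dst)
```

Calling strip_text_from_file("test.pdf", "redacted.pdf") would process a whole file. Because only one annotation is created per page, this also avoids the long run of add_redact_annot calls that triggers the crash reported here, although it has not been verified against the problem file.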

JorjMcKie avatar Aug 08 '24 06:08 JorjMcKie

Hello, I just sent you an email with the problem file. It is the only file with the problem.

And by the way, if I call apply_redactions each time after add_redact_annot, the code works well: no error, and the result is correct.

for page in document:
    info = json.loads(page.get_text('json', flags=fitz.TEXTFLAGS_TEXT))
    for block_ind, block in enumerate(info['blocks']):
        for line_ind, line in enumerate(block['lines']):
            for span_ind, span in enumerate(line['spans']):
                print(span)
                page.add_redact_annot(fitz.Rect(*span['bbox']))
                page.apply_redactions()

JiahuanChen avatar Aug 08 '24 09:08 JiahuanChen

Thanks for the file. I was able to reproduce the problem, but only under Linux: it runs fine under Windows. By the way, I used the following simplified script. There is no need to make a JSON string which you immediately convert back to a Python dictionary. Also note that there is no need to convert 4-tuples to rectangles: all PyMuPDF methods detect Python sequences where points, rectangles or matrices are expected and do the necessary conversions.

import pymupdf


doc = pymupdf.open("test.pdf")
page = doc[0]
blocks = page.get_text("dict", flags=pymupdf.TEXTFLAGS_TEXT)["blocks"]
spans = [s for b in blocks for l in b["lines"] for s in l["spans"]]
for s in spans:
    page.add_redact_annot(s["bbox"])
page.apply_redactions()
print(f"{len(spans)} annots created")
doc.ez_save("redacted.pdf")

This script runs under Windows, but gets the malloc error under Linux.

So how do you want to proceed: we will need to get the MuPDF team involved for a solution, so they would also need the reproducing file - for which I need your ok. Of course PyMuPDF and MuPDF are all maintained by the same company Artifex, so confidentiality is secured in any case.

JorjMcKie avatar Aug 08 '24 19:08 JorjMcKie

Yes sure, you could share the file with your team.

Thank you for the improved code.

JiahuanChen avatar Aug 09 '24 00:08 JiahuanChen

I'm seeing the same issue on some documents. Unfortunately I'm not able to share them.

Is there a place where we can follow the progress of this issue on MuPDF's side?

In the meantime, has anyone found a workaround for when this happens?

wapiflapi avatar Oct 11 '24 16:10 wapiflapi

We have a fix for the problem in MuPDF.

I don't yet know when this will be available for use in a PyMuPDF release.

I am also facing the same issue. I am using the PyMuPDF library to apply annotations and redactions, but I intermittently get "malloc(): unaligned tcache chunk detected". It mostly happens when multiple requests come in, but I have seen it with a single request too.

The method that triggers the issue is page.add_redact_annot(r).

shubham1809 avatar Nov 15 '24 09:11 shubham1809

Fixed in PyMuPDF-1.24.14.
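
For anyone who wants to guard against running an affected release, a version check along these lines may help. This is only a sketch: the helper names are hypothetical, and it assumes the package is installed under the distribution name "PyMuPDF" and that 1.24.14 is the first release carrying the fix, as stated in this thread.

```python
import itertools
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (1, 24, 14)  # release named in this thread as containing the fix


def parse_version(text):
    """Turn a version string such as '1.24.14' into a tuple of ints."""
    parts = []
    for piece in text.split(".")[:3]:
        # Keep only the leading digits, so '14rc1' is read as 14.
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def has_redaction_fix(version_text):
    """Return True if the given version is at least FIXED_IN."""
    return parse_version(version_text) >= FIXED_IN


def installed_pymupdf_has_fix():
    """Check the installed PyMuPDF distribution, if there is one."""
    try:
        return has_redaction_fix(version("PyMuPDF"))
    except PackageNotFoundError:
        return False
```

On an older release, the workaround reported earlier in this thread (calling apply_redactions after each add_redact_annot) could be used as a fallback.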

Fixed in PyMuPDF-1.24.14. Thanks for the update

kkk935208447 avatar Nov 19 '24 16:11 kkk935208447