Got "malloc(): unaligned tcache chunk detected Aborted (core dumped)" while using add_redact_annot/apply_redactions
Description of the bug
I was trying to remove all text from PDF files. My python script looks like the following:
for page in document:
    info = json.loads(page.get_text('json', flags=fitz.TEXTFLAGS_TEXT))
    for block_ind, block in enumerate(info['blocks']):
        for line_ind, line in enumerate(block['lines']):
            for span_ind, span in enumerate(line['spans']):
                # print(span)
                page.add_redact_annot(fitz.Rect(*span['bbox']))
    page.apply_redactions()
This code works well, but notice the commented-out # print(span). If I uncomment it and print the span info, I get malloc(): unaligned tcache chunk detected Aborted (core dumped).
This is really strange to me.
Do I need to upload the PDF files or other information? The files contain personal information, so to be honest I would rather not upload them.
How to reproduce the bug
Simply uncommenting the print line reproduces the bug; commenting it out again makes the error go away.
PyMuPDF version
1.24.9
Operating system
Linux
Python version
3.10
You can send me the file via mail, so it won't be exposed here. Is this the only file showing the problem? I am also a little confused: why do you extract all the text at all if you want to remove it anyway? You can simply add one redaction annotation covering the full page. But you should pass options to apply_redactions that prevent the removal of images and graphics. You currently don't do that, although your text might overlap such objects. Anyway, we cannot follow up on the problem without a file at hand.
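A minimal sketch of the full-page approach described above, not a tested fix for the crash. The helper name redact_all_text is made up for illustration. It assumes the documented behaviour that fill=False keeps the redaction rectangle transparent (the default white fill would paint over the whole page) and that the value 0 for images and graphics means "leave untouched" (PDF_REDACT_IMAGE_NONE and PDF_REDACT_LINE_ART_NONE); please check both against the documentation of your installed version.

```python
def redact_all_text(page):
    """Remove all text on a page with one full-page redaction.

    `page` is expected to be a PyMuPDF Page object. The helper itself is
    hypothetical; only add_redact_annot, apply_redactions and page.rect
    are PyMuPDF names.
    """
    # One annotation covering the whole page instead of one per span.
    # fill=False: assumed to suppress the white rectangle that would
    # otherwise be drawn over the redacted area.
    page.add_redact_annot(page.rect, fill=False)
    # images=0 / graphics=0: assumed to mean "ignore", so overlapping
    # images and vector graphics are kept and only text is removed.
    return page.apply_redactions(images=0, graphics=0)


# Possible usage (file names are placeholders):
#   import pymupdf
#   doc = pymupdf.open("input.pdf")
#   for page in doc:
#       redact_all_text(page)
#   doc.ez_save("no-text.pdf")
```

This uses a single annotation per page, so it also avoids creating a large number of redaction annotations before applying them.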
Hello, I just sent you an email with the problem file. It is the only file with the problem.
By the way, if I call apply_redactions each time after add_redact_annot, the code works well: no error, and the result is correct.
for page in document:
    info = json.loads(page.get_text('json', flags=fitz.TEXTFLAGS_TEXT))
    for block_ind, block in enumerate(info['blocks']):
        for line_ind, line in enumerate(block['lines']):
            for span_ind, span in enumerate(line['spans']):
                print(span)
                page.add_redact_annot(fitz.Rect(*span['bbox']))
                page.apply_redactions()
Thanks for the file. I was able to reproduce the problem - but only under Linux: it runs fine under Windows. By the way, I used the following simplified script - there is no need to make a JSON string which you immediately convert back to a Python dictionary. Also note that there is no need to convert 4-tuples to rectangles: all PyMuPDF methods detect Python sequences where points, rectangles or matrices are expected and do the necessary conversions.
import pymupdf
doc = pymupdf.open("test.pdf")
page = doc[0]
blocks = page.get_text("dict", flags=pymupdf.TEXTFLAGS_TEXT)["blocks"]
spans = [s for b in blocks for l in b["lines"] for s in l["spans"]]
for s in spans:
    page.add_redact_annot(s["bbox"])
page.apply_redactions()
print(f"{len(spans)} annots created")
doc.ez_save("redacted.pdf")
This script runs under Windows, but gets the malloc error under Linux.
So how do you want to proceed? We will need to get the MuPDF team involved for a solution, so they would also need the reproducing file, for which I need your OK. PyMuPDF and MuPDF are both maintained by the same company, Artifex, so confidentiality is ensured in any case.
Yes, sure, you can share the file with your team.
Thank you for the improved code.
I'm seeing the same issue on some documents. Unfortunately I'm not able to share them.
Is there a place where we can follow the progress of this issue on the MuPDF side?
In the meantime, has anyone found a workaround for when it happens?
We have a fix for the problem in MuPDF.
I don't yet know when this will be available for use in a PyMuPDF release.
I am also facing the same issue. I am using the PyMuPDF library to apply annotations and redactions, but I intermittently get "malloc(): unaligned tcache chunk detected". It mostly happens when multiple requests come in, but I have seen it with a single request too.
The method that triggers the issue:
page.add_redact_annot(r)
Fixed in PyMuPDF-1.24.14.
Fixed in PyMuPDF-1.24.14. Thanks for the update
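For anyone who wants to check whether their environment already includes the fix, here is a rough sketch using only the standard library. The function names are hypothetical, and the parsing is deliberately simple: it only looks at the leading numeric parts and ignores pre-release suffixes. Note that comparing the raw strings would be wrong, since "1.24.9" sorts after "1.24.14" as text.

```python
from importlib.metadata import PackageNotFoundError, version

# First release reported in this thread as containing the fix.
FIXED_IN = (1, 24, 14)


def parse_version(text):
    """Turn '1.24.9' into (1, 24, 9), keeping only leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_redaction_fix(text):
    """True if the given version string is at least 1.24.14."""
    return parse_version(text) >= FIXED_IN


def installed_has_fix():
    """Check the installed PyMuPDF distribution; None if it is not installed."""
    try:
        return has_redaction_fix(version("PyMuPDF"))
    except PackageNotFoundError:
        return None
```

With the version from the original report, has_redaction_fix("1.24.9") is False, so upgrading would be needed; on older versions, applying redactions right after each annotation is the workaround mentioned earlier in the thread.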