page.cluster_drawings extracts a lot of small clusters after upgrading to 1.26

Open klauswong opened this issue 7 months ago • 2 comments

Description of the bug

I have a function that extracts the clustered drawings from a PDF. This function takes much longer after upgrading to 1.26.0 (and 1.26.3).

Here is a simplified version of the function that isolates the problem:

    import fitz  # PyMuPDF

    def _get_clustered_drawings(page: fitz.Page) -> None:
        # Print the bounding rectangle of every drawing cluster on the page.
        for clip in page.cluster_drawings():
            print(clip)

How to reproduce the bug

The test file is a simple PDF exported from Perplexity (with images): what is the tallest mountain on earth.pdf

When getting the clustered drawings with pymupdf<1.26.0, I get 3 clusters and the speed feels 'normal'.

Rect(75.75, 222.75, 79.5, 226.5)
Rect(75.75, 243.75, 79.5, 247.5)
Rect(75.75, 296.25, 79.5, 300.0)

With version >= 1.26.0, I get this long list of clusters, and the call takes significantly longer. The problem gets worse for a longer PDF with more images.

Rect(68.42168426513672, 104.17657470703125, 114.84117889404297, 118.68190002441406)
Rect(140.42724609375, 104.17657470703125, 170.42828369140625, 118.72410583496094)
Rect(239.57530212402344, 103.84222412109375, 326.4324951171875, 118.72410583496094)
Rect(332.87890625, 107.958251953125, 354.9324951171875, 118.72410583496094)
Rect(361.37890625, 104.17657470703125, 409.683837890625, 118.72410583496094)
Rect(120.88815307617188, 103.84222412109375, 135.1599884033203, 118.734619140625)
Rect(175.67724609375, 104.17657470703125, 233.34117126464844, 118.734619140625)
...
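To compare the two behaviours side by side, a small helper along these lines could report the cluster count and elapsed time per page under each installed version. This is only a sketch: `measure_clusters` is a hypothetical name, not part of PyMuPDF, and it relies only on `page.cluster_drawings()` returning a list.

```python
import time


def measure_clusters(page, **kwargs):
    """Return (cluster_count, seconds) for one page.cluster_drawings() call.

    `page` is expected to be a PyMuPDF Page; any keyword arguments are
    passed through to cluster_drawings() unchanged.
    """
    start = time.perf_counter()
    clusters = page.cluster_drawings(**kwargs)
    elapsed = time.perf_counter() - start
    return len(clusters), elapsed


# Possible usage (requires PyMuPDF and the attached PDF):
#
#   import fitz
#   doc = fitz.open("what is the tallest mountain on earth.pdf")
#   for page in doc:
#       count, seconds = measure_clusters(page)
#       print(page.number, count, f"{seconds:.3f}s")
```

Running the same script once with pymupdf<1.26.0 and once with 1.26.x should make the difference in both numbers visible.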

PyMuPDF version

1.26.3

Operating system

MacOS

Python version

3.12

klauswong, Jul 09 '25 09:07

Confirming your observation. This only happens when there is text written with a Type 3 font. In this case, the vector graphics representing the Type 3 characters are being included in the .get_drawings() extraction. For example, in the following picture the red rectangle is the vector graphic and the blue rectangle is the character bbox:

Image

We are currently investigating with the MuPDF team ...
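To check whether a given page is affected, one option is to look for Type 3 fonts in the output of `page.get_fonts()`, whose entries carry the font type as their third item. The sketch below is an illustration only; the helper names `type3_fonts` and `page_uses_type3` are hypothetical.

```python
def type3_fonts(font_entries):
    """Return the entries whose font type is "Type3".

    Each entry is a tuple in the layout produced by Page.get_fonts():
    (xref, ext, type, basefont, name, encoding, ...).
    """
    return [entry for entry in font_entries if entry[2] == "Type3"]


def page_uses_type3(page):
    """True if the PyMuPDF page references at least one Type 3 font."""
    return bool(type3_fonts(page.get_fonts()))


# Possible usage (requires PyMuPDF):
#
#   import fitz
#   doc = fitz.open("what is the tallest mountain on earth.pdf")
#   for page in doc:
#       print(page.number, page_uses_type3(page))
```

Pages for which this returns True are the ones where the extra small clusters would be expected.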

JorjMcKie, Jul 09 '25 13:07

MuPDF issue link: https://bugs.ghostscript.com/show_bug.cgi?id=708875
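Until a fix is available in the installed version, one possible workaround is to drop drawings that sit almost entirely inside a text bbox before clustering. This is an untested sketch and not something proposed in this thread: the helper names are hypothetical, the 0.9 overlap threshold is an arbitrary assumption, and it assumes `Page.cluster_drawings()` accepts a pre-extracted `drawings` list, as recent PyMuPDF releases do. It compares against all text spans, not only Type 3 ones, so legitimate graphics lying fully inside a text bbox (for example small inline marks) would be dropped as well.

```python
def overlap_ratio(inner, outer):
    """Fraction of inner's area that lies inside outer.

    Both rectangles are (x0, y0, x1, y1) sequences.
    """
    ix0, iy0, ix1, iy1 = inner
    ox0, oy0, ox1, oy1 = outer
    width = min(ix1, ox1) - max(ix0, ox0)
    height = min(iy1, oy1) - max(iy0, oy0)
    area = (ix1 - ix0) * (iy1 - iy0)
    if width <= 0 or height <= 0 or area <= 0:
        return 0.0
    return (width * height) / area


def drop_glyph_like_paths(paths, text_boxes, min_overlap=0.9):
    """Keep only the paths whose "rect" is not mostly covered by a text box."""
    kept = []
    for path in paths:
        rect = tuple(path["rect"])
        if any(overlap_ratio(rect, box) >= min_overlap for box in text_boxes):
            continue  # looks like the outline of a text glyph
        kept.append(path)
    return kept


def cluster_without_text_glyphs(page):
    """Cluster the drawings of a PyMuPDF page after filtering glyph-like paths."""
    text_boxes = [
        tuple(span["bbox"])
        for block in page.get_text("dict")["blocks"]
        for line in block.get("lines", [])  # image blocks have no "lines"
        for span in line["spans"]
    ]
    paths = drop_glyph_like_paths(page.get_drawings(), text_boxes)
    return page.cluster_drawings(drawings=paths)
```

Whether the threshold suits a given document depends on how closely the glyph outlines match the character bboxes, so the result should be checked against a known-good page.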

JorjMcKie, Oct 03 '25 14:10