Documentation mismatch for `get_text_blocks` return value order

Open thelewisking opened this issue 1 year ago • 0 comments

Description of the bug

Hello!

The docstring for get_text_blocks in utils.py describes the return value as follows:

Returns:
        A list of the blocks. Each item contains the containing rectangle
        coordinates, text lines, block type and running block number.

This is slightly misleading: the function returns the result of textpage.extractBLOCKS(), which builds each item as:

litem = (
    blockrect.x0,
    blockrect.y0,
    blockrect.x1,
    blockrect.y1,
    text,
    block_n,
    block.m_internal.type,
)

The actual order of each item returned by get_text_blocks is therefore: "coordinates...", "text", "running block number", "block type".
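
As a minimal sketch of what that order means for callers, the snippet below unpacks one block tuple by position. The sample tuple is made up for illustration (it is not output from the sample PDF), and the describe_block helper is hypothetical, not part of PyMuPDF.

```python
# Hypothetical helper, not part of PyMuPDF: unpack one item as returned by
# page.get_text("blocks"), following the order of the litem tuple above.
def describe_block(block):
    x0, y0, x1, y1, text, block_no, block_type = block
    return {
        "bbox": (x0, y0, x1, y1),
        "text": text,
        "block_no": block_no,      # running block number, second to last
        "block_type": block_type,  # block type, last item
    }


# Made-up sample tuple, for illustration only.
sample = (72.0, 72.0, 300.0, 90.0, "Hello world\n", 1, 0)
info = describe_block(sample)
print(info["block_no"], info["block_type"])
```

Unpacking by name rather than indexing with -1 keeps calling code correct regardless of how the docstring is worded.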

This discrepancy could lead to confusion when using the function. I propose updating the docstring to reflect the actual order of the returned items.

How to reproduce the bug

Using the below sample PDF: sample-pdf-file.pdf

We can extract its text blocks:

import fitz  # PyMuPDF

with fitz.open(r"sample-pdf-file.pdf") as f:
    text = [page.get_text("blocks") for page in f]

assert text[0][1][-1] == 1

The assertion raises an AssertionError. According to the docstring, the last item of this block (the 2nd block extracted, list index 1) should be the running block number, i.e. 1. In the actual output the tuple ends with "..., 1, 0)": the running block number 1 is second to last, and the last item is the block type, 0.

Block type: 0 - Text
Block type: 1 - Image
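
For comparison, here is a sketch of the same check written against the actual order, using named indices instead of -1. It runs on a hand-written stand-in for the page.get_text("blocks") output, so it does not need the sample PDF. The tuples and the constant names BLOCK_NO and BLOCK_TYPE are illustrative assumptions, not values from the sample file or names defined by PyMuPDF.

```python
# Positions within each block tuple, following the order of the litem tuple.
# These constant names are illustrative, not defined by PyMuPDF.
BLOCK_NO = 5    # running block number
BLOCK_TYPE = 6  # block type (last item)

# Hand-written stand-in for one page's page.get_text("blocks") result.
page_blocks = [
    (72.0, 72.0, 300.0, 90.0, "First block\n", 0, 0),
    (72.0, 100.0, 300.0, 118.0, "Second block\n", 1, 0),
]

second = page_blocks[1]
assert second[BLOCK_NO] == 1    # running block number of the 2nd block
assert second[BLOCK_TYPE] == 0  # last item is the block type, not the number
assert second[-1] != 1          # so a check like the one above fails here
```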

PyMuPDF version

1.24.1

Operating system

Windows

Python version

3.12

Apr 14 '24 01:04 thelewisking