Documentation mismatch for `get_text_blocks` return value order
Description of the bug
Hello!
The docstring for get_text_blocks in utils.py describes the return value as follows:
Returns:
A list of the blocks. Each item contains the containing rectangle
coordinates, text lines, block type and running block number.
This is slightly misleading, because the function calls textpage.extractBLOCKS(), which builds each item as:
litem = (
    blockrect.x0,
    blockrect.y0,
    blockrect.x1,
    blockrect.y1,
    text,
    block_n,
    block.m_internal.type,
)
The actual order of the items returned by get_text_blocks is therefore: "coordinates...", "text", "running block number", "block type". The docstring lists the last two the other way round.
This discrepancy can cause confusion when using the function. I propose updating the docstring to reflect the actual order of the returned items.
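To make the order explicit, here is a minimal sketch that wraps one block tuple in a named tuple, following the field order of the litem construction quoted above. The TextBlock class and the sample values are made up for illustration and are not part of PyMuPDF.

```python
from typing import NamedTuple


class TextBlock(NamedTuple):
    """Hypothetical wrapper mirroring the order of the litem tuple."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str
    block_no: int    # running block number (second-to-last item)
    block_type: int  # block type (last item)


# Made-up sample values, shaped like one item of page.get_text("blocks").
sample = (72.0, 72.0, 300.0, 90.0, "Hello world\n", 1, 0)

block = TextBlock(*sample)
print(block.block_no, block.block_type)
```

With named fields, code reading the blocks no longer depends on remembering whether index -1 is the block number or the block type.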
How to reproduce the bug
Using the sample PDF sample-pdf-file.pdf, we can extract its text blocks:
import fitz

# One list of block tuples per page
with fitz.open(r"sample-pdf-file.pdf") as f:
    text = [page.get_text("blocks") for page in f]
assert text[0][1][-1] == 1
The assertion raises an AssertionError. Going by the docstring, the last item of this block (the second block extracted, list index 1) should be the "running block number", i.e. 1. Instead, the last item is the block type, and the block number sits just before it. The returned tuple ends with:
..., 1, 0)
Block Type: 0 - Text
Block Type: 1 - Image
PyMuPDF version
1.24.1
Operating system
Windows
Python version
3.12