
`get_blocks_by_type` does not correctly handle pages without relationships (e.g. blank pages) (Python)

Open MattExact opened this issue 2 years ago • 0 comments

If you call `TDocument.get_blocks_by_type` with a page that has no relationships, it returns results as if no page had been passed, i.e. for the whole document. For example, calling `TDocument.tables` on a blank page returns all tables in the document. I believe this is unwanted and unintended behaviour.

This is due to the condition `if page and page.relationships:`. When the page has no relationships, the condition evaluates as false, so the code falls through to the other branch and returns blocks for the whole document.

`TDocument.relationships_recursive` is used to get the list of blocks on the page, and it should already handle a page block that has no relationships. So I think the condition can simply be `if page:`?

https://github.com/aws-samples/amazon-textract-response-parser/blob/541c07a12d603deed70699357f865d6974369c7b/src-python/trp/trp2.py#L660-L680
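To make the behaviour concrete, here is a minimal, self-contained sketch of the problem. It does not use the real `trp2` classes: `Block`, `Document`, `child_ids`, `blocks_by_type_current`, and `blocks_by_type_proposed` are hypothetical stand-ins that only mirror the shape of the condition described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """Hypothetical stand-in for a Textract block (not the trp2 class)."""
    id: str
    block_type: str
    # Stands in for `relationships`; None models a blank page.
    child_ids: Optional[List[str]] = None


@dataclass
class Document:
    """Hypothetical stand-in for TDocument."""
    blocks: List[Block] = field(default_factory=list)

    def descendants(self, block: Block) -> List[Block]:
        # Walk child links recursively; a block without children yields [].
        by_id = {b.id: b for b in self.blocks}
        found: List[Block] = []
        stack = list(block.child_ids or [])
        while stack:
            child = by_id[stack.pop()]
            found.append(child)
            stack.extend(child.child_ids or [])
        return found

    def blocks_by_type_current(self, block_type: str, page: Optional[Block] = None) -> List[Block]:
        # Mirrors `if page and page.relationships:`; a blank page falls
        # through to the whole-document branch.
        if page and page.child_ids:
            candidates = self.descendants(page)
        else:
            candidates = self.blocks
        return [b for b in candidates if b.block_type == block_type]

    def blocks_by_type_proposed(self, block_type: str, page: Optional[Block] = None) -> List[Block]:
        # Mirrors the proposed `if page:`; a blank page yields no blocks.
        if page:
            candidates = self.descendants(page)
        else:
            candidates = self.blocks
        return [b for b in candidates if b.block_type == block_type]


page_one = Block("p1", "PAGE", ["t1"])
table = Block("t1", "TABLE")
blank_page = Block("p2", "PAGE")  # no relationships
doc = Document([page_one, table, blank_page])

leaked = doc.blocks_by_type_current("TABLE", page=blank_page)
fixed = doc.blocks_by_type_proposed("TABLE", page=blank_page)
print([b.id for b in leaked])  # the table from page one leaks through
print([b.id for b in fixed])   # nothing, as expected for a blank page
```

In this sketch the only difference between the two methods is the condition, and both still return the same result for a page that does have relationships.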

MattExact, Jul 25 '23 15:07