TesseractAgent.gather_data() calculates bounding boxes incorrectly for levels other than WORD
Describe the bug
The bounding boxes returned by, for example, ocr_agent.gather_data(res, agg_level=lp.TesseractFeatureType.BLOCK) do not match the block sizes in the original data. Looking at the code, I think the cause is the step that removes rows where text is NaN (https://github.com/Layout-Parser/layout-parser/blob/0809fa89fef08e34a4c73d5c1285e93ba80dc309/src/layoutparser/ocr/tesseract_agent.py#L146): it drops every level except WORD, so the block ends up only as wide as its longest word.
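To illustrate the suspected mechanism, here is a minimal sketch on a toy DataFrame. It assumes the column names of Tesseract's TSV output (level, block_num, left, top, width, height, text) and made-up coordinates; it is not the library's code, only a reconstruction of the behaviour described above.

```python
import pandas as pd

# Toy stand-in for res["data"]: one block row (Tesseract level 2, no text)
# followed by its two word rows (level 5). All values are invented.
data = pd.DataFrame(
    {
        "level": [2, 5, 5],
        "block_num": [1, 1, 1],
        "left": [10, 10, 70],
        "top": [20, 20, 20],
        "width": [110, 50, 50],
        "height": [15, 15, 15],
        "text": [float("nan"), "Hello", "world"],
    }
)

# Dropping rows without text removes the block row; only words remain.
words = data[~data.text.isna()]

# Aggregating with the maximum word width gives the widest word,
# not the width of the block.
max_word_width = words.groupby("block_num")["width"].max()

# The union of the word boxes recovers the full horizontal extent.
left = words.groupby("block_num")["left"].min()
right = (words["left"] + words["width"]).groupby(words["block_num"]).max()
union_width = right - left
```

In this toy case the widest word is 50 px, while the union of the two word boxes spans the same 110 px that the block row itself reports.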
Checklist
- I have searched related issues but cannot get the expected help.
- The bug has not been fixed in the latest version, see the Layout Parser Releases.
To Reproduce
Steps to reproduce the behavior:
- What command or script did you run?
import layoutparser as lp

# image: the page image to OCR, loaded beforehand
ocr_agent = lp.TesseractAgent()
res = ocr_agent.detect(image, return_response=True)
layout = ocr_agent.gather_data(res, agg_level=lp.TesseractFeatureType.BLOCK)
If you then look at any element more than one word wide, you can see that the block is narrower than the original data indicates:
res["data"].loc[res["data"].level == lp.TesseractFeatureType.BLOCK+1]
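As a possible workaround until this is fixed, the boxes can be read directly from the rows Tesseract reports at the wanted level. The helper below is hypothetical (boxes_at_level is not part of Layout Parser) and assumes the TSV column names level, left, top, width and height; the toy data is invented.

```python
import pandas as pd


def boxes_at_level(data: pd.DataFrame, tesseract_level: int) -> pd.DataFrame:
    """Return corner coordinates for every row reported at the given level.

    Hypothetical helper, not part of Layout Parser.
    """
    rows = data.loc[data["level"] == tesseract_level]
    return pd.DataFrame(
        {
            "x_1": rows["left"],
            "y_1": rows["top"],
            "x_2": rows["left"] + rows["width"],
            "y_2": rows["top"] + rows["height"],
        }
    ).reset_index(drop=True)


# Toy stand-in for res["data"]: one block row (level 2) and two word rows (level 5).
data = pd.DataFrame(
    {
        "level": [2, 5, 5],
        "left": [10, 10, 70],
        "top": [20, 20, 20],
        "width": [110, 50, 50],
        "height": [15, 15, 15],
    }
)

block_boxes = boxes_at_level(data, 2)
```

With a real response, the same selection as above would be boxes_at_level(res["data"], lp.TesseractFeatureType.BLOCK + 1).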
Environment
- Please describe your Platform: Ubuntu Linux
- Please show the Layout Parser version: 0.3.2
Thanks for bringing this up. I totally agree, and yes, I planned to work on this in #81!