TesseractAgent.gather_data() calculates bounding boxes incorrectly for levels other than WORD

Open becdridan opened this issue 4 years ago • 1 comment

Describe the bug

The bounding boxes returned by, for example, ocr_agent.gather_data(res, agg_level=lp.TesseractFeatureType.BLOCK) don't match the block extents in the original Tesseract data. Looking at the code, I think the cause is the step that drops rows where text is NaN (https://github.com/Layout-Parser/layout-parser/blob/0809fa89fef08e34a4c73d5c1285e93ba80dc309/src/layoutparser/ocr/tesseract_agent.py#L146): it removes every level except WORD, so the resulting block is only as wide as its longest word.
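
To illustrate the suspected mechanism, here is a minimal sketch in plain Python rather than pandas. The rows, their numbers, and the helper names are all made up for illustration. The "min corner plus max width" aggregation is an assumption based on my reading of the linked code, not a verified copy of it. The union variant is only one possible way to get the block extent back from the word rows.

```python
# Simplified stand-in for rows of the Tesseract data table.
# All values are invented; level 2 = block, level 5 = word.
rows = [
    {"level": 2, "block_num": 1, "left": 10, "top": 20, "width": 200, "height": 30, "text": None},
    {"level": 5, "block_num": 1, "left": 10, "top": 20, "width": 60, "height": 30, "text": "Hello"},
    {"level": 5, "block_num": 1, "left": 80, "top": 22, "width": 130, "height": 28, "text": "world"},
]


def words_only(rows):
    """Drop rows without text; only WORD-level rows carry text, so only they survive."""
    return [r for r in rows if r["text"] is not None]


def bbox_from_max_width(rows):
    """Assumed current behaviour: min corner plus the largest single width/height."""
    x1 = min(r["left"] for r in rows)
    y1 = min(r["top"] for r in rows)
    x2 = x1 + max(r["width"] for r in rows)
    y2 = y1 + max(r["height"] for r in rows)
    return (x1, y1, x2, y2)


def bbox_from_union(rows):
    """Possible alternative: union of the word boxes, using each word's own right/bottom edge."""
    x1 = min(r["left"] for r in rows)
    y1 = min(r["top"] for r in rows)
    x2 = max(r["left"] + r["width"] for r in rows)
    y2 = max(r["top"] + r["height"] for r in rows)
    return (x1, y1, x2, y2)


words = words_only(rows)
narrow = bbox_from_max_width(words)   # box is too narrow: it ends at the longest word's width
union = bbox_from_union(words)        # box covers every word in the block

# Extent of the block-level row that Tesseract itself reported.
block = rows[0]
expected = (
    block["left"],
    block["top"],
    block["left"] + block["width"],
    block["top"] + block["height"],
)

print("max-width box:", narrow)
print("union box:    ", union)
print("block row:    ", expected)
```

In this toy data the union of the word boxes matches the block-level row, while the max-width box stops short of the second word.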

Checklist

I have searched related issues but cannot get the expected help.
The bug has not been fixed in the latest version, see the Layout Parser Releases

To Reproduce

Steps to reproduce the behavior:

What command or script did you run?

import layoutparser as lp

# `image` is the page image, loaded beforehand
ocr_agent = lp.TesseractAgent()
res = ocr_agent.detect(image, return_response=True)
layout = ocr_agent.gather_data(res, agg_level=lp.TesseractFeatureType.BLOCK)

If you then look at any element that spans more than one word, its box is narrower than the block-level row in the original data indicates:

res["data"].loc[res["data"].level == lp.TesseractFeatureType.BLOCK+1]

Environment

Please describe your Platform: Ubuntu Linux
Please show the Layout Parser version: 0.3.2

Feb 14 '22 22:02 becdridan

Thanks for bringing this up. Totally agree, and yes, I planned to work on this in #81!

Feb 15 '22 22:02 lolipopshock