
Crop intersecting bounding boxes to improve precision

Open zas97 opened this issue 3 years ago • 12 comments

🚀 The feature

The "detector" inside the ocr_predictor often generates bounding boxes that intersect with each other horizontally, see examples : image image image

When this happens, the OCR result contains doubled characters or extra wrong characters. For the examples above, here is the extracted text:

Tip: click ont thet top-rightc tcorner rofanimaget toe enlargei it!

Document: selection

Here are youra analysis results inJ JSONf format:

I believe that adding a post-processing step after the detector that checks whether the next box horizontally intersects with the current box and, in that case, crops it would help correct the issue.
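
For illustration, here is a minimal sketch of what such a post-processing step could look like, assuming axis-aligned (x1, y1, x2, y2) word boxes that all belong to the same text line; the helper name crop_horizontal_overlaps and the midpoint-split policy are hypothetical, not anything doctr provides:

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def crop_horizontal_overlaps(boxes: List[Box]) -> List[Box]:
    # Sort left-to-right; assumes all boxes sit on the same text line.
    fixed = [list(b) for b in sorted(boxes, key=lambda b: b[0])]
    for prev, nxt in zip(fixed, fixed[1:]):
        overlap = prev[2] - nxt[0]  # prev.x2 minus next.x1
        if overlap > 0:
            mid = nxt[0] + overlap / 2  # split the overlap at its midpoint
            prev[2] = mid  # shrink the right edge of the left box
            nxt[0] = mid   # shrink the left edge of the right box
    return [tuple(b) for b in fixed]

Sorting and only comparing successive pairs keeps this O(n log n) per line, so it should add little overhead.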

Motivation, pitch

I'm working on a project where we have to read French social security numbers from a variety of documents. I found in my benchmarks that even though doctr is able to extract a lot more text from most documents, it is worse than Tesseract at reading the social security numbers because of this particular error. I believe that adding this post-processing step would largely improve results.

Alternatives

No response

Additional context

Code for reproducing the issue:

%matplotlib inline
import os

# Pick the PyTorch backend before importing doctr
os.environ['USE_TORCH'] = '1'

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Pretrained end-to-end predictor: text detection followed by recognition
predictor = ocr_predictor(pretrained=True, assume_straight_pages=True, preserve_aspect_ratio=False)
doc = DocumentFile.from_images(r"C:\Users\jcapellgracia\Downloads\demo_update.png")
result = predictor(doc)
print(result.pages[0].render())  # plain-text rendering of the page
result.show(doc)                 # overlay the predicted boxes on the image

Image used: demo_update (attached image)

zas97 avatar Apr 21 '22 08:04 zas97

Hi @zas97 :wave: ,

thank you for bringing this to the table. This is already a known issue (#330). If you want to address it, please feel free to open a pull request. Pinging @charlesmindee at this point.

felixdittrich92 avatar Apr 21 '22 09:04 felixdittrich92

Hi @zas97, this is indeed a good idea! The only reservation is that it could considerably slow down the end-to-end model, because we would have to compute intersections between the nearest neighbors of each box, or at least between successive boxes, and then edit the overlapping boxes. Do you have an idea for implementing it in a lightweight way? In any case, we could make it an option in the predictor, to be used only for dense documents.

charlesmindee avatar Apr 22 '22 07:04 charlesmindee

Good idea @zas97 :+1:

Let's explore this for v0.6.0. To move forward with this, there are a few things to discuss:

  1. assuming we have all the intersection information: what do we want to do? Let's take the example of two boxes on a given text "Hello world"; if there is an intersection, we have a few cases:

    • boxes ("Hello", "oworld")
    • boxes ("Hellow", "oworld")
    • boxes ("Hellow", "world")

    The problem now is that we don't have accurate spatial cues for each predicted character. So the safest case is when we have both a text intersection & a spatial intersection, but what do we do then? Random directional (left or right) cropping? Without semantics, and since we don't know whether there is more white space left or right of the letter "o" in the first example, we can't do much confidently :/

  2. evaluating intersections: do we want a symmetrical 1-D figure (like IoU) to define this intersection? Or something more fine-grained?

What do you think?
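
For point 2, a symmetrical 1-D overlap figure could look like the following sketch (horizontal_iou is a hypothetical helper, with boxes as (x1, y1, x2, y2); only the horizontal extents are compared):

def horizontal_iou(a, b):
    # a, b: (x1, y1, x2, y2) boxes; compare horizontal extents only
    inter = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    union = (a[2] - a[0]) + (b[2] - b[0]) - inter
    return inter / union if union > 0 else 0.0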

frgfm avatar Jun 28 '22 16:06 frgfm

Hello, I'm not sure if I can help you with the implementation, but in my opinion the easiest and best solution is to crop the box that has the higher "horizontal_length / nb_chars". The reason for that is that the box that has included a char from another box has also included the space, which means that it will probably have a bigger length/nb_chars ratio.

zas97 avatar Jul 05 '22 08:07 zas97

It won't correct the ("Hellow", "oworld") case, but it will at least correctly solve the rest.

zas97 avatar Jul 05 '22 08:07 zas97

> Hello, I'm not sure if I can help you with the implementation, but in my opinion the easiest and best solution is to crop the box that has the higher "horizontal_length / nb_chars". The reason for that is that the box that has included a char from another box has also included the space, which means that it will probably have a bigger length/nb_chars ratio.

Just to make sure we are talking about the same process, could you illustrate your suggestion almost programmatically, please? :pray:

Also, regarding the denominator "nb_chars": you're talking about the predicted number of characters, right?

frgfm avatar Jul 06 '22 18:07 frgfm

Yes nb_chars = the predicted number of characters

Let's say that we have the words "hello world" but the OCR recognizes "hello oworld". You have two bounding boxes with the same y, and assuming that the width of the characters is always 1, the bounding boxes will be the following:

"hello": x1 = 0, x2 = 5
"oworld": x1 = 4, x2 = 11 (x1 = 4 because the bounding box starts between "hell" and "o" and includes the space between "hello" and "world")

For the first bb, horizontal_length / nb_chars = (5 - 0) / len("hello") = 1. For the second bb, horizontal_length / nb_chars = (11 - 4) / len("oworld") ≈ 1.17.

That means that we should crop the second bounding box, since it has the higher "horizontal_length / nb_chars".

This approach will work assuming that we are not in a case like ("Hellow", "oworld"), and assuming that the x1 of the first word and the x2 of the second word are precise.
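
Expressed as a sketch (width_per_char is a hypothetical helper; the boxes and strings are the ones from the example above):

def width_per_char(box, text):
    # box: (x1, y1, x2, y2); text: the predicted string for that box
    return (box[2] - box[0]) / max(len(text), 1)

# With the numbers above:
# width_per_char((0, 0, 5, 1), "hello")   -> 5 / 5 = 1.0
# width_per_char((4, 0, 11, 1), "oworld") -> 7 / 6 ≈ 1.17  -> crop this box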

zas97 avatar Jul 07 '22 09:07 zas97

Nice, however making the horizontal & character length of a word the priority perhaps means that it will be a bit random in a case like:

  • "thew world" (the first word is shorter, and I agree there will be more space, but the localization is a prediction, so it's not perfect if the "world" crop has some room in it while "thew" is cropped before the end of the "w")

From a visual perspective, the only trustworthy cue I can see is that the space around the overlapping character ("w" in my example) is not the same on the right vs. the left. Arguably, if the prediction has an error, it means that the vertical histogram density of "thew" is heterogeneous while it isn't for "world". Down to earth, that would be the case if:

  • the first bbox is very tight over "the w" and part of the "w" is cropped (density isn't uniform)
  • the second bbox is quite loose, including potential spaces left and right of the word " world " (density is uniform in the center)

Perhaps we could combine both criteria? Now if we go down that road, we need a proxy for this vertical histogram:

  • either actually computing the histogram

What do you think?
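
As a sketch of that proxy, assuming a binarized crop of each word (1 = ink, 0 = background); column_density and edge_whitespace are hypothetical helpers, and the 0.05 threshold is an arbitrary choice:

import numpy as np

def column_density(binary_crop: np.ndarray) -> np.ndarray:
    # binary_crop: 2-D array of a word crop, 1 = ink, 0 = background
    return binary_crop.mean(axis=0)  # fraction of ink pixels per column

def edge_whitespace(binary_crop: np.ndarray, eps: float = 0.05):
    # A loose box has near-empty border columns; a tight box cutting
    # through a glyph does not.
    density = column_density(binary_crop)
    return bool(density[0] < eps), bool(density[-1] < eps)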

frgfm avatar Jul 07 '22 09:07 frgfm

Yeah, you're right that my proposition will be random in too many cases; using a histogram will probably be better.

zas97 avatar Jul 07 '22 15:07 zas97

@zas97 @frgfm I have to throw a very naive approach into this topic 😅 If I'm right, we're talking about overlapping segmentation masks. Currently there are really low threshold values (0.1-0.3), which lead to 'bigger' boxes, so the potential to overlap is really high. What if we increase these values? Normally, I think it would lead to less overlap, since the size of the masks would decrease. Wdyt? Something like pred_mask = (pred_mask > 0.7) (self.bin_thresh >= 0.7)
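
If your doctr version exposes the detection post-processor the way the library's advanced-options docs describe, raising that threshold could look roughly like this (the attribute path may differ between versions):

from doctr.models import ocr_predictor

predictor = ocr_predictor(pretrained=True)

# Raise the probability-map binarization threshold; higher values shrink
# the segmentation masks and should reduce box overlap, at the risk of
# missing faint text.
predictor.det_predictor.model.postprocessor.bin_thresh = 0.7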

felixdittrich92 avatar Jul 07 '22 19:07 felixdittrich92

@felixdittrich92 circling back to this, what "low threshold values" were you referring to? Do you mean the binarization threshold or the expansion value for the base text detection segmentation?

frgfm avatar Sep 16 '22 09:09 frgfm

@frgfm the threshold value for the prob_map, where we decide 0 or 1 :sweat_smile: This should minimize distortion at the edges and thus reduce the overlap when converting to bboxes (I had a similar problem with semantic segmentation, and this solved it).

felixdittrich92 avatar Sep 16 '22 10:09 felixdittrich92

Binarization threshold then. We could increase the value, but the manual tuning will never end either way :sweat_smile: This type of threshold should be a hyperparameter tuned for each detection model :/

frgfm avatar Oct 15 '22 12:10 frgfm

That will be mostly fixed with the next release. For custom detection result manipulation, we added a way to interact with the results in the middle of the pipeline, before cropping and passing the crops to the recognition model. Docs: https://mindee.github.io/doctr/latest/using_doctr/using_models.html#advanced-options
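
Following the CustomHook pattern from the linked docs, an overlap fix could be plugged in mid-pipeline roughly like this (ShrinkOverlapsHook is an illustrative name; per the docs, the hook must return predictions in the same structure, with relative coordinates between 0 and 1):

from doctr.models import ocr_predictor

class ShrinkOverlapsHook:
    def __call__(self, loc_preds):
        # loc_preds holds the detection output with relative coordinates
        # (between 0 and 1); return it in the same structure, e.g. after
        # applying an overlap-cropping fix like the ones discussed above.
        return loc_preds

predictor = ocr_predictor(pretrained=True)
predictor.add_hook(ShrinkOverlapsHook())  # runs before cropping & recognition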

felixdittrich92 avatar Feb 12 '24 20:02 felixdittrich92