[datasets] Filter corrupted and wrongly annotated files in ready-to-use datasets
Bug description
While testing #933 I have seen some errors (empty crops / non-UTF-8 strings / and so on). We need to filter out these invalid files/annotations.
- [ ] ensure all ready-to-use datasets work fine with the eval_detection / eval_recognition scripts (TF and PT)
- [ ] unify the recognition part with the recognition dataset / word generator to return the string directly instead of `{labels: ['string']}` #954 (see the sketch after the checklists)
detection:
- [ ] CORD (PT/TF) (NOTE: mem leak with --rotation)
- [ ] FUNSD (PT/TF) (NOTE: mem leak with --rotation)
- [x] IC03 (PT/TF) #983
- [x] IC13 (PT/TF) validated by: @felixdittrich92
- [x] IIIT5K (PT/TF) validated by: @felixdittrich92
- [ ] IMGUR5K (PT/TF) (NOTE: mem leak with --rotation)
- [ ] SROIE (PT/TF) (NOTE: mem leak with --rotation)
- [x] SVHN (PT/TF) validated by: @felixdittrich92
- [x] SVT (PT/TF) #955
- [ ] SynthText (PT/TF) (NOTE: mem leak with --rotation)
recognition:
- [x] MJSynth (PT/TF) #956
- [x] CORD (PT/TF) #983
- [x] FUNSD (PT/TF) #983
- [x] IC03 (PT/TF) #983
- [x] IC13 (PT/TF) validated by: @felixdittrich92
- [x] IIIT5K (PT/TF) validated by: @felixdittrich92
- [ ] IMGUR5K (PT/TF) (NOTE: same as SynthText)
- [x] SROIE (PT/TF) #983 (NOTE: ~99% of labels contain whitespace, so excluding those samples is not possible)
- [x] SVHN (PT/TF) #987
- [x] SVT (PT/TF) #955
- [ ] SynthText (PT/TF) (NOTE: memory leak -> needs another solution instead of pickle)
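For #954, a minimal sketch of what the unification could look like, assuming the legacy recognition target is a `{"labels": ["text"]}` dict with exactly one entry (the helper name and dict layout here are illustrative, not the actual docTR API):

```python
from typing import Any, Dict, Union

def unify_recognition_target(target: Union[str, Dict[str, Any]]) -> str:
    """Hypothetical helper: collapse a legacy {"labels": ["text"]} target
    into the plain string the recognition scripts expect."""
    if isinstance(target, str):  # already unified
        return target
    labels = target["labels"]
    if len(labels) != 1:
        raise ValueError(f"expected exactly one label per crop, got {len(labels)}")
    return labels[0]

assert unify_recognition_target({"labels": ["hello"]}) == "hello"
assert unify_recognition_target("hello") == "hello"
```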
@frgfm (only FYI) If we use polygons there is a memory leak (it doesn't seem to come directly from the datasets); for detection, without use_polygons all datasets work fine.
- [x] profile where it comes from
@felixdittrich92 oh :/ On a specific dataset? during training?
@frgfm Unfortunately no, it affects all detection datasets used with use_polygons=True.
I only noticed it later because some datasets aren't that big, and I hadn't tracked this before :/
It shows up while testing with the eval_detection scripts in both frameworks; the whole dataset init is fine, so it seems to come from one of the transforms.
Alright, let's get to the bottom of this :+1:
https://github.com/bloomberg/memray is super useful to find the mem leak
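(For reference, a memray session could look like `python -m memray run -o output.bin <eval_script>` followed by `python -m memray flamegraph output.bin`; the script path is a placeholder.)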
@frgfm the mem leak is inside LocalizationConfusion, in this call: `iou_mat = polygon_iou(gts, preds, self.mask_shape, self.use_broadcasting)`
OK, as a summary, to identify the mem leak:
- datasets with rotation are fine
- the problem is LocalizationConfusion
- with broadcasting (the default), it would kill any normal machine
- without broadcasting, it is ultra slow
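Back-of-the-envelope numbers for why (the polygon counts and mask_shape below are made up for illustration; the assumption is that the broadcasting path materializes a (N, M, H, W) boolean intermediate inside mask_iou):

```python
import numpy as np

N, M = 500, 500    # hypothetical number of GT / predicted polygons
H, W = 1024, 1024  # hypothetical mask_shape

bool_size = np.dtype(bool).itemsize       # 1 byte per mask pixel
per_stack = N * H * W * bool_size         # masks_1: (N, H, W)
broadcast = N * M * H * W * bool_size     # logical_and/or: (N, M, H, W)

print(f"one mask stack:         {per_stack / 1e9:.2f} GB")  # ~0.52 GB
print(f"broadcast intermediate: {broadcast / 1e9:.0f} GB")  # ~262 GB
```

The per-pair fallback avoids the 4D intermediate but re-rasterizes each polygon of polys_2 once per polygon of polys_1, which explains the "ultra slow" observation.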
```python
from typing import Tuple

import numpy as np

# NB: rbox_to_mask / _rbox_to_mask / mask_iou are helpers from the same
# module (doctr.utils.metrics); they rasterize polygons onto boolean masks.

def polygon_iou(
    polys_1: np.ndarray, polys_2: np.ndarray, mask_shape: Tuple[int, int], use_broadcasting: bool = False
) -> np.ndarray:
    """Computes the IoU between two sets of rotated bounding boxes

    Args:
        polys_1: rotated bounding boxes of shape (N, 4, 2)
        polys_2: rotated bounding boxes of shape (M, 4, 2)
        mask_shape: spatial shape of the intermediate masks
        use_broadcasting: if set to True, leverage broadcasting speedup by consuming more memory

    Returns:
        the IoU matrix of shape (N, M)
    """
    if polys_1.ndim != 3 or polys_2.ndim != 3:
        raise AssertionError("expects boxes to be in format (N, 4, 2)")

    iou_mat: np.ndarray = np.zeros((polys_1.shape[0], polys_2.shape[0]), dtype=np.float32)

    if polys_1.shape[0] > 0 and polys_2.shape[0] > 0:
        if use_broadcasting:
            # Rasterize all polygons at once into two (K, H, W) mask stacks,
            # then compare them with a broadcasted (N, M, H, W) op in mask_iou
            masks_1 = rbox_to_mask(polys_1, shape=mask_shape)
            masks_2 = rbox_to_mask(polys_2, shape=mask_shape)
            iou_mat = mask_iou(masks_1, masks_2)
        else:
            # Save memory by doing the computation for each pair
            for idx, b1 in enumerate(polys_1):
                m1 = _rbox_to_mask(b1, mask_shape)
                for _idx, b2 in enumerate(polys_2):
                    m2 = _rbox_to_mask(b2, mask_shape)
                    iou_mat[idx, _idx] = np.logical_and(m1, m2).sum() / np.logical_or(m1, m2).sum()

    return iou_mat
```
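A possible middle ground, sketched under the assumption that rbox_to_mask returns a (K, H, W) boolean stack (this is just an illustration, not the fix that was merged): rasterize each set once, then compare one GT mask against all prediction masks at a time, so the largest intermediate is (M, H, W) instead of (N, M, H, W).

```python
def polygon_iou_rowwise(
    polys_1: np.ndarray, polys_2: np.ndarray, mask_shape: Tuple[int, int]
) -> np.ndarray:
    """Hypothetical row-wise variant: every polygon is rasterized exactly
    once, and peak intermediate memory is O(M * H * W)."""
    iou_mat = np.zeros((polys_1.shape[0], polys_2.shape[0]), dtype=np.float32)
    if polys_1.shape[0] > 0 and polys_2.shape[0] > 0:
        masks_1 = rbox_to_mask(polys_1, shape=mask_shape).astype(bool)
        masks_2 = rbox_to_mask(polys_2, shape=mask_shape).astype(bool)
        for idx, m1 in enumerate(masks_1):
            # Compare one (H, W) mask against the whole (M, H, W) stack
            inter = np.logical_and(m1[None], masks_2).sum(axis=(1, 2))
            union = np.logical_or(m1[None], masks_2).sum(axis=(1, 2))
            iou_mat[idx] = inter / np.maximum(union, 1)  # guard empty masks
    return iou_mat
```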