[datasets] Filter corrupted and wrongly annotated files in ready-to-use datasets
Bug description
While testing #933 I have seen some errors (empty crops / non-UTF-8 strings / and so on). We need to filter out these invalid files/annotations.
- [ ] ensure all ready-to-use datasets work fine with the eval_detection / eval_recognition scripts (TF and PT)
- [ ] unify the recognition part with the recognition dataset / word generator to return the string directly instead of `{labels: ['string']}` #954 (see the sketch after the checklists)
detection:
- [ ] CORD (PT/TF) (NOTE: mem leak with --rotation)
- [ ] FUNSD (PT/TF) (NOTE: mem leak with --rotation)
- [x] IC03 (PT/TF) #983
- [x] IC13 (PT/TF) validated by: @felixdittrich92
- [x] IIIT5K (PT/TF) validated by: @felixdittrich92
- [ ] IMGUR5K (PT/TF) (NOTE: mem leak with --rotation)
- [ ] SROIE (PT/TF) (NOTE: mem leak with --rotation)
- [x] SVHN (PT/TF) validated by: @felixdittrich92
- [x] SVT (PT/TF) #955
- [ ] SynthText (PT/TF) (NOTE: mem leak with --rotation)
recognition:
- [x] MJSynth (PT/TF) #956
- [x] CORD (PT/TF) #983
- [x] FUNSD (PT/TF) #983
- [x] IC03 (PT/TF) #983
- [x] IC13 (PT/TF) validated by: @felixdittrich92
- [x] IIIT5K (PT/TF) validated by: @felixdittrich92
- [ ] IMGUR5K (PT/TF) (NOTE: same as SynthText)
- [x] SROIE (PT/TF) #983 (NOTE: ~99% of labels contain whitespace, so excluding those samples is not possible)
- [x] SVHN (PT/TF) #987
- [x] SVT (PT/TF) #955
- [ ] SynthText (PT/TF) (NOTE: memory leak -> needs another solution instead of pickle)
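For #954, a minimal sketch of what the unification could look like, assuming the legacy recognition target is a `{"labels": ["text"]}` dict with exactly one entry (the helper name and dict layout here are illustrative, not the actual docTR API):

```python
from typing import Any, Dict, Union

def unify_recognition_target(target: Union[str, Dict[str, Any]]) -> str:
    """Hypothetical helper: collapse a legacy {"labels": ["text"]} target
    into the plain string the recognition scripts expect."""
    if isinstance(target, str):  # already unified
        return target
    labels = target["labels"]
    if len(labels) != 1:
        raise ValueError(f"expected exactly one label per crop, got {len(labels)}")
    return labels[0]

assert unify_recognition_target({"labels": ["hello"]}) == "hello"
assert unify_recognition_target("hello") == "hello"
```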
@frgfm (only FYI) If we use polygons there is a memory leak (it doesn't seem to come directly from the datasets); for detection, without use_polygons all datasets work fine.
- [x] profile where it comes from
@felixdittrich92 oh :/ On a specific dataset? during training?
@frgfm Unfortunately no, it affects all detection datasets used with use_polygons=True.
I only noticed it later because some datasets aren't that big, and I hadn't tracked this before :/
It shows up while testing with the eval_detection scripts in both frameworks; the whole dataset init is fine, so it seems to come from one of the transforms.
Alright, let's get to the bottom of this :+1:
https://github.com/bloomberg/memray is super useful to find the mem leak
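(For reference, a memray session could look like `python -m memray run -o output.bin <eval_script>` followed by `python -m memray flamegraph output.bin`; the script path is a placeholder.)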
@frgfm the mem leak is inside LocalizationConfusion, in this call: `iou_mat = polygon_iou(gts, preds, self.mask_shape, self.use_broadcasting)`
OK, as a summary, to identify the mem leak:
- datasets with rotation are fine
- the problem is LocalizationConfusion
- with broadcasting (the default), it would kill any normal machine
- without broadcasting, it is ultra slow
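Back-of-the-envelope numbers for why (the polygon counts and mask_shape below are made up for illustration; the assumption is that the broadcasting path materializes a (N, M, H, W) boolean intermediate inside mask_iou):

```python
import numpy as np

N, M = 500, 500    # hypothetical number of GT / predicted polygons
H, W = 1024, 1024  # hypothetical mask_shape

bool_size = np.dtype(bool).itemsize       # 1 byte per mask pixel
per_stack = N * H * W * bool_size         # masks_1: (N, H, W)
broadcast = N * M * H * W * bool_size     # logical_and/or: (N, M, H, W)

print(f"one mask stack:         {per_stack / 1e9:.2f} GB")  # ~0.52 GB
print(f"broadcast intermediate: {broadcast / 1e9:.0f} GB")  # ~262 GB
```

The per-pair fallback avoids the 4D intermediate but re-rasterizes each polygon of polys_2 once per polygon of polys_1, which explains the "ultra slow" observation.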
```python
from typing import Tuple

import numpy as np

# NB: rbox_to_mask / _rbox_to_mask / mask_iou are helpers from the same
# module (doctr.utils.metrics); they rasterize polygons onto boolean masks.

def polygon_iou(
    polys_1: np.ndarray, polys_2: np.ndarray, mask_shape: Tuple[int, int], use_broadcasting: bool = False
) -> np.ndarray:
    """Computes the IoU between two sets of rotated bounding boxes

    Args:
        polys_1: rotated bounding boxes of shape (N, 4, 2)
        polys_2: rotated bounding boxes of shape (M, 4, 2)
        mask_shape: spatial shape of the intermediate masks
        use_broadcasting: if set to True, leverage broadcasting speedup by consuming more memory

    Returns:
        the IoU matrix of shape (N, M)
    """
    if polys_1.ndim != 3 or polys_2.ndim != 3:
        raise AssertionError("expects boxes to be in format (N, 4, 2)")

    iou_mat: np.ndarray = np.zeros((polys_1.shape[0], polys_2.shape[0]), dtype=np.float32)

    if polys_1.shape[0] > 0 and polys_2.shape[0] > 0:
        if use_broadcasting:
            # Rasterize all polygons at once into two (K, H, W) mask stacks,
            # then compare them with a broadcasted (N, M, H, W) op in mask_iou
            masks_1 = rbox_to_mask(polys_1, shape=mask_shape)
            masks_2 = rbox_to_mask(polys_2, shape=mask_shape)
            iou_mat = mask_iou(masks_1, masks_2)
        else:
            # Save memory by doing the computation for each pair
            for idx, b1 in enumerate(polys_1):
                m1 = _rbox_to_mask(b1, mask_shape)
                for _idx, b2 in enumerate(polys_2):
                    m2 = _rbox_to_mask(b2, mask_shape)
                    iou_mat[idx, _idx] = np.logical_and(m1, m2).sum() / np.logical_or(m1, m2).sum()

    return iou_mat
```
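A possible middle ground, sketched under the assumption that rbox_to_mask returns a (K, H, W) boolean stack (this is just an illustration, not the fix that was merged): rasterize each set once, then compare one GT mask against all prediction masks at a time, so the largest intermediate is (M, H, W) instead of (N, M, H, W).

```python
def polygon_iou_rowwise(
    polys_1: np.ndarray, polys_2: np.ndarray, mask_shape: Tuple[int, int]
) -> np.ndarray:
    """Hypothetical row-wise variant: every polygon is rasterized exactly
    once, and peak intermediate memory is O(M * H * W)."""
    iou_mat = np.zeros((polys_1.shape[0], polys_2.shape[0]), dtype=np.float32)
    if polys_1.shape[0] > 0 and polys_2.shape[0] > 0:
        masks_1 = rbox_to_mask(polys_1, shape=mask_shape).astype(bool)
        masks_2 = rbox_to_mask(polys_2, shape=mask_shape).astype(bool)
        for idx, m1 in enumerate(masks_1):
            # Compare one (H, W) mask against the whole (M, H, W) stack
            inter = np.logical_and(m1[None], masks_2).sum(axis=(1, 2))
            union = np.logical_or(m1[None], masks_2).sum(axis=(1, 2))
            iou_mat[idx] = inter / np.maximum(union, 1)  # guard empty masks
    return iou_mat
```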