Feature request: RVL_CDIP and DocLayNet

Open Jordy-VL opened this issue 2 years ago • 4 comments

I would like to use your tool to investigate data noise in https://huggingface.co/datasets/aharley/rvl_cdip and https://ds4sd.github.io/icdar23-doclaynet/

It is already known in the literature that RVL_CDIP contains plenty of noise, yet your tool could provide more quantitative insight.

Jordy-VL • Jun 20 '23 08:06

One practical issue is that RVL_CDIP contains 400K images, and its annotations would need to be converted to COCO format. It would be a great contribution to the document AI community if you could showcase this dataset's quality issues with your tool ;)

Jordy-VL • Jun 20 '23 08:06

Hi @Jordy-VL thank you for the comment. We will add this to our roadmap. In the meantime, you can also try it out yourself using our no-code platform here for free.

Or, if you're feeling adventurous enough to run some code, try fastdup.

dnth • Jun 20 '23 13:06

Hi @dnth!

I just wanted to let you know that I was able to run fastdup on RVL-CDIP with the following results:

2023-06-22 11:56:43 [INFO] Found a total of 35106 fully identical images (d>0.990), which are 4.39 %
2023-06-22 11:56:43 [INFO] Found a total of 188747 nearly identical images(d>0.980), which are 23.59 %
2023-06-22 11:56:43 [INFO] Found a total of 769216 above threshold images (d>0.900), which are 96.15 %
2023-06-22 11:56:43 [INFO] Found a total of 40079 outlier images         (d<0.050), which are 5.01 %
2023-06-22 11:56:43 [INFO] Min distance found 0.684 max distance 1.000
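
As a quick sanity check on the numbers above, here is a minimal sketch that parses the summary lines and back-computes the total that each percentage implies. It is not part of fastdup: `LOG`, `SUMMARY_RE`, and `parse_fastdup_summary` are hypothetical names made up for this illustration, and the regular expression only assumes the line layout shown in the log above. The log does not state what the denominator is, so the implied total is an estimate.

```python
import re

# The four summary lines copied verbatim from the log above.
LOG = """\
2023-06-22 11:56:43 [INFO] Found a total of 35106 fully identical images (d>0.990), which are 4.39 %
2023-06-22 11:56:43 [INFO] Found a total of 188747 nearly identical images(d>0.980), which are 23.59 %
2023-06-22 11:56:43 [INFO] Found a total of 769216 above threshold images (d>0.900), which are 96.15 %
2023-06-22 11:56:43 [INFO] Found a total of 40079 outlier images         (d<0.050), which are 5.01 %
"""

# Hypothetical pattern: count, label, comparison operator, threshold, percentage.
SUMMARY_RE = re.compile(
    r"Found a total of (\d+) (.+?)\s*\(d([<>])([\d.]+)\), which are ([\d.]+) %"
)


def parse_fastdup_summary(log: str) -> list[dict]:
    """Extract one record per summary line and estimate the implied total."""
    rows = []
    for line in log.splitlines():
        match = SUMMARY_RE.search(line)
        if match is None:
            continue  # skip lines that are not count/percentage summaries
        count = int(match.group(1))
        percent = float(match.group(5))
        rows.append(
            {
                "label": match.group(2).strip(),
                "threshold": f"d{match.group(3)}{match.group(4)}",
                "count": count,
                "percent": percent,
                # count / (percent / 100) recovers the denominator, up to
                # the rounding of the printed percentage.
                "implied_total": round(count / (percent / 100)),
            }
        )
    return rows


if __name__ == "__main__":
    for row in parse_fastdup_summary(LOG):
        print(f"{row['label']:<26} {row['threshold']:<8} "
              f"{row['count']:>7} {row['percent']:>6.2f} %  "
              f"implied total ~{row['implied_total']}")
```

For these four lines the implied total comes out close to 800K each time, which is about twice the 400K images mentioned earlier in this thread. The log does not explain the difference, so the percentages are best read with that caveat in mind.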

Sharing the analysis htmls here: analysis

I do believe this shows the usefulness of your tools on this dataset, though the results still require further visual inspection with the visual-layer tool :)

Jordy-VL • Jul 03 '23 13:07

Hello @Jordy-VL! It's mind-blowing how many duplicates are in the dataset! I think this would be very helpful to the community that works with this dataset. Thank you for sharing it :)

dnth • Jul 04 '23 01:07