Data leakage issue between RSICD and RSITMD

Open YiguoHe opened this issue 1 year ago • 3 comments

The RSITMD and RSICD datasets have a data leakage issue: they may share some images and descriptions. How should this be handled properly?

YiguoHe, Apr 02 '24 08:04

To check for duplicates across the two datasets, you can compute a hash value for each image and measure the distance between the hashes of two images. If the distance is less than a chosen threshold, the pair is treated as a duplicate. It is recommended to manually inspect the image pairs your code flags as duplicates, so that you do not filter out images that are not actually duplicates.
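As a rough illustration of the hash-distance idea, here is a minimal sketch using a simple average hash and Hamming distance. It is not code from the RemoteCLIP repository. The function names, the default threshold of 5, and the assumption that each image has already been loaded, converted to grayscale, and downscaled to a small grid (for example 8x8) with an image library of your choice are all hypothetical choices made for this example.

```python
from itertools import product


def average_hash(pixels):
    """Hash a small grayscale grid: one bit per pixel, set when the pixel is above the mean.

    `pixels` is a 2D list of grayscale values. Producing that grid (loading,
    grayscale conversion, downscaling) is assumed to happen elsewhere.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bit positions where the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def find_candidate_duplicates(hashes_a, hashes_b, threshold=5):
    """Compare every image of dataset A against every image of dataset B.

    `hashes_a` and `hashes_b` map file names to hashes. Pairs whose distance
    is less than `threshold` (an assumed value, tune it on your data) are
    returned sorted by distance, so they can be reviewed by hand before
    anything is removed.
    """
    candidates = []
    for (name_a, h_a), (name_b, h_b) in product(hashes_a.items(), hashes_b.items()):
        distance = hamming_distance(h_a, h_b)
        if distance < threshold:
            candidates.append((name_a, name_b, distance))
    return sorted(candidates, key=lambda item: item[2])
```

The function returns candidate pairs rather than deleting anything, which leaves room for the manual check recommended above. The all-pairs comparison is quadratic in the number of images, so for larger collections you may want to bucket hashes first.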

gzqy1026, Apr 12 '24 11:04

Thank you for your response. Your work is excellent. Best wishes!

YiguoHe, May 22 '24 16:05