
[Error] Duplicated dataset with missing download link

Open AMR-KELEG opened this issue 2 years ago • 3 comments

Describe the dataset error

Hi,

I was checking datasets on the great Masader site and found that two datasets are exact duplicates. Unfortunately, the download link on the provided site is also unavailable. I am mainly interested in discussing ideas for automatically detecting duplicated entries on Masader. Thanks for taking the time to read my suggestion and review this issue!

Additional context

  • https://arbml.github.io/masader/card?id=25
  • https://arbml.github.io/masader/card?id=132

AMR-KELEG commented on Sep 14 '23 12:09

Thank you @AMR-KELEG for the report. I removed the duplicate, and the site should be updated soon. In the past, we ran a deduplication pass using embeddings, which fixed a lot of the duplicates. Let me know if you have other ideas. All the metadata is accessible on HuggingFace: https://huggingface.co/datasets/arbml/masader.
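
For readers following the thread, here is a minimal sketch of how similarity-based duplicate detection can work. It is not necessarily the method used for Masader: character trigram counts stand in for real embeddings (a sentence-embedding model would replace `embed`), the 0.9 threshold is an arbitrary assumption, and the entry IDs and descriptions are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text, n=3):
    """Toy stand-in for an embedding: character n-gram counts of the normalized text."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_duplicates(entries, threshold=0.9):
    """Return (id_a, id_b, similarity) for every pair at or above the threshold."""
    vectors = {entry_id: embed(text) for entry_id, text in entries.items()}
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(vectors.items(), 2):
        score = cosine(vec_a, vec_b)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs


# Hypothetical entries: these IDs and descriptions are made up for illustration.
entries = {
    "a": "Arabic Speech Corpus for speech synthesis",
    "b": "Arabic  speech corpus for Speech Synthesis",
    "c": "Tweets annotated for sentiment analysis",
}
duplicates = find_duplicates(entries)
```

Here `duplicates` flags the pair `("a", "b")`, which differ only in case and spacing. The pairwise loop is quadratic in the number of entries, which is fine for a catalogue of a few hundred datasets; flagged pairs would still need manual review before removal.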

zaidalyafeai commented on Sep 15 '23 09:09

Thanks @zaidalyafeai. The current method you use sounds reasonable, and I do not think I have a better idea. On another note, do you think we could have a way to report datasets that are no longer accessible?

AMR-KELEG commented on Sep 15 '23 19:09

The status of datasets changes a lot, so it is difficult to keep track. We have a report feature that can be used for this.

zaidalyafeai commented on Sep 15 '23 21:09