litstudy

Different results from unique() and difference of deduplicated set

Open larsgrobe opened this issue 2 years ago • 2 comments

Dear all,

I have a document set that contains a duplicate according to unique():

len(docset) -> 1014
len(docset.unique()) -> 1013

However:

len(docset - docset.unique()) -> 0

I found this when I wanted to print the title of the duplicate that unique() supposedly eliminates. I do not get any title, because the difference contains zero documents.

Best, Lars.

larsgrobe, Dec 08 '23 14:12

Hi! Thanks for the bug report.

I've given this some careful thought, and although this behavior might seem counter-intuitive, it is indeed correct.

The - operator relies on "fuzzy" matching to decide which documents from the left-hand set should be excluded, based on the right-hand set. In the case you described, where the set contains two identical documents, docset - docset.unique() results in an empty set. This happens because the "fuzzy" matching treats both copies as present in the right-hand set (likely due to a matching DOI), so both are excluded.
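To illustrate, here is a toy model in plain Python. It does not use litstudy: the helper names (same_doc, unique, difference) are made up for this example, and exact DOI equality stands in for the "fuzzy" matching.

```python
# Toy model only, NOT litstudy's implementation.
# Assumption: two documents "match" when their DOIs are equal.

def same_doc(a, b):
    return a["doi"] == b["doi"]

def unique(docs):
    # Keep the first document of every group of matching documents.
    kept = []
    for doc in docs:
        if not any(same_doc(doc, k) for k in kept):
            kept.append(doc)
    return kept

def difference(left, right):
    # Drop every left-hand document that matches anything on the right.
    return [d for d in left if not any(same_doc(d, r) for r in right)]

docs = [
    {"title": "Paper A", "doi": "10.1000/a"},
    {"title": "Paper B", "doi": "10.1000/b"},
    {"title": "Paper A (duplicate record)", "doi": "10.1000/a"},
]

print(len(docs))                            # 3
print(len(unique(docs)))                    # 2
print(len(difference(docs, unique(docs))))  # 0
```

The duplicate record matches the copy of "Paper A" that unique() kept, so the difference removes it as well and nothing is left.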

Nonetheless, I can see how it is odd that there is no way to retrieve which documents were removed by unique.

Would it work for you if we added a duplicates() method? It would return exactly the documents that unique() drops, so that len(docset) == len(docset.unique()) + len(docset.duplicates()).
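A minimal sketch of what such a method could compute, again in plain Python rather than litstudy. The names split_duplicates and same_doi are hypothetical, and exact DOI equality is an assumption standing in for the "fuzzy" matching.

```python
# Sketch only: hypothetical helpers, not part of litstudy.

def same_doi(a, b):
    # Assumption: documents match when their DOIs are equal.
    return a["doi"] == b["doi"]

def split_duplicates(docs, same_doc):
    # Return (kept, dupes): the first document of each matching group,
    # and every later document that matches one already kept.
    kept, dupes = [], []
    for doc in docs:
        if any(same_doc(doc, k) for k in kept):
            dupes.append(doc)
        else:
            kept.append(doc)
    return kept, dupes

docs = [
    {"title": "Paper A", "doi": "10.1000/a"},
    {"title": "Paper B", "doi": "10.1000/b"},
    {"title": "Paper A (duplicate record)", "doi": "10.1000/a"},
]

kept, dupes = split_duplicates(docs, same_doi)

# The invariant from the proposal holds by construction.
assert len(docs) == len(kept) + len(dupes)

print([d["title"] for d in dupes])  # ['Paper A (duplicate record)']
```

Because every document lands in exactly one of the two lists, the sizes always add up, and the titles of the removed documents can be printed directly.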

stijnh, Dec 12 '23 13:12

Hi, yes, that would help. That was exactly the idea: I just wanted to see what had been identified as a duplicate. Best, Lars.

larsgrobe, Dec 12 '23 19:12