Different results from unique() and difference of deduplicated set
Dear all,
I have a document set that contains a duplicate according to unique():
len(docset) -> 1014
len(docset.unique()) -> 1013
However,
len(docset - docset.unique()) -> 0
I found this when I wanted to output the title of the duplicate that is supposedly eliminated by unique(). I do not get any title, since the difference contains zero documents.
Best, Lars.
Hi! Thanks for the bug report.
I've given this some careful thought, and although this behavior might seem counterintuitive, it is indeed correct.
The - operator relies on "fuzzy" matching to determine which documents from the left-hand set should be excluded, based on the right-hand set. In the case you described, where there are two identical documents, docset - docset.unique() results in an empty set. This happens because the "fuzzy" matching finds a match in the right-hand set for both copies of the duplicated document (likely due to a matching DOI), so both are excluded.
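To make the effect concrete, here is a small toy sketch in plain Python. It is not the library's actual code: Doc, same_doc, unique and difference are hypothetical stand-ins, and the "fuzzy" match is reduced to a DOI comparison purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    title: str
    doi: str


def same_doc(a, b):
    # Assumption: stand-in for the "fuzzy" match, here simply "same DOI".
    return a.doi == b.doi


def unique(docs):
    # Keep the first document of every group of matching documents.
    kept = []
    for d in docs:
        if not any(same_doc(d, k) for k in kept):
            kept.append(d)
    return kept


def difference(left, right):
    # Drop every left-hand document that matches anything on the right.
    return [d for d in left if not any(same_doc(d, r) for r in right)]


docs = [
    Doc("Paper A", "10.1/a"),
    Doc("Paper B", "10.1/b"),
    Doc("Paper A (preprint)", "10.1/a"),  # matches "Paper A" by DOI
]
deduped = unique(docs)

# prints: 3 2 0
print(len(docs), len(deduped), len(difference(docs, deduped)))
```

Both copies of "Paper A" match the single copy that survives in the deduplicated list, so the difference removes both and nothing is left over.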
Nonetheless, I can see how it is odd that there is no way to retrieve which documents were removed by unique.
Would it work for you if we were to add a duplicates() method? This method would specifically return the documents that unique() discards, ensuring that len(docset) == len(docset.unique()) + len(docset.duplicates()).
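A rough sketch of what such a split could look like, again in plain Python and under the same assumptions as before: split_unique is a hypothetical helper, not an existing method, and matching is simplified to a DOI comparison. The point is only to show the length invariant holding.

```python
from collections import namedtuple

Doc = namedtuple("Doc", ["title", "doi"])


def split_unique(docs):
    # Hypothetical helper: returns (kept, dropped), where "kept" is what a
    # unique()-style pass would retain and "dropped" is what it discards.
    kept, dropped = [], []
    for d in docs:
        # Assumption: two documents match when their DOIs are equal.
        if any(d.doi == k.doi for k in kept):
            dropped.append(d)
        else:
            kept.append(d)
    return kept, dropped


docs = [
    Doc("Paper A", "10.1/a"),
    Doc("Paper B", "10.1/b"),
    Doc("Paper A (preprint)", "10.1/a"),
]
kept, dropped = split_unique(docs)

# Every input document ends up in exactly one of the two lists.
assert len(docs) == len(kept) + len(dropped)

for d in dropped:
    print(d.title)  # prints: Paper A (preprint)
```

Collecting the discarded documents during the deduplication pass itself avoids the second round of matching that makes the - operator return an empty set.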
Hi, yes, that would help. That was exactly the idea: I just wanted to see what had been identified as a duplicate. Best, Lars.