Request for text deduplication feature
Feature request
It would be great to have support for high-performance, highly scalable text deduplication algorithms as part of the `datasets` library.
Motivation
Motivated by this blog post https://huggingface.co/blog/dedup and this library https://github.com/google-research/deduplicate-text-datasets, but slightly frustrated by how hard these tools are to work with, I am proposing this feature.
Your contribution
I would be happy to contribute to the development of this feature and would love to collaborate with others on it.
The "exact match" deduplication will be possible when we resolve https://github.com/huggingface/datasets/issues/2514 (first, https://github.com/apache/arrow/issues/30950 needs to be addressed on the Arrow side). In the meantime, you can use Polars or DuckDB (e.g., via datasets-sql).
Fuzzy deduplication is out-of-scope for now (splink is probably the best tool for it).
This library can be an intermediate solution: https://github.com/ChenghaoMou/text-dedup/tree/main
I have been using Polars to remove duplicates, but it would be nice to do it directly in pyarrow.
For example (sketched below):
- Read the dataset with pyarrow
- Use `scan_pyarrow_dataset()` with Polars to create a LazyFrame
- Use `sort` and `unique` to remove duplicates based on a subset of columns
- Convert to a table and save the data with `ds.write_dataset()`
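A minimal sketch of that workflow, assuming Parquet data and hypothetical `id`/`timestamp` columns (the paths are placeholders too):

```python
import polars as pl
import pyarrow.dataset as ds

dataset = ds.dataset("data/dataset_a/", format="parquet")

# Lazily scan the pyarrow dataset with Polars.
lf = pl.scan_pyarrow_dataset(dataset)

deduped = (
    lf.sort("timestamp")  # make "first" deterministic
      .unique(subset=["id"], keep="first", maintain_order=True)
      .collect()
)

ds.write_dataset(deduped.to_arrow(), "data/dataset_b/", format="parquet")
```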
There are times when that workflow makes perfect sense, because I do additional transformations with Polars. Most of the time, though, I am simply reading dataset A and writing dataset B without duplicates, and I wish I could use a pyarrow scanner or table directly.
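Until pyarrow grows a built-in, here is a hedged workaround that stays entirely in pyarrow (assuming pyarrow >= 7.0 for `Table.group_by`/`Table.sort_by`; the `id` key column in the usage comment is hypothetical): tag each row with its position, keep the smallest position per key group, then `take` those rows.

```python
import pyarrow as pa

def drop_duplicates(table: pa.Table, keys: list[str]) -> pa.Table:
    # Tag each row with its original position.
    table = table.append_column("row_idx", pa.array(range(table.num_rows)))
    # min(row_idx) per key group is the first occurrence of that key;
    # aggregate output columns are named "<column>_<function>".
    first = table.group_by(keys).aggregate([("row_idx", "min")])
    deduped = table.take(first["row_idx_min"])
    # Restore the original row order and drop the helper column.
    return deduped.sort_by("row_idx").drop(["row_idx"])

# e.g. deduped = drop_duplicates(tbl, keys=["id"])
```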
Hi, see this new release from HF: datatrove (https://github.com/huggingface/datatrove). DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt, commonly used processing blocks along with a framework to easily add custom functionality.