Request for text deduplication feature

Open SupreethRao99 opened this issue 2 years ago • 4 comments

Feature request

It would be great if the datasets library supported high-performance, highly scalable text deduplication algorithms.

Motivation

I am proposing this feature because I was motivated by this blog post https://huggingface.co/blog/dedup and this library https://github.com/google-research/deduplicate-text-datasets, but slightly frustrated that these tools are not very easy to work with.

Your contribution

I would be happy to contribute to the development of this feature and would love to collaborate with others on it.

SupreethRao99 • May 20 '23 01:05

The "exact match" deduplication will be possible when we resolve https://github.com/huggingface/datasets/issues/2514 (first, https://github.com/apache/arrow/issues/30950 needs to be addressed on the Arrow side). In the meantime, you can use Polars or DuckDB (e.g., via datasets-sql).

Fuzzy deduplication is out of scope for now (splink is probably the best tool for it).

mariosasko • May 31 '23 14:05

This library can serve as an intermediate solution: https://github.com/ChenghaoMou/text-dedup/tree/main

MaveriQ • Jun 01 '23 20:06

I have been using Polars to remove duplicates, but it would be nice to do it directly in pyarrow.

For example,

  1. Read dataset with pyarrow
  2. Use scan_pyarrow_dataset() with Polars to create a LazyFrame
  3. Use sort and unique to remove duplicates based on a subset of columns
  4. Convert to table and save data with ds.write_dataset()

There are times when that workflow makes perfect sense because I do additional transformations with Polars. Most of the time, though, I am just reading dataset A and writing dataset B without duplicates, and I wish I could use a pyarrow scanner or table directly.

ldacey • Jul 26 '23 21:07

Hi, see this new release from HF: datatrove. DataTrove is a library to process, filter, and deduplicate text data at a very large scale. It provides a set of prebuilt, commonly used processing blocks with a framework to easily add custom functionality.

Manel-Hik • Jan 25 '24 14:01