Request for text deduplication feature
Feature request
It would be great to have support for high-performance, highly scalable text deduplication algorithms as part of the `datasets` library.
Motivation
Motivated by this blog post https://huggingface.co/blog/dedup and this library https://github.com/google-research/deduplicate-text-datasets, but slightly frustrated by how hard these tools are to work with, I am proposing this feature.
Your contribution
I would be happy to contribute to the development of this feature and would love to collaborate with others on it.
The "exact match" deduplication will be possible when we resolve https://github.com/huggingface/datasets/issues/2514 (first, https://github.com/apache/arrow/issues/30950 needs to be addressed on the Arrow side). In the meantime, you can use Polars or DuckDB (e.g., via datasets-sql).
Fuzzy deduplication is out-of-scope for now (splink is probably the best tool for it).
This library can be an intermediate solution: https://github.com/ChenghaoMou/text-dedup/tree/main
I have been using Polars to remove duplicates, but it would be nice to do it directly in pyarrow.
For example (sketched below):
- Read the dataset with pyarrow
- Use `scan_pyarrow_dataset()` with Polars to create a LazyFrame
- Use `sort` and `unique` to remove duplicates based on a subset of columns
- Convert to a table and save the data with `ds.write_dataset()`
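A minimal sketch of that workflow, assuming Parquet data and hypothetical `id`/`timestamp` columns (the paths are placeholders too):

```python
import polars as pl
import pyarrow.dataset as ds

dataset = ds.dataset("data/dataset_a/", format="parquet")

# Lazily scan the pyarrow dataset with Polars.
lf = pl.scan_pyarrow_dataset(dataset)

deduped = (
    lf.sort("timestamp")  # make "first" deterministic
      .unique(subset=["id"], keep="first", maintain_order=True)
      .collect()
)

ds.write_dataset(deduped.to_arrow(), "data/dataset_b/", format="parquet")
```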
There are times when that workflow makes perfect sense, because I do additional transformations with Polars. Most of the time, though, I am simply reading dataset A and writing dataset B without duplicates, and I wish I could use a pyarrow scanner or table directly.
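Until pyarrow grows a built-in, here is a hedged workaround that stays entirely in pyarrow (assuming pyarrow >= 7.0 for `Table.group_by`/`Table.sort_by`; the `id` key column in the usage comment is hypothetical): tag each row with its position, keep the smallest position per key group, then `take` those rows.

```python
import pyarrow as pa

def drop_duplicates(table: pa.Table, keys: list[str]) -> pa.Table:
    # Tag each row with its original position.
    table = table.append_column("row_idx", pa.array(range(table.num_rows)))
    # min(row_idx) per key group is the first occurrence of that key;
    # aggregate output columns are named "<column>_<function>".
    first = table.group_by(keys).aggregate([("row_idx", "min")])
    deduped = table.take(first["row_idx_min"])
    # Restore the original row order and drop the helper column.
    return deduped.sort_by("row_idx").drop(["row_idx"])

# e.g. deduped = drop_duplicates(tbl, keys=["id"])
```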
Hi, see this new release from HF: datatrove (https://github.com/huggingface/datatrove). DataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt, commonly used processing blocks along with a framework to easily add custom functionality.