
Data Processing

Open ncoop57 opened this issue 3 years ago • 6 comments

We should follow a process similar to the BigScience workshop's dataset processing. Their pipeline includes many tools ready for us to use, such as data deduplication (both exact-match and near-dedup), filtering of low-information-content examples, removal of potentially hateful documents, and removal of PII.

They have all their tools available and discussions of them here: https://github.com/bigscience-workshop/data_tooling

Here is an initial set of tasks to perform:

  • [ ] Filtering of low quality documents
  • [ ] Filtering of documents with specific removal words
  • [ ] Filtering of exact duplicate content (see the sketch after this list)
  • [ ] Filtering of near duplicate content
  • [ ] Removal of PII
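
A minimal sketch of the exact-duplicate step (hash-based, over a plain list of document strings; not the final pipeline):

```python
import hashlib

def exact_dedup(documents):
    """Drop exact duplicates by hashing whitespace-normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        # Normalize whitespace so trivially reformatted copies still collide
        key = hashlib.md5(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```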

ncoop57 · Oct 09 '22 14:10

New repo: https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training

PhungVanDuy · Oct 10 '22 02:10

@ncoop57
Filtering: https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/filtering.py
Deduplication: https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/deduplicate/self_deduplicate.py
PII: https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/anonymization.py

Everyone can use these for filtering and deduplication; I am writing a lightweight version of them. My dataset is fairly clean in its original form, so some of the functions are unnecessary for it, but other people may find useful functions there.
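
For a concrete picture of what such a light version might keep from the filtering script, here is a sketch of a flagged-word filter (the word list and tolerated ratio are illustrative assumptions, not BigScience's values):

```python
FLAGGED_WORDS = {"badword1", "badword2"}  # placeholder; BigScience ships per-language lists
MAX_FLAGGED_RATIO = 0.01                  # assumed tolerance, tune per dataset

def flagged_word_ratio(text):
    """Fraction of tokens that appear in the flagged-word list."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in FLAGGED_WORDS for w in words) / len(words)

def passes_flagged_filter(text):
    return flagged_word_ratio(text) <= MAX_FLAGGED_RATIO
```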

PhungVanDuy · Oct 10 '22 16:10

@ncoop57 I created a simple script that adapts the code from the sources above for the Wikibooks dataset; it just needs the parquet file, then runs filtering and dedup. I pushed it here: https://github.com/PhungVanDuy/Code-Pile/tree/books_dataset/codepile/enwikibooks/data_process_pipeline.

Please review the thresholds if you want to use it for your dataset.
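
In practice the knobs to review boil down to a few constants, along the lines of the following (the values here are illustrative assumptions, not the ones in the branch):

```python
# Hypothetical threshold block to adapt per dataset; the real values
# live in the data_process_pipeline script linked above.
THRESHOLDS = {
    "min_words": 20,            # drop documents shorter than this
    "max_flagged_ratio": 0.01,  # fraction of flagged words tolerated
    "near_dup_jaccard": 0.85,   # similarity above which docs count as dups
}
```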

PhungVanDuy · Oct 11 '22 19:10

I propose a workflow like this: convert the data to a HuggingFace Dataset object, and have our pipeline operate on that format. After conversion, the steps are: remove small documents/code (by word count), remove documents containing flagged words, remove PII, and deduplicate (near-deduplicate) both code and documents. I just found that BigCode also has implementations for PII and dedup: https://github.com/bigcode-project/bigcode-analysis/tree/main/data_analysis
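
A compact sketch of that workflow on a HuggingFace Dataset (the column name, word-count cutoff, flagged-word list, and PII regex are all illustrative assumptions):

```python
import re
from datasets import Dataset

ds = Dataset.from_dict({"text": ["some document ...", "another ..."]})  # stand-in data

FLAGGED = {"badword1", "badword2"}                  # assumed flagged-word list
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # toy PII pattern (emails only)

def keep(example):
    words = example["text"].lower().split()
    # Drop small documents and documents containing flagged words
    return len(words) >= 20 and not FLAGGED.intersection(words)

def redact(example):
    # Replace email-like strings with a placeholder instead of dropping the doc
    example["text"] = EMAIL_RE.sub("<EMAIL>", example["text"])
    return example

ds = ds.filter(keep).map(redact)
```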

PhungVanDuy · Oct 16 '22 16:10

We will not dedup the following (for the sources we do dedup, see the near-dedup sketch after this list):

  1. The Stack - since dedup has already been run on it
  2. GitHub Diffs - since dedup would remove too many instances due to their short length and high overlap
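
A minimal near-duplicate sketch with MinHash LSH (using the `datasketch` library; the shingle size and similarity threshold are assumptions to tune):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    """Build a MinHash signature over character 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:
        m.update(shingle.encode("utf-8"))
    return m

# Assumed Jaccard threshold; tune against a labeled sample
lsh = MinHashLSH(threshold=0.85, num_perm=128)

docs = {"doc1": "first document ...", "doc2": "first document . .."}  # stand-in data
kept = []
for key, text in docs.items():
    sig = minhash(text)
    if not lsh.query(sig):  # no sufficiently similar document seen yet
        lsh.insert(key, sig)
        kept.append(key)
```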

ncoop57 · Nov 01 '22 15:11

We will use this lib: https://github.com/CarperAI/squeakily to manage the different filtering and cleaning steps on a per-dataset basis. There will be a global set of filters and cleaners applied to every dataset (such as flagged-word removal) and a local set of filters and cleaners specific to each data source.
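
Sketched from the usage pattern in squeakily's README (the specific filter/cleaner names and datasource keys follow that README at the time of writing and may change between versions):

```python
from datasets import load_dataset
from squeakily.core import Pipeline
from squeakily.filter import check_char_repetition, check_flagged_words
from squeakily.clean import remove_empty_lines, normalize_whitespace

# Global filters/cleaners would be shared across datasources;
# local ones would be appended per data source.
ds = load_dataset("wikitext", "wikitext-103-v1", split="train[:1%]")
datasources = [
    {
        "dataset": ds,
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
    },
]
pipeline = Pipeline(datasources)
pipeline.run()
```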

ncoop57 · Nov 07 '22 19:11