
Data Processing

Open ncoop57 opened this issue 3 years ago • 6 comments

We should follow a process similar to the BigScience workshop's dataset processing. Their pipeline includes many tools ready for us to use, such as data deduplication (both exact-match and near-dedup), filtering of low-information-content examples, removal of potentially hateful documents, and removal of PII.

They have all their tools available and discussions of them here: https://github.com/bigscience-workshop/data_tooling

Here is an initial set of tasks to perform:

  • [ ] Filtering of low quality documents
  • [ ] Filtering of documents with specific removal words
  • [ ] Filtering of exact duplicate content (see the sketch after this list)
  • [ ] Filtering of near duplicate content
  • [ ] Removal of PII
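
A minimal sketch of the exact-duplicate step (hash-based, over a plain list of document strings; not the final pipeline):

```python
import hashlib

def exact_dedup(documents):
    """Drop exact duplicates by hashing whitespace-normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        # Normalize whitespace so trivially reformatted copies still collide
        key = hashlib.md5(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```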

ncoop57 · Oct 09 '22 14:10

New repo: https://github.com/bigscience-workshop/data-preparation/tree/main/preprocessing/training

PhungVanDuy · Oct 10 '22 02:10

@ncoop57
Filtering: https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/filtering.py
Deduplication: https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/deduplicate/self_deduplicate.py
PII: https://github.com/bigscience-workshop/data-preparation/blob/main/preprocessing/training/01b_oscar_cleaning_and_filtering/anonymization.py

Everyone can use these for filtering and deduplication; I am writing a lightweight version of them. My dataset is fairly clean in its original form, so some of the functions are unnecessary for it, but other people may find useful functions there.
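
For a concrete picture of what such a light version might keep from the filtering script, here is a sketch of a flagged-word filter (the word list and tolerated ratio are illustrative assumptions, not BigScience's values):

```python
FLAGGED_WORDS = {"badword1", "badword2"}  # placeholder; BigScience ships per-language lists
MAX_FLAGGED_RATIO = 0.01                  # assumed tolerance, tune per dataset

def flagged_word_ratio(text):
    """Fraction of tokens that appear in the flagged-word list."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in FLAGGED_WORDS for w in words) / len(words)

def passes_flagged_filter(text):
    return flagged_word_ratio(text) <= MAX_FLAGGED_RATIO
```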

PhungVanDuy · Oct 10 '22 16:10

@ncoop57 I created a simple script that adapts the code from the sources above for the Wikibooks dataset; it just needs the parquet file, then runs filtering and dedup. I pushed it here: https://github.com/PhungVanDuy/Code-Pile/tree/books_dataset/codepile/enwikibooks/data_process_pipeline.

Please review the thresholds if you want to use it for your dataset.
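
In practice the knobs to review boil down to a few constants, along the lines of the following (the values here are illustrative assumptions, not the ones in the branch):

```python
# Hypothetical threshold block to adapt per dataset; the real values
# live in the data_process_pipeline script linked above.
THRESHOLDS = {
    "min_words": 20,            # drop documents shorter than this
    "max_flagged_ratio": 0.01,  # fraction of flagged words tolerated
    "near_dup_jaccard": 0.85,   # similarity above which docs count as dups
}
```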

PhungVanDuy · Oct 11 '22 19:10

I propose a workflow like this: convert the data to a HuggingFace Dataset object, and have our pipeline operate on that format. After conversion, the steps are: remove small documents/code (by word count), remove documents containing flagged words, remove PII, and deduplicate (near-deduplicate) both code and documents. I just found that BigCode also has implementations for PII and dedup: https://github.com/bigcode-project/bigcode-analysis/tree/main/data_analysis
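
A compact sketch of that workflow on a HuggingFace Dataset (the column name, word-count cutoff, flagged-word list, and PII regex are all illustrative assumptions):

```python
import re
from datasets import Dataset

ds = Dataset.from_dict({"text": ["some document ...", "another ..."]})  # stand-in data

FLAGGED = {"badword1", "badword2"}                  # assumed flagged-word list
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # toy PII pattern (emails only)

def keep(example):
    words = example["text"].lower().split()
    # Drop small documents and documents containing flagged words
    return len(words) >= 20 and not FLAGGED.intersection(words)

def redact(example):
    # Replace email-like strings with a placeholder instead of dropping the doc
    example["text"] = EMAIL_RE.sub("<EMAIL>", example["text"])
    return example

ds = ds.filter(keep).map(redact)
```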

PhungVanDuy · Oct 16 '22 16:10

We will not dedup the following (for the sources we do dedup, see the near-dedup sketch after this list):

  1. The Stack - since dedup has already been run on it
  2. GitHub Diffs - since dedup would remove too many instances due to their short length and high overlap
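
A minimal near-duplicate sketch with MinHash LSH (using the `datasketch` library; the shingle size and similarity threshold are assumptions to tune):

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    """Build a MinHash signature over character 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 5] for i in range(len(text) - 4)}:
        m.update(shingle.encode("utf-8"))
    return m

# Assumed Jaccard threshold; tune against a labeled sample
lsh = MinHashLSH(threshold=0.85, num_perm=128)

docs = {"doc1": "first document ...", "doc2": "first document . .."}  # stand-in data
kept = []
for key, text in docs.items():
    sig = minhash(text)
    if not lsh.query(sig):  # no sufficiently similar document seen yet
        lsh.insert(key, sig)
        kept.append(key)
```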

ncoop57 · Nov 01 '22 15:11

We will use this lib: https://github.com/CarperAI/squeakily to manage the different filtering and cleaning steps on a per-dataset basis. There will be a global set of filters and cleaners applied to every dataset (such as flagged-word removal) and a local set of filters and cleaners specific to each data source.
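
Sketched from the usage pattern in squeakily's README (the specific filter/cleaner names and datasource keys follow that README at the time of writing and may change between versions):

```python
from datasets import load_dataset
from squeakily.core import Pipeline
from squeakily.filter import check_char_repetition, check_flagged_words
from squeakily.clean import remove_empty_lines, normalize_whitespace

# Global filters/cleaners would be shared across datasources;
# local ones would be appended per data source.
ds = load_dataset("wikitext", "wikitext-103-v1", split="train[:1%]")
datasources = [
    {
        "dataset": ds,
        "columns": ["text"],
        "filters": [check_char_repetition, check_flagged_words],
        "cleaners": [remove_empty_lines, normalize_whitespace],
    },
]
pipeline = Pipeline(datasources)
pipeline.run()
```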

ncoop57 · Nov 07 '22 19:11