Citing this resource
Hello, we are using this resource to filter pretraining data for our current project, and we would love to know if and how it should be cited. Thanks :)
Hi Yuval, there is no paper describing this repository (yet... it's in progress). In the meantime, you can refer to the official website presenting the project (https://bigscience.huggingface.co/) or to the communication workshop that will be held at ACL very soon: https://bigscience.huggingface.co/acl-2022
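If it helps, a BibTeX entry along these lines could serve as a stopgap until the paper is out; note that the entry key, title wording, and year are placeholders we sketched from the website and the ACL 2022 workshop page, not an official citation:

```bibtex
% Placeholder entry, not an official citation.
% Key, title, and year are assumptions based on the project website.
@misc{bigscience_workshop_2022,
  title        = {BigScience: A Collaborative Research Workshop on Large Multilingual Models and Datasets},
  author       = {{BigScience Workshop}},
  year         = {2022},
  howpublished = {\url{https://bigscience.huggingface.co/}},
  note         = {Official project website; see also \url{https://bigscience.huggingface.co/acl-2022}}
}
```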
BTW, we are working on tidying up the tools provided in the project and would be interested to know more about how you used them. Any chance to get some extra details on your project?
Sure, we are pretraining transformer encoder-decoder models on large corpora (the Pile, Wikipedia, and RealNews), and we used the modification and filtering tools to clean up the data (English only, not multilingual).
@yuvalkirstain, we are planning to get a Zenodo DOI for this GitHub repository.