data_tooling

Citing this resource

Open yuvalkirstain opened this issue 3 years ago • 4 comments

Hello, We are using this resource to filter pretraining data for our current project, and we would love to know if and how it should be cited. Thanks :)

yuvalkirstain, Apr 26 '22 17:04

Hi Yuval, there is no paper describing this repository (yet... it's in progress). In the meantime, you can refer to the official website presenting the project (https://bigscience.huggingface.co/) or to the communication workshop that will be held at ACL very soon: https://bigscience.huggingface.co/acl-2022

ggdupont, Apr 27 '22 07:04

BTW, we are working on tidying up the tools provided in the project and would be interested to know more about how you used them. Any chance of getting extra details on your project?

ggdupont, Apr 27 '22 07:04

Sure, we are pretraining transformer encoder-decoder models using large corpora (the Pile, Wikipedia, and RealNews), and used the modification and filtering tools to clean up the data (English only, not multi-lingual).

yuvalkirstain, Apr 27 '22 07:04

@yuvalkirstain, we are planning to get a Zenodo DOI for this GitHub repository.

albertvillanova, May 04 '22 06:05