RedPajama-Data
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Can you mention which languages are covered in this dataset? According to arXiv:2302.13971v1, LLaMA only covers these languages: bg, ca, cs, da, de, en, es, fr, hr,...
If you test download.py in the wikipedia folder, it shows an error: ``` "name": "FileNotFoundError", "message": "Unable to resolve any data file that matches '['**']' at /storage/store/work/lgrinszt/memorization/the_pile with any supported...
This should fix issue #14: rename wikipedia to wiki, add data_dir as an argument, and make minor readability improvements.
Can you share a link to a guide on how to use this model?
runnning -> running
First of all: thank you very much for your contribution! That said, I still have a question: in order to really "democratise" AI, a trained model will be needed that...
A question mark (?) matches the expression to its left 0 (zero) or 1 (one) times. Appending it after a quantifier won't add any value to the match or group....
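To illustrate the quantifier described in that issue, here is a minimal sketch using Python's standard `re` module. The patterns are hypothetical and chosen only for illustration; the regex the issue actually refers to is not shown in the text above. Note that in Python's `re`, a `?` placed after another quantifier makes that quantifier lazy, so whether it is redundant depends on the pattern and on how it is used.

```python
import re

# Hypothetical pattern for illustration: the "u" is optional (0 or 1 times).
OPTIONAL_U = re.compile(r"colou?r")


def matches(word: str) -> bool:
    """Return True if the whole word matches the pattern."""
    return OPTIONAL_U.fullmatch(word) is not None


# Hypothetical greedy and lazy variants of the same quantifier.
GREEDY = re.compile(r"a+")   # consumes as many "a" as possible
LAZY = re.compile(r"a+?")    # "?" after "+" makes it consume as few as possible

print(matches("color"), matches("colour"), matches("colouur"))
print(GREEDY.search("aaa").group(), LAZY.search("aaa").group())
```

With `search`, the greedy and lazy forms return different substrings, while with `fullmatch` both accept the same whole strings.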
I followed the example at https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc/cc_net#pipeline-overview but got the following error. Did I forget to run any preparation? ``` (racoon) t@medu:~/repos/NAM/red-pajama/data_prep/cc/cc_net$ python -m cc_net -l my -l gu usage: __main__.py [-h] [-c CONFIG_NAME]...
When exploring the RedPajama dataset, I found that you selected the following five Common Crawl dumps: common_crawl/2023-06, common_crawl/2020-05, common_crawl/2021-04, common_crawl/2022-05, common_crawl/2019-30. What are the criteria for selection?...
First of all: thank you very much for your contribution! I would be very grateful if you could share the pretrained FastText model used to classify whether each CommonCrawl webpage is low quality...