RedPajama-Data issues

Languages

4

Can you mention which languages are covered in this dataset? Based on arXiv:2302.13971v1, LLaMA only covers these languages: bg, ca, cs, da, de, en, es, fr, hr,...

firqaaa

change wikipedia folder name

If you test download.py in the wikipedia folder, it shows an error: ``` "name": "FileNotFoundError", "message": "Unable to resolve any data file that matches '['**']' at /storage/store/work/lgrinszt/memorization/the_pile with any supported...

abdoelsayed2016

fixes issue #14 (wiki)

This should fix issue #14:
- rename wikipedia to wiki
- add data_dir as args
- minor readability improvements
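The pull request's code is not shown in this listing, so the following is only a minimal sketch of what exposing `data_dir` as a command-line argument could look like. The flag name `--data_dir`, the default path, and the `parse_args` helper are assumptions made for illustration, not the actual change.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    # Hypothetical CLI: flag name and default are assumptions, not the PR's code.
    parser = argparse.ArgumentParser(
        description="Download the wiki subset (illustrative sketch)."
    )
    parser.add_argument(
        "--data_dir",
        type=Path,
        default=Path("data/wiki"),
        help="Directory where downloaded files are written.",
    )
    return parser.parse_args(argv)


# Example: override the output directory instead of relying on a hard-coded path.
args = parse_args(["--data_dir", "/tmp/wiki"])
```

Passing the directory explicitly means the script does not have to rely on a folder name baked into the code.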

mauriceweber

Guide on how to use

1

Can you share a link to a guide on how to use this model?

kamalkech

Fix typo in github/README.md

runnning -> running

eltociear

Will there be a trained model?

First of all: thank you very much for your contribution! That said, I still have a question: in order to really "democratise" AI, a trained model will be needed that...

rozek

Corrected matching and grouping pattern

A question mark (?) matches the expression to its left 0 (zero) or 1 (one) times. Appending it after a quantifier won't add any value to the match or group....
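The patterns changed by this pull request are not shown in the listing, so the snippet below uses made-up patterns to illustrate the two roles of `?` in Python's `re` module. After a single token it makes that token optional. After a quantifier such as `+` it makes the quantifier lazy, which changes the result of an unanchored match but not of one that must consume the whole string.

```python
import re

# '?' after a single token: the token may occur 0 or 1 times.
optional = re.compile(r"colou?r")
matches_both = [bool(optional.fullmatch(w)) for w in ("color", "colour")]

# '?' after a quantifier: Python's re treats it as a lazy (non-greedy) modifier.
greedy = re.match(r"a+", "aaa").group()   # takes as many 'a' as possible
lazy = re.match(r"a+?", "aaa").group()    # takes as few 'a' as possible

# When the whole string must match, both forms end up matching the same text,
# so the extra '?' is redundant in that situation.
anchored_greedy = re.fullmatch(r"a+", "aaa").group()
anchored_lazy = re.fullmatch(r"a+?", "aaa").group()
```

Whether the trailing `?` is redundant therefore depends on how the surrounding pattern is anchored.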

hiteshbedre

Got error while running `python -m cc_net -l my -l gu`

8

I followed the example at https://github.com/togethercomputer/RedPajama-Data/tree/main/data_prep/cc/cc_net#pipeline-overview but got the following error. Did I forget to run any preparation? ``` (racoon) t@medu:~/repos/NAM/red-pajama/data_prep/cc/cc_net$ python -m cc_net -l my -l gu usage: __main__.py [-h] [-c CONFIG_NAME]...

tiendung

How are the 5 dumps of Common Crawl selected?

1

When exploring the RedPajama dataset, I found that you selected the following five Common Crawl dumps: common_crawl/2023-06, common_crawl/2020-05, common_crawl/2021-04, common_crawl/2022-05, common_crawl/2019-30. What are the criteria for selection?...

Stanislas0

Where is the pretrained FastText model to classify each CommonCrawl webpage?

2

First of all: thank you very much for your contribution! I would be very grateful if you could share the pretrained FastText model used to classify each CommonCrawl webpage whether it is low quality...

yuhai-china
