
Wikipedia with proper data format

Open shairoz-deci opened this issue 4 years ago • 1 comment

Thank you for releasing and maintaining this repo. Can you please provide a link to the Wikipedia dataset, and any additional datasets required to train TinyBERT from scratch, in the required format (plain text with a line break between paragraphs)? The one from https://dumps.wikimedia.org/enwiki/latest/ doesn't seem to have a line break between paragraphs.

Thanks in advance,

shairoz-deci, Aug 07 '21 06:08

Hi, you should preprocess the Wikipedia dump yourself.
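
For readers wondering what that preprocessing might look like, here is a minimal sketch, not the repo's own script. It assumes the article text has already been extracted from the XML dump into plain strings (for example with a tool such as WikiExtractor), that paragraphs in that text are separated by blank lines, and that the target layout is one paragraph per line with an empty line between articles. The function names `normalize_article` and `format_corpus` are made up for illustration; check the repo's data-generation script for the exact layout it expects.

```python
import re


def normalize_article(raw_text: str) -> list[str]:
    """Split extracted article text into paragraphs, one string per paragraph.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    # A newline, optional whitespace, then another newline marks a paragraph break.
    chunks = re.split(r"\n\s*\n", raw_text.strip())
    # Collapse wrapped lines and repeated spaces inside each paragraph.
    paragraphs = [" ".join(chunk.split()) for chunk in chunks]
    # Drop anything that ended up empty.
    return [p for p in paragraphs if p]


def format_corpus(articles: list[str]) -> str:
    """Return one paragraph per line, with an empty line between articles.

    Assumption: this layout is what the downstream training script wants.
    """
    blocks = ["\n".join(normalize_article(a)) for a in articles]
    return "\n\n".join(b for b in blocks if b) + "\n"


if __name__ == "__main__":
    sample = [
        "First paragraph,\nwrapped over two lines.\n\nSecond paragraph.",
        "A different article.",
    ]
    print(format_corpus(sample), end="")
```

The string returned by `format_corpus` can be written to a text file and used as the corpus input. For a full dump you would stream articles to disk rather than hold them all in memory.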

zwjyyc, Sep 17 '21 02:09