Pretrained-Language-Model
Wikipedia with proper data format
Thank you for releasing and maintaining this repo. Could you please provide a link to the Wikipedia dataset, and to the additional datasets required to train TinyBERT from scratch, in the required format (plain text with a blank line between paragraphs)? The dump from https://dumps.wikimedia.org/enwiki/latest/ doesn't seem to have line breaks between paragraphs.
Thanks in advance,
Hi, you should preprocess the Wikipedia data yourself.
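For anyone landing here later: a minimal sketch of that preprocessing step, assuming you have already run a tool such as WikiExtractor so that each paragraph sits on its own line. The function name `add_paragraph_breaks` is illustrative, not part of this repo; it just normalizes the text so paragraphs are separated by a single blank line, the format the pretraining scripts expect.

```python
def add_paragraph_breaks(text: str) -> str:
    """Normalize extracted Wikipedia text so that paragraphs are
    separated by exactly one blank line.

    Assumes each input line is one paragraph (WikiExtractor-style
    output); empty lines and stray whitespace are dropped.
    """
    paragraphs = [line.strip() for line in text.splitlines()]
    paragraphs = [p for p in paragraphs if p]  # drop empty lines
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    raw = "First paragraph.\nSecond paragraph.\n\nThird paragraph.\n"
    print(add_paragraph_breaks(raw))
```

You would apply this file by file over the extractor's output before feeding it to the pretraining data scripts; exact filtering (e.g. dropping `<doc>` tags or very short lines) depends on which extraction tool you used.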