CLUECorpus2020

Large-scale Pre-training Corpus for Chinese (100 GB)

Results 10 CLUECorpus2020 issues

Hello, the Baidu Wangpan link for the webText2019zh_corpus corpus listed under "Community Interaction - Corpus" has expired.

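I tried several decompression tools, including unzip. They either report insufficient memory or fail with something like a checksum error. Could anyone share a way to extract the archive?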

![image](https://user-images.githubusercontent.com/4702353/83238081-7ebbf600-a1c8-11ea-9b3f-73f91371782a.png) @brightmart Thank you very much! Is `sequence_len` 512? And was pre-training run for only 125K steps, i.e. roughly 120,000 steps?

Does CLUECorpus2020 include the small 14G corpus?

Hi, I am having trouble downloading the data from Baidu Wangpan (百度网盘). It would be very useful if the data were also made available on another cloud service like GCS or...

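The web pages in Common Crawl contain a lot of dirty data, and careful filtering is needed to obtain clean Chinese text. I see that your technical report describes several processing methods, but only in general terms. Will the data-cleaning code be open-sourced later?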

As the title says, I hope you can check your email. Thank you!

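Hello, I applied for the large dataset by email but have not received a reply, and I would like to know why. If it cannot be made freely available, paid access could also be considered.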