yyht comments

Results: 8 comments by yyht

Has ELECTRA tiny been observed to converge faster than RoBERTa tiny (on fine-tuning tasks)?

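ELECTRA's core advantage lies in pretraining: the discriminator learns from every token, which is a higher data utilization than MLM, where only 15% of tokens are used. Its main selling point is therefore pretraining efficiency (comparable results in fewer iterations). See Table 6: across model sizes, ELECTRA needs fewer iterations than the usual RoBERTa and BERT. See also Table 2: with comparable train FLOPs ELECTRA does better, and with a quarter of RoBERTa's train FLOPs it is roughly on par or better. ![image](https://user-images.githubusercontent.com/14133687/77385372-e3229980-6dc2-11ea-8b65-533dd63ff683.png) How fast downstream fine-tuning converges does not really matter (a larger lr with fewer epochs, or a smaller lr with more epochs; either way you just validate on the dev set).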
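The data-utilization argument can be illustrated with a rough sketch. It assumes a toy sequence length and a 15% masking rate; the function names `mlm_loss_positions` and `rtd_loss_positions` are hypothetical and are not taken from the ELECTRA code base.

```python
import random

def mlm_loss_positions(seq_len, mask_prob=0.15, seed=0):
    """Positions that contribute to a masked-LM loss: only the masked ones."""
    rng = random.Random(seed)
    n_masked = max(1, round(seq_len * mask_prob))
    return sorted(rng.sample(range(seq_len), n_masked))

def rtd_loss_positions(seq_len):
    """Positions that contribute to a replaced-token-detection loss: all of them."""
    return list(range(seq_len))

seq_len = 128  # illustrative value, not from the original comment
mlm = mlm_loss_positions(seq_len)
rtd = rtd_loss_positions(seq_len)
print(f"MLM supervises {len(mlm)}/{seq_len} tokens")
print(f"Discriminator supervises {len(rtd)}/{seq_len} tokens")
```

The sketch only counts supervised positions per sequence; it says nothing about how much each position contributes to final quality.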

Has ELECTRA tiny been observed to converge faster than RoBERTa tiny (on fine-tuning tasks)?

Also, ELECTRA's pretraining metrics matter a lot: if the discriminator's metrics do not improve, fine-tuning results will generally be poor as well.

Has ELECTRA tiny been observed to converge faster than RoBERTa tiny (on fine-tuning tasks)?

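I have not compared this yet. I can upload intermediate checkpoints of roberta-tiny and electra-tiny so it can be verified; because of limits on my own time and resources, I have not run experiments of this kind.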

How to fine-tune on my own data

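bert_config_tiny.json is the discriminator's configuration file (to allow a comparison with a roberta-tiny of the same size, the generator is 1/4 the size of the discriminator). For fine-tuning, use the PyCLUE package directly (it uses the official BERT source code, with scope='electra') and add the layer-wise learning rate decay from the official code to the corresponding optimizer.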
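Layer-wise learning rate decay can be sketched as follows. This is a minimal illustration, not the official implementation: the function name `layerwise_learning_rates`, the depth numbering, and the example values (`base_lr=1e-4`, `layer_decay=0.8`, 4 layers) are all assumptions made for the example.

```python
def layerwise_learning_rates(base_lr, layer_decay, n_layers):
    """Map each depth to a learning rate that shrinks toward the embeddings.

    Assumed depth numbering: 0 is the embedding layer, 1..n_layers are the
    transformer layers, and n_layers + 1 is the task-specific head, which
    keeps the full base learning rate.
    """
    top = n_layers + 1
    return {
        depth: base_lr * layer_decay ** (top - depth)
        for depth in range(top + 1)
    }

# Illustrative values only.
lrs = layerwise_learning_rates(base_lr=1e-4, layer_decay=0.8, n_layers=4)
for depth, lr in lrs.items():
    print(depth, f"{lr:.2e}")
```

In a real optimizer, each variable would be matched to its depth (for example by its name prefix) and updated with the corresponding rate, so that lower layers move less than the task head.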

When applying ReZero to BERT or GPT, I get NaN gradients

1. I initialized \alpha to zero.
2. The initialization follows the official BERT initialization: the embedding matrix and kernel matrices are initialized via: def create_initializer(initializer_range=0.02): """Creates a `truncated_normal_initializer` with the...
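For context, item 1 refers to the ReZero residual form x + \alpha * F(x), where \alpha starts at zero so that every block is an identity map at initialization. A minimal sketch in plain Python follows; `rezero_residual` and `toy_sublayer` are hypothetical names, the toy sublayer merely stands in for a real attention or feed-forward sublayer, and the sketch does not reproduce the NaN problem.

```python
def rezero_residual(x, sublayer, alpha):
    """ReZero residual connection: x + alpha * sublayer(x), elementwise."""
    return [xi + alpha * fi for xi, fi in zip(x, sublayer(x))]

def toy_sublayer(x):
    # Hypothetical stand-in for an attention or feed-forward sublayer.
    return [2.0 * xi + 1.0 for xi in x]

x = [0.5, -1.0, 3.0]

# With alpha = 0 the block passes its input through unchanged.
out_at_init = rezero_residual(x, toy_sublayer, alpha=0.0)

# Once alpha moves away from zero, the sublayer starts to contribute.
out_later = rezero_residual(x, toy_sublayer, alpha=0.1)

print(out_at_init)
print(out_later)
```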

yyht

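Has ELECTRA tiny been observed to converge faster than RoBERTa tiny (on fine-tuning tasks)?

Has ELECTRA tiny been observed to converge faster than RoBERTa tiny (on fine-tuning tasks)?

Has ELECTRA tiny been observed to converge faster than RoBERTa tiny (on fine-tuning tasks)?

How to fine-tune on my own data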

When applying ReZero to BERT or GPT, I get NaN gradients

When training TTA with the bert-base config and sequence length 512, I got NaN

I have done some experiments on Chinese using the bert-base config; the results are not promising

Do you have any plans to release the pretraining code with Horovod?