kiseliu
Beyond the pre-norm vs. post-norm difference and the tokenizer difference mentioned in the paper, I compared the network structure of PLATO against that of PLATO-2 (the stage 2.1 PLATO model) and found some subtle differences: 1. When predicting the latent variable, PLATO 1 passes the final hidden state of the mask token through post_network, whereas PLATO-2...
For the following parameters in the config at https://github.com/PaddlePaddle/Knover/blob/develop/projects/PLATO-2/pretrain/24L_infer.conf: ``` init_params="./24L/Plato" nsp_init_params="./24L/NSP" ``` How can I get these two models? Do I need to convert them from the model at https://dialogue.bj.bcebos.com/Knover/projects/PLATO-2/24L.tar...
A month ago, I trained alpaca with 4 A100 GPUs (80 GB each) and `per_device_train_batch_size=4`, with `transformers==4.28.1`. Today I retrained alpaca with the same hardware and the same code,...
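One common source of divergence between two otherwise identical training runs is an unpinned dependency silently upgrading between runs. A minimal `requirements.txt` fragment that pins the version mentioned above (the other entries are illustrative assumptions, not taken from the original setup):

```
# Pin the exact transformers release used in the first run
transformers==4.28.1
# Hypothetical companion pins -- substitute the versions from your own environment
# torch==<version from first run>
# accelerate==<version from first run>
```

Recording the full environment with `pip freeze > requirements.txt` after a successful run makes it possible to reproduce it later with `pip install -r requirements.txt`.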
Thanks for your amazing work. Since I am not very familiar with memory-usage computation, I would like to know whether you could provide more details about `Table 1`...
Hi, thanks for sharing this codebase. After I run `bash scripts/run_pile.sh`, I obtain the following results: the generated domain reweights differ slightly from the released...