This model is a waste of time for MLM, why did you even make it if it cannot be used?
I do not get why you would release a model that is even worse than a unigram model. I read that it is one of the best on the GLUE tasks, but I do not see how, because it predicts: "The capital of France is plunge."
from transformers import pipeline

# Fill-mask pipeline with the DeBERTa base checkpoint from the hub.
unmasker = pipeline("fill-mask", model="microsoft/deberta-base")
the_out = unmasker("The capital of France is [MASK].")
print("the_out", the_out)
As you can see, the DeBERTa results are completely wrong; there must be some big error in porting it to transformers.
the_out [
  {'score': 0.001861382625065744, 'token': 18929, 'token_str': 'ABC', 'sequence': 'The capital of France isABC.'},
  {'score': 0.0012871784856542945, 'token': 15804, 'token_str': ' plunge', 'sequence': 'The capital of France is plunge.'},
  {'score': 0.001228992477990687, 'token': 47366, 'token_str': 'amaru', 'sequence': 'The capital of France isamaru.'},
  {'score': 0.0010126306442543864, 'token': 46703, 'token_str': 'bians', 'sequence': 'The capital of France isbians.'},
  {'score': 0.0008897537481971085, 'token': 43107, 'token_str': 'insured', 'sequence': 'The capital of France isinsured.'}
]
In my opinion, DeBERTa's MLM is meant to be used with EMD (the enhanced mask decoder), but the transformers pipeline does not use the EMD code for [MASK] token prediction. So you cannot rely on the results produced by the transformers fill-mask pipeline.
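If the EMD-based head really is not part of what transformers loads, you should be able to see it from the loading info. A minimal sketch (using the standard `output_loading_info=True` option of `from_pretrained`; the exact key names printed depend on the checkpoint):

```python
from transformers import AutoModelForMaskedLM

# Load the checkpoint and ask transformers to report which weights it could
# not find in the checkpoint (those get randomly initialized).
model, loading_info = AutoModelForMaskedLM.from_pretrained(
    "microsoft/deberta-base", output_loading_info=True
)

# If the MLM head used for pre-training is not exported in the checkpoint,
# its parameters show up here as missing keys and are initialized at random,
# which would explain the nonsense fill-mask predictions above.
print("missing keys:", loading_info["missing_keys"])
print("unexpected keys:", loading_info["unexpected_keys"])
```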
Might there be some other mistake? Continuing pretraining of deberta-v3-large with the basic Hugging Face MLM pipeline works for me. Tested on the Kaggle NBME dataset.
If you use the basic Hugging Face MLM pipeline to continue pretraining deberta-v3-large, the pretrained encoder weights are trained together with the prediction-head weights, which are newly initialized. That is why it works when you fine-tune with the basic Hugging Face MLM pipeline.
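For reference, a minimal sketch of what that continued MLM pretraining looks like with the standard transformers tooling. The file path and hyperparameters here are placeholders for illustration, not the exact settings used in this thread:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Encoder weights come from the checkpoint; the MLM head is newly
# initialized and learned together with the encoder.
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Plain-text corpus, one example per line (path is a placeholder).
dataset = load_dataset("text", data_files={"train": "tapt_val.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked on the fly for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tmp/test-mlm",
    per_device_train_batch_size=6,
    learning_rate=2e-5,   # illustrative value, not a recommendation
    num_train_epochs=1,
    fp16=True,            # requires a GPU
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```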
@chenweizhu
Hi, may I ask what settings you used for continued pretraining of deberta-v3-large? I always get an accuracy of about 0.04 with that model, but deberta-v3-base and deberta-base work fine.
I used:

!python /content/drive/MyDrive/NBME/run_mlm.py \
  --model_name_or_path microsoft/deberta-v3-large \
  --train_file /content/drive/MyDrive/NBME/tapt_val.txt \
  --per_device_train_batch_size 6 \
  --per_device_eval_batch_size 6 \
  --max_seq_length 951 \
  --do_train \
  --do_eval \
  --fp16 \
  --save_total_limit 5 \
  --save_steps 5000 \
  --learning_rate 2e-4 \
  --line_by_line \
  --overwrite_output_dir \
  --output_dir tmp/test-mlm
DeBERTa-v3 was trained with the replaced-token-detection objective and has never learned the MLM objective. So it's absolutely to be expected that it does not perform well on MLM, because that's not the purpose of v3. Read the paper before you complain: https://arxiv.org/abs/2111.09543
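For context, replaced token detection is ELECTRA-style: a small generator fills in masked positions, and the main model (the discriminator) is trained to classify every token as original or replaced. A rough sketch of the discriminator side of that loss; the function and tensor names are illustrative, not the actual DeBERTa-v3 training code:

```python
import torch
import torch.nn.functional as F

def rtd_discriminator_loss(disc_logits, input_ids, corrupted_ids, attention_mask):
    """Replaced-token-detection loss, ELECTRA-style (illustrative sketch).

    disc_logits:    (batch, seq_len) per-token scores from the discriminator
    input_ids:      (batch, seq_len) original token ids
    corrupted_ids:  (batch, seq_len) ids after the generator replaced some tokens
    attention_mask: (batch, seq_len) 1 for real tokens, 0 for padding
    """
    # A token counts as "replaced" if the generator's sample differs from the original.
    labels = (corrupted_ids != input_ids).float()
    mask = attention_mask.float()
    # Binary cross-entropy over every non-padding position.
    loss = F.binary_cross_entropy_with_logits(disc_logits, labels, reduction="none")
    return (loss * mask).sum() / mask.sum()
```

Because the model only ever learns this binary original-vs-replaced decision, its output head has nothing to say about which vocabulary item belongs at a [MASK] position, which is why fill-mask on v3 checkpoints is meaningless.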
Our code for pre-training V3 has been updated.