Is biogpt's tokenizer bugged?
System Info
- transformers version: 4.27.1
- Platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.17
- Python version: 3.8.13
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.13.1 (True)
Who can help?
@ArthurZucker and @younesbelkada could you please confirm this behavior is intended? Sorry if I mistagged. Thanks!
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
from transformers import AutoTokenizer
tokenizer_name = "microsoft/BioGPT-Large"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
print('bos token: ', tokenizer.bos_token, 'id: ', tokenizer.bos_token_id)
print('eos token: ', tokenizer.eos_token, 'id: ', tokenizer.eos_token_id)
print('token ids: ', tokenizer("this is a test")['input_ids'])
print('tokens: ', tokenizer.decode(tokenizer("this is a test")['input_ids']))
Output:
bos token: <s> id: 0
eos token: </s> id: 2
token ids: [2, 54, 34, 21, 229]
tokens: </s>this is a test
Expected behavior
I would expect the tokenizer to prepend the BOS token (id 0) and append the EOS token (id 2). Instead, it currently prepends the EOS token and does not append any special token to the end of the sequence.
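For completeness, here is a quick way to inspect what the tokenizer actually adds; the ids are the ones from the output above, and `build_inputs_with_special_tokens` is the standard hook that attaches special tokens:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/BioGPT-Large")

# Default call: a </s> (id 2) is prepended and nothing is appended
print(tokenizer("this is a test", add_special_tokens=True)["input_ids"])

# Without special tokens: just the plain subword ids
print(tokenizer("this is a test", add_special_tokens=False)["input_ids"])

# The hook that decides which special tokens get added around a sequence
print(tokenizer.build_inputs_with_special_tokens([54, 34, 21, 229]))
```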
@fedshyvana I believe this is how BioGPT is trained in fairseq. For more information, you can check the official BioGPT repo.
@upjabir thanks for pointing it out! I am looking at https://github.com/microsoft/BioGPT/blob/main/src/language_model_prompt_dataset.py, which I believe is the code you're referring to. If I understand correctly, they use `[EOS] token_1, ..., token_n` as the input and `token_1, ..., token_n [EOS]` as the target,
i.e. it seems like they just don't use a separate BOS token at all. But the HF BioGPT model config says `"bos_token_id": 0` and `"eos_token_id": 2`.
Should we change it to `"bos_token_id": 2` and `"eos_token_id": 2`,
or would it not make any difference at all? Thank you!
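Just to make sure I'm reading it right, here is a minimal sketch of the pairing I understand from that file (not their actual code; the ids are taken from my repro above):

```python
# Reuse the same EOS token as the "start" marker of the input and as the
# final target token, with no separate BOS at all.
eos_id = 2
token_ids = [54, 34, 21, 229]      # "this is a test"

input_ids = [eos_id] + token_ids   # </s> token_1 ... token_n
target_ids = token_ids + [eos_id]  # token_1 ... token_n </s>

print(input_ids)   # [2, 54, 34, 21, 229]
print(target_ids)  # [54, 34, 21, 229, 2]
```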
@fedshyvana `bos_token_id` and `eos_token_id` are added to the vocabulary, as we always do for every tokenizer. But when building inputs with special tokens, only `eos_token_id` is considered. Even though `bos_token` is not used when handling special tokens, I believe it can be helpful in some rare cases.
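If you really want BOS prepended and EOS appended for your own task, one workaround (just an illustrative sketch, not an official fix) is to subclass the tokenizer and override `build_inputs_with_special_tokens`:

```python
from transformers import BioGptTokenizer

class BioGptTokenizerWithBosEos(BioGptTokenizer):
    # Illustrative override: prepend BOS and append EOS instead of the
    # default behaviour of only prepending </s>.
    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        if token_ids_1 is None:
            return [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
        return (
            [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
            + token_ids_1 + [self.eos_token_id]
        )

tok = BioGptTokenizerWithBosEos.from_pretrained("microsoft/BioGPT-Large")
print(tok("this is a test")["input_ids"])  # starts with 0 (BOS) and ends with 2 (EOS)
```

Whether that actually helps depends on how the model was trained, so the default behaviour that matches the fairseq setup is probably what you want for generation.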