
Is BioGPT's tokenizer bugged?

Open fedshyvana opened this issue 2 years ago • 3 comments

System Info

  • transformers version: 4.27.1
  • Platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.17
  • Python version: 3.8.13
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.13.1 (True)

Who can help?

@ArthurZucker and @younesbelkada could you please confirm this behavior is intended? Sorry if I mistagged. Thanks!

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
tokenizer_name = "microsoft/BioGPT-Large"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
print('bos token: ', tokenizer.bos_token, 'id: ', tokenizer.bos_token_id)
print('eos token: ', tokenizer.eos_token, 'id: ', tokenizer.eos_token_id)
print('token ids: ', tokenizer("this is a test")['input_ids'])
print('tokens: ', tokenizer.decode(tokenizer("this is a test")['input_ids']))

Output:

bos token:  <s> id:  0
eos token:  </s> id:  2
token ids:  [2, 54, 34, 21, 229]
tokens:  </s>this is a test

Expected behavior

I would expect the tokenizer to prepend the BOS token (i.e. 0) and append the EOS token (i.e. 2). Currently the tokenizer instead prepends the EOS token and does not add any special token at the end of the sequence.
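
For comparison, a minimal sketch of that expected behavior, adding the special tokens manually (the bos/eos ids are the ones printed above; this is not what the tokenizer currently does):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/BioGPT-Large")
# Tokenize without special tokens, then add BOS and EOS by hand.
ids = tokenizer("this is a test", add_special_tokens=False)['input_ids']
ids = [tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id]
print('token ids: ', ids)                  # should give [0, 54, 34, 21, 229, 2]
print('tokens: ', tokenizer.decode(ids))   # should give <s>this is a test</s>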

fedshyvana avatar Mar 21 '23 03:03 fedshyvana

@fedshyvana I believe this is how BioGPT is trained in fairseq. For more information, you can check the official BioGPT repo.

upjabir avatar Mar 23 '23 09:03 upjabir

@upjabir thanks for pointing it out! I am looking at https://github.com/microsoft/BioGPT/blob/main/src/language_model_prompt_dataset.py, which I believe is the code you're referring to. If I understand correctly, they use [EOS] token_1, ..., token_n as the input and token_1, ..., token_n [EOS] as the target (sketched below).

In other words, it seems they just don't use a separate BOS token at all. But the HF BioGPT model config says: "bos_token_id": 0, "eos_token_id": 2

Should we change it to: "bos_token_id": 2, "eos_token_id": 2

Or would it not make any difference at all? Thank you!
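
To make the scheme concrete, a rough sketch of how those input/target pairs would be built (the variable names are mine, not from the BioGPT code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/BioGPT-Large")
eos = tokenizer.eos_token_id  # 2, i.e. </s>

tokens = tokenizer("this is a test", add_special_tokens=False)['input_ids']
input_ids = [eos] + tokens    # [EOS] token_1, ..., token_n
target_ids = tokens + [eos]   # token_1, ..., token_n [EOS]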

fedshyvana avatar Mar 23 '23 13:03 fedshyvana

@fedshyvana bos_token_id and eos_token_id are added to the vocabulary, as we do for every tokenizer. But when building inputs with special tokens, only eos_token_id is considered. Although the bos_token is not used when handling special tokens, I believe it can be helpful in some rare cases.
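
For anyone who wants to verify this, build_inputs_with_special_tokens is the standard tokenizer hook that decides which special tokens get added, so you can call it directly (a quick check using the public API, not BioGPT-specific internals):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/BioGPT-Large")
ids = tokenizer("this is a test", add_special_tokens=False)['input_ids']
# For BioGPT this only prepends the EOS/SEP id (2); nothing is appended.
print(tokenizer.build_inputs_with_special_tokens(ids))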

upjabir avatar Mar 24 '23 12:03 upjabir