
Special token gets tokenized while training tokenizer from scratch

Open LalchandPandia opened this issue 1 year ago • 1 comment

@ArthurZucker I am trying to train a byte-level BPE tokenizer on my dataset. I have a list of words that I want treated as single tokens, but when I train the tokenizer and then tokenize, I observe that such a token gets split into two or more parts. My end goal is to train a RoBERTa LM on my dataset.

```python
from tokenizers import ByteLevelBPETokenizer

files = 'file.txt'

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    files,
    vocab_size=100000,
    min_frequency=5,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>", "auto_part", "bokchoy"],
)

tokenizer.save_model('bpe_piece')
```

Test the tokenizer:

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('bpe_piece')
print(tokenizer.tokenize('an bokchoy auto_part'))
```

The output should be `['an', 'bokchoy', 'auto_part']`, but instead it is `['an', 'Ġbok', 'choy', 'Ġauto', '_', 'part']`.
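One likely explanation (an assumption, not confirmed in this issue) is that `save_model` writes only `vocab.json` and `merges.txt`, so the added-token metadata is lost when the model is reloaded through the slow `RobertaTokenizer`, which then falls back to BPE merges and splits the words. As a sketch of the contrast, encoding with the trained `tokenizers` object directly keeps the special tokens whole; the corpus and paths below are illustrative stand-ins for the issue's `file.txt`:

```python
import os
import tempfile

from tokenizers import ByteLevelBPETokenizer

# Illustrative stand-in for the issue's training file.
tmpdir = tempfile.mkdtemp()
corpus = os.path.join(tmpdir, "file.txt")
with open(corpus, "w") as f:
    f.write("an auto_part for the bokchoy harvester\n" * 50)

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    [corpus],
    vocab_size=1000,
    min_frequency=1,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>",
                    "auto_part", "bokchoy"],
)

# Special tokens registered at training time are matched atomically
# by the tokenizer object itself, before byte-level BPE runs.
print(tokenizer.encode("an bokchoy auto_part").tokens)
```

If that holds, saving with `tokenizer.save("tokenizer.json")` and loading via a fast tokenizer class, or calling `add_tokens(["auto_part", "bokchoy"])` on the reloaded slow tokenizer, would be the places to look for a fix.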

LalchandPandia · Sep 02 '24 14:09

@ArthurZucker

LalchandPandia · Sep 02 '24 14:09