Special token gets tokenized while training tokenizer from scratch
@ArthurZucker I am trying to train a byte-level BPE tokenizer on my dataset. I have a list of words that I want treated as single tokens, but when I train the tokenizer and then tokenize, each of those words gets split into two parts. My end goal is to train a RoBERTa LM on my dataset.

```python
from tokenizers import ByteLevelBPETokenizer

files = 'file.txt'

tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    files,
    vocab_size=100000,
    min_frequency=5,
    show_progress=True,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model('bpe_piece')
```
Test the tokenizer:
```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('bpe_piece')
print(tokenizer.tokenize('an bokchoy auto_part'))
```
The expected output is `['an', 'bokchoy', 'auto_part']`, but instead the output is `['an', 'Ġbok', 'choy', 'Ġauto', '_', 'part']`.
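For illustration, here is a minimal stdlib sketch (not the `tokenizers` API) of the behavior I am after: words on a protected list are split off from the text first, so the BPE step never sees them and cannot break them apart. The helper names (`tokenize_with_protected`, `toy_bpe`) and the toy BPE stand-in are my own assumptions, not library code.

```python
import re

def tokenize_with_protected(text, protected, bpe_tokenize):
    """Emit any word in `protected` as a single token; hand everything
    else to `bpe_tokenize` (a stand-in for the trained BPE model)."""
    if not protected:
        return bpe_tokenize(text)
    # Match longer protected words first so overlaps resolve greedily.
    pattern = '(' + '|'.join(
        re.escape(w) for w in sorted(protected, key=len, reverse=True)
    ) + ')'
    tokens = []
    for piece in re.split(pattern, text):
        if not piece:
            continue
        if piece in protected:
            tokens.append(piece)                 # protected word survives whole
        else:
            tokens.extend(bpe_tokenize(piece))   # the rest goes through BPE
    return tokens

# Toy BPE stand-in that splits on whitespace and underscores,
# mimicking how the real model broke 'auto_part' apart.
toy_bpe = lambda s: [t for t in re.split(r'(\s+|_)', s) if t.strip()]

print(tokenize_with_protected('an bokchoy auto_part',
                              {'bokchoy', 'auto_part'}, toy_bpe))
# → ['an', 'bokchoy', 'auto_part']
```

Without the protected list, the same toy model reproduces the unwanted split: `toy_bpe('auto_part')` gives `['auto', '_', 'part']`.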