unigram.json to a transformers BERT tokenizer
I tried converting a unigram.json file to a transformers tokenizer, and I also converted the tokenizer to the BERT format ([CLS] SENTENCE [SEP]). I'm sharing it because I think it will be helpful for people with the same problem.
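The script below only assumes that unigram.json is a JSON object with a "vocab" key holding [token, score] pairs, since that is all it reads. If you want to try the script without a real file, a minimal stand-in with a made-up vocabulary can be written like this (a real file from a trained Unigram model will be much larger):

import json

# Hypothetical minimal unigram.json: just a "vocab" list of [token, score]
# pairs. [CLS] and [SEP] must be present for the BERT template below to work.
vocab = [
    ["<unk>", 0.0],
    ["[CLS]", 0.0],
    ["[SEP]", 0.0],
    ["▁Hi", -1.5],  # '▁' marks a word boundary in SentencePiece
    ["?", -2.0],
]
with open('unigram.json', 'w', encoding='utf-8') as f:
    json.dump({'vocab': vocab}, f, ensure_ascii=False)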
- BERT tokenizer format
import json
from tokenizers import SentencePieceUnigramTokenizer
from tokenizers.processors import TemplateProcessing
from transformers import PreTrainedTokenizerFast

test_input = 'Hi?'

# unigram.json stores the vocabulary as a list of [token, score] pairs
with open('unigram.json', encoding='utf-8-sig') as f:
    json_file = json.load(f)

# SentencePieceUnigramTokenizer expects (token, score) tuples
vocab = json_file['vocab']
for idx, v in enumerate(vocab):
    vocab[idx] = tuple(v)

tokenizer = SentencePieceUnigramTokenizer(vocab)

# Post-process into the BERT '[CLS] SENTENCE [SEP]' format;
# [CLS] and [SEP] must already be in the vocabulary, otherwise
# token_to_id returns None
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Wrap it so it can be used like any other fast tokenizer in transformers
tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
print(tokenizer.tokenize(test_input, add_special_tokens=True))
# e.g. ['[CLS]', '▁Hi', '?', '[SEP]'], depending on the vocabulary
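Once wrapped, it should behave like a regular fast tokenizer: sentence pairs get token_type_ids from the pair template, and the tokenizer can be saved and reloaded. A short sketch (the exact ids and pieces depend on your vocabulary, and the directory name is arbitrary):

from transformers import AutoTokenizer

# Single sentence: the post-processor adds the [CLS]/[SEP] ids
print(tokenizer(test_input)['input_ids'])

# Sentence pair: segment B and its trailing [SEP] get token_type_id 1,
# matching the '[CLS] $A [SEP] $B:1 [SEP]:1' template
pair = tokenizer('Hi?', 'How are you?')
print(pair['input_ids'])
print(pair['token_type_ids'])

# Save and reload like any other transformers tokenizer
tokenizer.save_pretrained('my-bert-style-tokenizer')
reloaded = AutoTokenizer.from_pretrained('my-bert-style-tokenizer')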