
unigram.json to transformers bert tokenizer

Open sooftware opened this issue 4 years ago • 0 comments

I converted a unigram.json file to a transformers tokenizer, and then converted that tokenizer to the BERT input format ([CLS] SENTENCE [SEP]). I'm sharing it because I think it will be helpful for people with the same concerns as me.
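For context, a minimal sketch of the unigram.json structure this assumes: the "vocab" field is a list of [token, score] pairs, which JSON loads as lists, while SentencePieceUnigramTokenizer expects (token, score) tuples. The tokens and scores below are made up for illustration:

```python
import json

# Hypothetical minimal unigram.json payload (tokens/scores are placeholders).
unigram_json = json.loads("""
{
  "vocab": [
    ["[CLS]", 0.0],
    ["[SEP]", 0.0],
    ["[UNK]", 0.0],
    ["\\u2581Hi", -2.5],
    ["?", -3.1]
  ]
}
""")

# Convert each [token, score] list from JSON into the (token, score)
# tuple form the tokenizer constructor expects.
vocab = [tuple(pair) for pair in unigram_json["vocab"]]
print(vocab[0])  # ('[CLS]', 0.0)
```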

  • BERT tokenizer format
import json
from tokenizers import SentencePieceUnigramTokenizer
from tokenizers.processors import TemplateProcessing
from transformers import PreTrainedTokenizerFast

test_input = 'Hi?'

# unigram.json stores the vocab as a list of [token, score] pairs;
# SentencePieceUnigramTokenizer expects (token, score) tuples.
with open('unigram.json', encoding='utf-8-sig') as f:
    json_file = json.load(f)
    vocab = json_file['vocab']
    for idx, v in enumerate(vocab):
        vocab[idx] = tuple(v)

tokenizer = SentencePieceUnigramTokenizer(vocab)

# '[CLS] SENTENCE [SEP]' format
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
print(tokenizer.tokenize(test_input, add_special_tokens=True))
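To make the TemplateProcessing step above concrete, here is a plain-Python sketch of what the "[CLS] $A [SEP]" / "[CLS] $A [SEP] $B:1 [SEP]:1" templates do to the token sequence (the helper name is made up; the real post-processor operates on ids inside the Rust backend):

```python
def apply_bert_template(tokens_a, tokens_b=None):
    # Mirrors TemplateProcessing(single="[CLS] $A [SEP]",
    #                            pair="[CLS] $A [SEP] $B:1 [SEP]:1"):
    # sequence A and its [CLS]/[SEP] get token type id 0,
    # sequence B and its trailing [SEP] get token type id 1.
    out = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    type_ids = [0] * len(out)
    if tokens_b is not None:
        out += list(tokens_b) + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)
    return out, type_ids

print(apply_bert_template(["▁Hi", "?"]))
# (['[CLS]', '▁Hi', '?', '[SEP]'], [0, 0, 0, 0])
```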

sooftware · Sep 03 '21 14:09