
unigram.json to transformers bert tokenizer

Open sooftware opened this issue 4 years ago • 0 comments

I converted a unigram.json file to a transformers tokenizer, and then converted that tokenizer to the BERT input format ([CLS] SENTENCE [SEP]). I'm sharing it because I think it will be helpful for people with the same concerns as me.
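For context, a minimal sketch of the unigram.json structure this assumes: the "vocab" field is a list of [token, score] pairs, which JSON loads as lists, while SentencePieceUnigramTokenizer expects (token, score) tuples. The tokens and scores below are made up for illustration:

```python
import json

# Hypothetical minimal unigram.json payload (tokens/scores are placeholders).
unigram_json = json.loads("""
{
  "vocab": [
    ["[CLS]", 0.0],
    ["[SEP]", 0.0],
    ["[UNK]", 0.0],
    ["\\u2581Hi", -2.5],
    ["?", -3.1]
  ]
}
""")

# Convert each [token, score] list from JSON into the (token, score)
# tuple form the tokenizer constructor expects.
vocab = [tuple(pair) for pair in unigram_json["vocab"]]
print(vocab[0])  # ('[CLS]', 0.0)
```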

  • BERT tokenizer format
import json
from tokenizers import SentencePieceUnigramTokenizer
from tokenizers.processors import TemplateProcessing
from transformers import PreTrainedTokenizerFast

test_input = 'Hi?'

# unigram.json stores the vocab as a list of [token, score] pairs;
# SentencePieceUnigramTokenizer expects (token, score) tuples.
with open('unigram.json', encoding='utf-8-sig') as f:
    json_file = json.load(f)
    vocab = json_file['vocab']
    for idx, v in enumerate(vocab):
        vocab[idx] = tuple(v)

tokenizer = SentencePieceUnigramTokenizer(vocab)

# '[CLS] SENTENCE [SEP]' format
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
print(tokenizer.tokenize(test_input, add_special_tokens=True))
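To make the TemplateProcessing step above concrete, here is a plain-Python sketch of what the "[CLS] $A [SEP]" / "[CLS] $A [SEP] $B:1 [SEP]:1" templates do to the token sequence (the helper name is made up; the real post-processor operates on ids inside the Rust backend):

```python
def apply_bert_template(tokens_a, tokens_b=None):
    # Mirrors TemplateProcessing(single="[CLS] $A [SEP]",
    #                            pair="[CLS] $A [SEP] $B:1 [SEP]:1"):
    # sequence A and its [CLS]/[SEP] get token type id 0,
    # sequence B and its trailing [SEP] get token type id 1.
    out = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    type_ids = [0] * len(out)
    if tokens_b is not None:
        out += list(tokens_b) + ["[SEP]"]
        type_ids += [1] * (len(tokens_b) + 1)
    return out, type_ids

print(apply_bert_template(["▁Hi", "?"]))
# (['[CLS]', '▁Hi', '?', '[SEP]'], [0, 0, 0, 0])
```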

sooftware · Sep 03 '21 14:09