The Tokenizer takes the token ‘pricerange’ as ‘[UNK]’?

Open HunYuanfeng opened this issue 5 years ago • 1 comments

For example: tokens: ['i', 'am', 'looking', 'for', 'a', 'restaurant', 'in', 'the', '[restaurant_area]', '.', 'postcode', 'type', 'phone', 'food', 'pricerange', 'address', 'area', 'name', 'id', 'reference']

input_ids: [8, 35, 51, 15, 12, 45, 18, 9, 67, 6, 89, 117, 68, 88, 3, 82, 70, 346, 281, 49, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tokenizer.convert_id_to_tokens(input_ids): i am looking for a restaurant in the [restaurant_area] . postcode type phone food [UNK] address area name id reference.

The Tokenizer maps the token ‘pricerange’ to ‘[UNK]’, so the training code might not work. Is this normal? Is there something incorrect in the source code? I tried to examine this issue with:

tokenizer = Tokenizer(vocab, ivocab, False)
print(tokenizer.vocab_len)  # 3130
print(tokenizer.get_word_id('pricerange'))  # 3
print(tokenizer.get_word(3))  # [UNK]

Sep 12 '20 04:09 HunYuanfeng

@HunYuanfeng According to the "data/vocab.json" file, 'pricerange' may need to be replaced with "[restaurant_pricerange]" in your sentence above.
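
A quick way to check this kind of mismatch is to list the tokens that are missing from the vocabulary and look for a bracketed placeholder with a matching suffix. The sketch below is only an illustration: it assumes the vocabulary is a plain token-to-id mapping, the helper names `find_unknown_tokens` and `suggest_placeholders` are hypothetical, and the toy vocabulary (including the id 500) is made up rather than copied from "data/vocab.json".

```python
# Toy vocabulary for illustration only. The ids for "[UNK]" and "food" match
# the output quoted in this thread; the id 500 is invented for this example.
toy_vocab = {
    "[PAD]": 0,
    "[UNK]": 3,
    "food": 88,
    "[restaurant_pricerange]": 500,
}


def find_unknown_tokens(tokens, vocab):
    """Return the tokens that are not in vocab (these would map to [UNK])."""
    return [tok for tok in tokens if tok not in vocab]


def suggest_placeholders(token, vocab):
    """Return bracketed vocabulary entries that end with '_<token>]'."""
    suffix = "_" + token + "]"
    return [entry for entry in vocab
            if entry.startswith("[") and entry.endswith(suffix)]


tokens = ["food", "pricerange"]
missing = find_unknown_tokens(tokens, toy_vocab)
for tok in missing:
    # Show each out-of-vocabulary token with its candidate replacements.
    print(tok, "->", suggest_placeholders(tok, toy_vocab))
```

Running the same check against the real vocabulary would show whether any other slot names in the input also need their bracketed form.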

Sep 27 '20 09:09 fasterbuild