
Intermittent Tokenizer Test

mauicv opened this issue 3 years ago · 0 comments

This test fails intermittently in CI. The test asserts that no punctuation is sampled, but the tokenizer's decoding step inserts punctuation in certain cases, which causes the failure. For instance, in the example below the tokenizer merges "do not" -> "don't", but only when the pair is preceded by another token.

```python
from alibi.utils.lang_model import BertBaseUncased

tokenizer = BertBaseUncased().tokenizer

# With a preceding token ("ever"), "do" + "not" decodes with a contraction
ids = [[2412, 2079, 2025]]
tokenizer.batch_decode(ids, skip_special_tokens=True)
# ["ever don't"]

# The same "do" + "not" pair on its own decodes without punctuation
ids = [[2079, 2025]]
tokenizer.batch_decode(ids, skip_special_tokens=True)
# ['do not']
```
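
The contraction likely comes from the `clean_up_tokenization` post-processing step in `transformers`, which (as far as I can tell) applies literal string replacements such as `" do not"` -> `" don't"`; since the pattern carries a leading space, it only fires when another token precedes the pair. A minimal stdlib-only sketch of that suspected rule, with the pattern and function name being my own illustration rather than the actual library code:

```python
def clean_up(text: str) -> str:
    # Sketch of the contraction replacement believed to run during
    # decode cleanup (assumption: the pattern " do not" has a leading
    # space, so it only matches when "do not" follows another word).
    return text.replace(" do not", " don't")

print(clean_up("do not"))       # no leading space -> unchanged
print(clean_up("ever do not"))  # leading space -> contraction inserted
```

If this is the cause, passing `clean_up_tokenization_spaces=False` to `batch_decode` may be enough to make the test deterministic.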

mauicv · May 16 '22 10:05