Intermittent Tokenizer Test
This test fails intermittently in CI. The test checks that punctuation is not sampled, but the tokenizer introduces punctuation in certain cases when decoding, which causes the check to fail. For instance, in the snippet below the tokenizer merges "do not" -> "don't", but only when the pair is preceded by another token ("ever"); decoded on its own, the same pair stays "do not".
```python
>>> from alibi.utils.lang_model import BertBaseUncased
>>> tokenizer = BertBaseUncased().tokenizer
>>> ids = [[2412, 2079, 2025]]
>>> tokenizer.batch_decode(ids, skip_special_tokens=True)
["ever don't"]
>>> ids = [[2079, 2025]]
>>> tokenizer.batch_decode(ids, skip_special_tokens=True)
['do not']
```
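If the contraction comes from the decoder's clean-up pass (Hugging Face's `batch_decode` applies `clean_up_tokenization_spaces=True` by default) rather than from the vocabulary itself, disabling that pass should make the decode literal and the test deterministic. A minimal diagnostic sketch; that disabling clean-up removes the contraction in this case is an assumption to verify, not a confirmed fix:

```python
from alibi.utils.lang_model import BertBaseUncased

tokenizer = BertBaseUncased().tokenizer
ids = [[2412, 2079, 2025]]

# Inspect the raw wordpieces that decode() starts from, before any
# post-processing is applied.
print(tokenizer.convert_ids_to_tokens(ids[0]))

# batch_decode accepts clean_up_tokenization_spaces; setting it to False
# skips the post-processing pass that can rewrite decoded text.
print(tokenizer.batch_decode(ids, skip_special_tokens=True,
                             clean_up_tokenization_spaces=False))
```

Comparing the raw wordpieces with the two decoded outputs should show whether the "don't" merge happens during clean-up, which would explain why it only appears in some sampled contexts.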