endOfWordSuffix does not take effect in BPE model

Open season-studio opened this issue 5 months ago • 0 comments

Even when endOfWordSuffix is not nil, the BPE tokenizer does not append it to the end of each word.

For example, in Python, "Hello World!" is split into the tokens ['<|startoftext|>', 'hello</w>', 'world</w>', '!</w>', '<|endoftext|>']. In this project, the result is ['<|startoftext|>', 'hello', 'world', '!', '<|endoftext|>'], with no suffix on any word.

This bug is in the MergeWord method of BPE. The line currRuneIdx++ appears after the if currRuneIdx == len(chars) branch, so when the check runs on the last rune the index has not yet been advanced and still equals len(chars) - 1. As a result, the if currRuneIdx == len(chars) branch never takes effect. In addition, the if currRuneIdx == len(chars) branch comes after the if byteIdx == 0 branch, so single-letter words are never combined with endOfWordSuffix.
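
To make the intended behavior concrete, here is a minimal sketch in Go. It is not the project's MergeWord: the helper name splitWithSuffix and the suffix value "</w>" are assumptions chosen for illustration. It only shows the two properties described above: the last-rune check uses the index of the rune currently being processed, and it does not depend on whether that rune is also the first one.

```go
package main

import "fmt"

// splitWithSuffix is a hypothetical helper, not the project's MergeWord.
// It splits a word into one symbol per rune and appends endOfWordSuffix
// to the last symbol, including when the word has only one rune.
func splitWithSuffix(word string, endOfWordSuffix *string) []string {
	chars := []rune(word)
	symbols := make([]string, 0, len(chars))
	for i, r := range chars {
		s := string(r)
		// Check for the last rune with the index of the current rune,
		// and do it independently of any "first rune" handling.
		if i == len(chars)-1 && endOfWordSuffix != nil {
			s += *endOfWordSuffix
		}
		symbols = append(symbols, s)
	}
	return symbols
}

func main() {
	suffix := "</w>" // assumed suffix value, for illustration only

	fmt.Println(splitWithSuffix("hello", &suffix)) // [h e l l o</w>]
	fmt.Println(splitWithSuffix("!", &suffix))     // [!</w>]
	fmt.Println(splitWithSuffix("hello", nil))     // [h e l l o]
}
```

The single-rune case ("!") is the one that an else-if ordering would miss, because the first rune is also the last one.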

season-studio • Sep 02 '25 06:09