endOfWordSuffix does not take effect in BPE model

Open season-studio opened this issue 5 months ago • 0 comments

Even when endOfWordSuffix is set (not nil), the BPE tokenizer does not append it to the end of each word.

The bug is in the MergeWord method of BPE, and it has two causes. First, the line currRuneIdx++ comes after the if currRuneIdx == len(chars) branch, so at the moment of the check currRuneIdx is at most len(chars)-1 and the branch never takes effect. Second, the if currRuneIdx == len(chars) branch comes after the if byteIdx == 0 branch, so a single-letter word, whose only character is both first and last, is handled by the first branch and is never combined with endOfWordSuffix.
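To make the ordering problem concrete, here is a minimal, self-contained sketch. It is not the library's MergeWord code: the functions buggySymbols and fixedSymbols, the plain string suffix (the real field is nil-able), and the "</w>" value are assumptions for illustration only. Only the names currRuneIdx, byteIdx, and chars are taken from the description above.

```go
package main

import "fmt"

// buggySymbols is a hypothetical stand-in for the per-character pass described
// in this issue. It reproduces the reported ordering: the last-rune check is
// chained after the first-rune check, and currRuneIdx is incremented afterwards.
func buggySymbols(word, suffix string) []string {
	chars := []rune(word)
	out := make([]string, 0, len(chars))
	currRuneIdx := 0
	for byteIdx, r := range word {
		s := string(r)
		switch {
		case byteIdx == 0:
			// First rune: taken first, so the suffix case is skipped
			// even when this is also the last rune.
		case currRuneIdx == len(chars):
			// Never reached: currRuneIdx is at most len(chars)-1 here.
			s += suffix
		}
		currRuneIdx++
		out = append(out, s)
	}
	return out
}

// fixedSymbols shows one possible reordering (an assumption, not a patch taken
// from the repository): count the rune first, then test for the last rune
// independently of the first-rune case.
func fixedSymbols(word, suffix string) []string {
	chars := []rune(word)
	out := make([]string, 0, len(chars))
	currRuneIdx := 0
	for _, r := range word {
		s := string(r)
		currRuneIdx++
		if currRuneIdx == len(chars) {
			s += suffix
		}
		out = append(out, s)
	}
	return out
}

func main() {
	const suffix = "</w>" // illustrative end-of-word suffix
	fmt.Println(buggySymbols("ab", suffix), fixedSymbols("ab", suffix))
	fmt.Println(buggySymbols("a", suffix), fixedSymbols("a", suffix))
}
```

In this sketch the buggy variant leaves both "ab" and "a" without a suffix, while the reordered variant attaches it to the last character in both cases, including the single-letter word.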

Sep 02 '25 06:09 season-studio