Bug in BPE algorithm
I think the BPE algorithm is not working properly. This code snippet reproduces the bug.
import sentencepiece as spm
vocab_size= 9
model_prefix = 'model'
train_data_file = 'corpus.txt'
text = "bc a aaa"
with open(train_data_file, 'w') as fout:
fout.write(text)
spm.SentencePieceTrainer.Train(f'--input={train_data_file} --model_prefix={model_prefix} --vocab_size={vocab_size} --model_type=bpe')
sp = spm.SentencePieceProcessor();
sp.Load(model_prefix + '.model')
for i in range(vocab_size):
s = sp.IdToPiece(i)
s = s.replace(chr(9601), '_')
print(f'i: {i} piece: {s}')
The input for BPE is "bc a aaa"
I got the following output.
i: 0 piece: <unk>
i: 1 piece: <s>
i: 2 piece: </s>
i: 3 piece: _a
i: 4 piece: bc
i: 5 piece: a
i: 6 piece: _
i: 7 piece: b
i: 8 piece: c
The first merged pair should be _ + a = _a and it's correct. (i: 3 piece: _a)
After that the second pair should be a + a = aa but algorithm produced pair bc. (i: 4 piece: bc)
I used the debug output and found out that at the second iteration here symbol->freq for aa is 0 and symbol->freq for bc is 1.
It happened because logic in this if statement is incorrect. You can't simply remove this positions.
Thank you for the report.
Do you think we can fix this issue just by removing the condition in the if statement? Let me run large test to make sure it biring no side effects. (In real corpus, I believe that it will not have huge different)
@xbelonogov , I agree that after the first merge, the count for "aa" should be 1. Indeed, if we note "X=aa", the corpus becomes "bcXXaa". Then both "bc" and "aa" have a count of 1, in fact, all the bigrams occur only once. So why would you prefer "aa" to be merged instead of "bc" as they should have the same counts? Thanks.
I see that the v0.1.96 output is still "bc", not "aa". I put some logs to watch the process, the first iteration: symbol: ▁a freq: 2 symbol: aa freq: 1 symbol: ▁b freq: 1 symbol: bc freq: 1
▁a put into the final_pieces_, and the freq of "aa" reset to 0.
the second iteration: symbol: aa freq: 0 symbol: ▁b freq: 1 symbol: bc freq: 1 symbol: _aa freq: 1
sentencepiece put "bc" into the final_pieces.
Is there a problem with this process?
Sorry for the late response. This bug is going to be fixed in the next release.
Fixed in https://github.com/google/sentencepiece/releases/tag/v0.1.99