sentencepiece icon indicating copy to clipboard operation
sentencepiece copied to clipboard

Bug in BPE algorithm

Open xbelonogov opened this issue 7 years ago • 3 comments

I think the BPE algorithm is not working properly. This code snippet reproduces the bug.

import sentencepiece as spm

vocab_size= 9
model_prefix = 'model'
train_data_file = 'corpus.txt'
text = "bc a aaa"

with open(train_data_file, 'w') as fout:
    fout.write(text)

spm.SentencePieceTrainer.Train(f'--input={train_data_file} --model_prefix={model_prefix} --vocab_size={vocab_size} --model_type=bpe')

sp = spm.SentencePieceProcessor();
sp.Load(model_prefix + '.model')
for i in range(vocab_size):
    s = sp.IdToPiece(i)
    s = s.replace(chr(9601), '_')
    print(f'i: {i} piece: {s}') 

The input for BPE is "bc a aaa" I got the following output.

i: 0 piece: <unk>
i: 1 piece: <s>
i: 2 piece: </s>
i: 3 piece: _a
i: 4 piece: bc
i: 5 piece: a
i: 6 piece: _
i: 7 piece: b
i: 8 piece: c

The first merged pair should be _ + a = _a and it's correct. (i: 3 piece: _a) After that the second pair should be a + a = aa but algorithm produced pair bc. (i: 4 piece: bc) I used the debug output and found out that at the second iteration here symbol->freq for aa is 0 and symbol->freq for bc is 1.

It happened because logic in this if statement is incorrect. You can't simply remove this positions.

xbelonogov avatar Apr 17 '19 12:04 xbelonogov

Thank you for the report.

Do you think we can fix this issue just by removing the condition in the if statement? Let me run large test to make sure it biring no side effects. (In real corpus, I believe that it will not have huge different)

taku910 avatar Apr 24 '19 16:04 taku910

@xbelonogov , I agree that after the first merge, the count for "aa" should be 1. Indeed, if we note "X=aa", the corpus becomes "bcXXaa". Then both "bc" and "aa" have a count of 1, in fact, all the bigrams occur only once. So why would you prefer "aa" to be merged instead of "bc" as they should have the same counts? Thanks.

tombosc avatar Apr 22 '20 17:04 tombosc

I see that the v0.1.96 output is still "bc", not "aa". I put some logs to watch the process, the first iteration: symbol: ▁a freq: 2 symbol: aa freq: 1 symbol: ▁b freq: 1 symbol: bc freq: 1

▁a put into the final_pieces_, and the freq of "aa" reset to 0.

the second iteration: symbol: aa freq: 0 symbol: ▁b freq: 1 symbol: bc freq: 1 symbol: _aa freq: 1

sentencepiece put "bc" into the final_pieces.

Is there a problem with this process?

xiefangqi avatar Dec 07 '21 12:12 xiefangqi

Sorry for the late response. This bug is going to be fixed in the next release.

taku910 avatar Apr 24 '23 07:04 taku910

Fixed in https://github.com/google/sentencepiece/releases/tag/v0.1.99

taku910 avatar May 02 '23 04:05 taku910