
How does a custom merges file affect the tokenizer?

crazysal opened this issue 4 years ago · 0 comments

What is the default value used for `tokenizer-merges-file`?

Do you use the default `merges_gpt2.txt`, or the custom file with digit merges removed, `merges_gpt2_single_digit_numbers.txt`?

My understanding is that the merges.txt file is built during training of the BBPE (byte-level BPE) tokenizer on the corpus: it gains a new entry (line) at each training iteration, each recording the most frequent byte pair found at that step (see the toy loop below).
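For concreteness, here is a toy character-level version of that loop as I picture it (a simplification for intuition only, not this repo's actual trainer). Note that each iteration appends exactly one pair, i.e. one merges.txt line:

```python
# Toy BPE training loop: each iteration appends exactly one line
# (the most frequent adjacent pair) to the merges list.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Corpus as {word-as-symbol-tuple: frequency}; real BBPE starts from bytes.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
merges = []  # this list is what merges.txt serializes, one pair per line
for _ in range(10):
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = pairs.most_common(1)[0][0]
    merges.append(best)          # one new merges.txt entry per iteration
    words = merge_pair(words, best)
print("\n".join(f"{a} {b}" for a, b in merges))
```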

How did you verify this design decision? I understand the need for the "clean" merges file, but wouldn't using a new merges file with pre-trained weights be an error, since tokens are now missing compared to what GPT-2 was trained with?
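To make the concern concrete, here is a hypothetical check with transformers' GPT2Tokenizer (the file names are assumptions; substitute whichever vocab/merges files this repo actually ships):

```python
# Hypothetical comparison: same pretrained vocab, two different merges files.
from transformers import GPT2Tokenizer

default = GPT2Tokenizer(vocab_file="vocab.json",
                        merges_file="merges_gpt2.txt")
custom = GPT2Tokenizer(vocab_file="vocab.json",
                       merges_file="merges_gpt2_single_digit_numbers.txt")

text = "It costs 1234 dollars"
print(default.tokenize(text))  # e.g. ['It', 'Ġcosts', 'Ġ12', '34', 'Ġdollars']
print(custom.tokenize(text))   # digits fall back to shorter pieces, e.g.
                               # ['It', 'Ġcosts', 'Ġ1', '2', '3', '4', 'Ġdollars']
                               # (exact splits depend on which merges were removed)
```

If I understand byte-level BPE correctly, the 256 single-byte base tokens always remain in vocab.json, so removing merges should never produce unknown tokens, only longer segmentations; my worry is rather that the pretrained weights now see token sequences (single digits) at frequencies very different from pre-training.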

Or should one re-train the tokenizer itself on the current dataset's vocabulary?
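And if re-training is the intended path, would something like the following be the expected workflow? (A minimal sketch with the Hugging Face tokenizers library; an assumption on my part, since this repo may use its own BPE trainer.)

```python
# Re-train a byte-level BPE tokenizer from scratch on the current corpus,
# producing fresh vocab.json and merges.txt files.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["my_corpus.txt"],           # hypothetical path to the dataset
    vocab_size=50257,                  # matching GPT-2's vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("new_tokenizer")  # writes vocab.json and merges.txt
```

Though my concern there is that the new token IDs would no longer line up with GPT-2's pretrained embedding matrix, so the weights would need re-training or some explicit re-mapping.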

crazysal · Jul 27 '21 22:07