How does a custom merges file affect the tokenizer?
What is the default value used for the tokenizer-merges-file?
Do you use the default merges_gpt2.txt or the custom digits-removed file merges_gpt2_single_digit_numbers.txt?
My understanding is that the merges.txt file is built during training of the BBPE (Byte-Level BPE) tokenizer on the corpus: it gets a new entry (line) at each iteration, when the tokenizer merges the most frequent byte pair.
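To make my understanding concrete, here is a minimal sketch of how such a merges file is produced, using the Hugging Face `tokenizers` library (which I assume is the kind of tooling involved; the corpus path, output directory, and parameters are placeholders):

```python
# Minimal sketch: training a byte-level BPE tokenizer from scratch.
# Each merge rule learned during training becomes one line in merges.txt.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],            # training corpus (placeholder path)
    vocab_size=50257,                # GPT-2-sized vocabulary
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
# Writes vocab.json and merges.txt: one merge rule per line,
# in the order the pairs were learned.
tokenizer.save_model("my_tokenizer")
```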
How did you verify this design decision? I understand the need for the "clean" merges file, but wouldn't using a new merges file with pre-trained weights be an error? Since there are now tokens missing compared to what GPT-2 was trained with?
Or should one re-train the tokenizer itself on the current dataset's vocabulary?
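For reference, this is the kind of comparison I have in mind: load the same GPT-2 vocabulary once with the default merges file and once with the digit-stripped one, and compare how a number is tokenized. The file names here are assumptions based on this thread, not something I have verified against your setup:

```python
# Hedged sketch: same vocab, two different merges files.
from transformers import GPT2Tokenizer

default_tok = GPT2Tokenizer(
    vocab_file="gpt2-vocab.json",
    merges_file="merges_gpt2.txt",
)
custom_tok = GPT2Tokenizer(
    vocab_file="gpt2-vocab.json",
    merges_file="merges_gpt2_single_digit_numbers.txt",
)

text = "The year 2023"
print(default_tok.tokenize(text))  # multi-digit merges may apply here
print(custom_tok.tokenize(text))   # numbers should split into single-digit tokens
print(default_tok.encode(text))
print(custom_tok.encode(text))     # same vocab file, so the token IDs still exist
```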