AddedVocabulary does not play well with the Model
Current state
The AddedVocabulary adds new tokens on top of the Model, making the following assumption: "The Model will never change".
So, this makes a few things impossible:
- Re-training/changing the `Model` after tokens were added
- Adding tokens manually at the very beginning of the vocab, for example before training
- We can't extract these added tokens during the pre-processing step of the training, which would be desirable to fix issues like #438
Goal
Make the AddedVocabulary more versatile and robust toward the change/update of the Model, and also capable of having tokens at the beginning and at the end of the vocabulary.
This would probably require that the AddedVocabulary is always the one doing the tokens<->ids conversion, providing the vocab, etc., effectively letting us keep the Model unaware of its existence.
Sorry to have taken forever on this issue! I looked into it and I think I understand the problem now, more or less.
As I understood it, the assumption that "the model never changes" is cemented in this line: https://github.com/huggingface/tokenizers/blob/ae6534f12db021aeed9938d2a46b045dc85fdb77/tokenizers/src/tokenizer/added_vocabulary.rs#L252 => When adding a token to the AddedVocabulary the id corresponds to the size of the model's vocab at the time of adding the token, but cannot change anymore if the model's vocab changes afterward.
Currently, the class flow when doing the tokens<->ids conversion is implemented as follows (please correct me if I'm wrong): the caller asks `TokenizerImpl` to convert via `token_to_id`. The `TokenizerImpl` then just forwards the call to its `self.added_vocabulary`, but also passes a read-only reference to `self.model` to `self.added_vocabulary`'s `token_to_id` method. Then the `AddedVocabulary` checks if the token is in the `Model`'s vocab (usually `self.vocab`) and, if it's not, checks in its own map.
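If it helps, here is that flow written out as a rough Python-style sketch (the real code is Rust; the field and method names below are only approximations of `TokenizerImpl` and `AddedVocabulary`, not the actual implementation):

```python
class TokenizerImpl:
    def token_to_id(self, token):
        # TokenizerImpl does not look at the Model itself; it forwards the
        # call and hands the AddedVocabulary a read-only view of the model.
        return self.added_vocabulary.token_to_id(token, self.model)

class AddedVocabulary:
    def token_to_id(self, token, model):
        # First ask the Model, then fall back to the added-tokens map.
        token_id = model.token_to_id(token)
        if token_id is not None:
            return token_id
        return self.added_tokens_map.get(token)
```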
I see three different approaches for now:
1. As proposed, to make `AddedVocabulary` more versatile, instead of passing `self.model` to the `AddedVocabulary` we could pass `self.model.vocab` to the `AddedVocabulary` for the `token_to_id` and `id_to_token` methods. This way, we are always aware of the current vocab's size. Following this approach, I guess it then makes more sense to save "relative" token ids in `AddedVocabulary` instead of absolute ones for those tokens that are appended to the end of the vocabulary, no? Also, the `add_tokens` method of `AddedVocabulary` would not need a reference to the model anymore. Tokens that are appended to the beginning of the vocabulary could get absolute ids and tokens that are appended to the end could get relative ids => so `token_to_id` in `AddedVocabulary` could look something like:
```python
def token_to_id(self, token, vocab):
    if token in vocab:
        return vocab[token]
    elif token in self.begin_added_tokens:
        return self.begin_added_tokens[token]
    elif token in self.end_added_tokens:
        return len(vocab) + self.end_added_tokens[token]
```
As I understood it, does this mean we should delete all `token_to_id` and `id_to_token` methods of the models? Since the `AddedVocabulary` has the vocab, we don't need them anymore, no?
One small problem I see here is that not all models have the same type of "vocabulary": `BPE`, `WordPiece` and `WordLevel` all have `self.vocab` and `self.vocab_r`, but `Unigram` seems to use a `self.tokens_to_ids` of type `TokenMap` instead of `Vocab` -> maybe change that for `Unigram`?
2. Or should we give `AddedVocabulary` a mutable reference to the vocab so that the vocab can be changed directly? Then we wouldn't need an additional `added_tokens_map` in `AddedVocabulary`. Approach 1) sounds better to me, but I'm not 100% sure.
3. Or a completely different method would be to let the `TokenizerImpl` take more responsibility for `token_to_id`, so that the `TokenizerImpl` class actually checks whether the token is in `self.model`'s vocab and, if not, just passes the size of `self.model.vocab` to the `AddedVocabulary`'s `token_to_id` method (see the sketch right after this list). This approach actually seems the most logical to me, since both `AddedVocabulary` and `Model` are members of `TokenizerImpl`, so ideally `AddedVocabulary` should not handle any `Model` logic but just receive from `TokenizerImpl` the minimal information it needs from `model.vocab` (which is just the size of the vocab, IMO) for `token_to_id`.
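To make 3) a bit more concrete, here is a rough Python-style sketch of what I have in mind (again only pseudocode of the Rust internals; `begin_added_tokens`, `end_added_tokens` and `get_vocab_size` are placeholder names, not the actual API):

```python
class TokenizerImpl:
    def token_to_id(self, token):
        # The TokenizerImpl keeps all Model-specific logic to itself...
        token_id = self.model.token_to_id(token)
        if token_id is not None:
            return token_id
        # ...and only hands the AddedVocabulary the vocab size it needs.
        return self.added_vocabulary.token_to_id(token, self.model.get_vocab_size())

class AddedVocabulary:
    def token_to_id(self, token, vocab_size):
        # Tokens added at the beginning keep absolute ids; tokens added at
        # the end get ids relative to the current size of the model's vocab.
        if token in self.begin_added_tokens:
            return self.begin_added_tokens[token]
        if token in self.end_added_tokens:
            return vocab_size + self.end_added_tokens[token]
        return None
```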
Both 1) & 3) seem reasonable to me... I'd be super happy to hear about any misunderstanding on my side, and about your feedback @n1t0 :-)
Awesome! There's no rush so take the time you need!
I'll add a few pieces of information that may help with the decision:
- Every `Model` is different, and so we can't rely on their internals. The only thing we have to deal with all of them is the interface described in the `trait Model`. Also, any `Model` can be used independently of the `TokenizerImpl`, so it makes sense to keep all these methods.
- The `AddedVocabulary` exists because we can't really modify the vocabulary of the `Model` in many cases. For example with `BPE`, we can't add a new token because it would require us to add all the necessary merge operations too, but we can't do it. Same with `Unigram`. So the `AddedVocabulary` works by adding vocabulary on top of the `Model`, and it has the responsibility of choosing which one to use.
So I think approach 1) is probably the most likely to work, but we'll have to stick with what's available on trait Model.
One thing that is very important and might be tricky to handle (I haven't dug into this at all for now) is the serialization/deserialization of the TokenizerImpl and AddedVocabulary. We need all the existing tokenizers out in the wild to keep working, while probably having to add some info to indicate whether a token is at the beginning or the end of the vocabulary. (It will probably be necessary to add some tests there.) I'll be able to help on this side of course!
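Purely for illustration, deserialization could stay backward compatible by defaulting any new field when it is missing from old files. In the sketch below, the "position" key is hypothetical (it does not exist in the current tokenizer.json format); only the top-level "added_tokens" list is real:

```python
import json

def load_added_tokens(path):
    with open(path) as f:
        data = json.load(f)
    for entry in data.get("added_tokens", []):
        # Old tokenizer.json files carry no position information, so default
        # to "end" to preserve today's behaviour (tokens appended after the
        # model's vocabulary).
        position = entry.get("position", "end")
        yield entry["content"], entry["id"], position
```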
Hi @n1t0 @patrickvonplaten I just came across the third issue @n1t0 mentioned:
> We can't extract these added tokens during the pre-processing step of the training, which would be desirable to fix issues like #438
In my use case I want to replace all the numbers with an added token '[NUM]'. My code is as follows:
```python
from tokenizers import AddedToken, Regex, Tokenizer, normalizers
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())
NUM_token = AddedToken('[NUM]')
tokenizer.add_tokens([NUM_token])
# Normalization is applied in both training and encoding. For training see:
# https://github.com/huggingface/tokenizers/blob/adf90dcd722cbc20af930441988b317a19815878/tokenizers/src/tokenizer/mod.rs#L1028
tokenizer.normalizer = normalizers.Replace(Regex('[0-9]+'), '[NUM]')
trainer = UnigramTrainer()
# `iterator` yields the raw training text
tokenizer.train_from_iterator(iterator, trainer)
```
During encoding, `extract_and_normalize` of the `added_vocabulary` is called, which breaks the input sequence on added tokens so as to prevent tokenizing them; but during training no such thing takes place, and added tokens present in the corpus are used in model training.
The solution might be to ask `added_vocabulary` to break the input sequences on added tokens and then feed the pieces to the trainer.
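In the meantime, a user-level approximation of that idea might be to do the splitting by hand before the corpus reaches the trainer (this reuses `tokenizer` and `trainer` from the snippet above; `split_on_added_tokens` is just a helper I made up for illustration):

```python
import re

def split_on_added_tokens(lines, added_tokens):
    # Break every training sequence on the added tokens so the trainer never
    # sees them, mirroring what extract_and_normalize does at encoding time.
    pattern = re.compile("|".join(re.escape(t) for t in added_tokens))
    for line in lines:
        # The normalizer only runs inside the library during training, so the
        # number replacement is applied by hand here as well.
        line = re.sub(r"[0-9]+", "[NUM]", line)
        for fragment in pattern.split(line):
            if fragment.strip():
                yield fragment

corpus = ["price is 42 dollars", "room 101 is free"]
tokenizer.train_from_iterator(split_on_added_tokens(corpus, ["[NUM]"]), trainer=trainer)
```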
> The solution might be to ask `added_vocabulary` to break the input sequences on added tokens and then feed the pieces to the trainer.
I think you're spot on.
PRs are welcome on this, but as you might understand this is a very delicate change to operate, as backward compatibility is now critical for transformers to work properly. So this PR will take time no matter who takes it up.
(Also, we're short on time to dedicate to tokenizers unfortunately; we still have to release 13.0, which is going to happen as soon as I have time to dedicate to it.)