AddedVocabulary does not play well with the Model
Current state
The AddedVocabulary adds new tokens on top of the Model, making the following assumption: "The Model will never change".
So, this makes a few things impossible:
- Re-training/changing the `Model` after tokens were added
- Adding tokens manually at the very beginning of the vocab, for example before training
- We can't extract these added tokens during the pre-processing step of the training, which would be desirable to fix issues like #438
Goal
Make the AddedVocabulary more versatile and robust toward the change/update of the Model, and also capable of having tokens at the beginning and at the end of the vocabulary.
This would probably require that the AddedVocabulary is always the one doing the tokens<->ids conversion, providing the vocab, etc., effectively letting us keep the Model unaware of its existence.
Sorry to have taken forever on this issue! I looked into it and I think I understand the problem now, more or less.
As I understood it, the assumption that "the model never changes" is cemented in this line: https://github.com/huggingface/tokenizers/blob/ae6534f12db021aeed9938d2a46b045dc85fdb77/tokenizers/src/tokenizer/added_vocabulary.rs#L252 => When adding a token to the AddedVocabulary the id corresponds to the size of the model's vocab at the time of adding the token, but cannot change anymore if the model's vocab changes afterward.
Currently, the class flow when doing the tokens<->ids conversion is implemented as follows (please correct me if I'm wrong): the caller asks `TokenizerImpl` to convert via `token_to_id`. The `TokenizerImpl` then just forwards the call to its `self.added_vocabulary`, but also passes a read-only reference to `self.model` to `self.added_vocabulary`'s `token_to_id` method. Then the `AddedVocabulary` checks if the token is in the `Model`'s vocab (usually `self.vocab`) and, if it's not, checks in its own map.
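If it helps, here is that flow written out as a rough Python-style sketch (the real code is Rust; the field and method names below are only approximations of `TokenizerImpl` and `AddedVocabulary`, not the actual implementation):

```python
class TokenizerImpl:
    def token_to_id(self, token):
        # TokenizerImpl does not look at the Model itself; it forwards the
        # call and hands the AddedVocabulary a read-only view of the model.
        return self.added_vocabulary.token_to_id(token, self.model)

class AddedVocabulary:
    def token_to_id(self, token, model):
        # First ask the Model, then fall back to the added-tokens map.
        token_id = model.token_to_id(token)
        if token_id is not None:
            return token_id
        return self.added_tokens_map.get(token)
```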
I see three different approaches for now:
1. As proposed, to make `AddedVocabulary` more versatile, instead of passing `self.model` to the `AddedVocabulary` we could pass `self.model.vocab` to the `AddedVocabulary` for the `token_to_id` and `id_to_token` methods. This way, we are always aware of the current vocab's size. Following this approach, I guess it then makes more sense to save "relative" token ids in `AddedVocabulary` instead of absolute ones for those tokens that are appended to the end of the vocabulary, no? Also, the `add_tokens` method of `AddedVocabulary` would not need a reference to the model anymore. Tokens that are appended to the beginning of the vocabulary could get absolute ids and tokens that are appended to the end could get relative ids => so `token_to_id` in `AddedVocabulary` could look something like:
```python
def token_to_id(self, token, vocab):
    if token in vocab:
        return vocab[token]
    elif token in self.begin_added_tokens:
        return self.begin_added_tokens[token]
    elif token in self.end_added_tokens:
        return len(vocab) + self.end_added_tokens[token]
```
As I understood it, does this mean we should delete all `token_to_id` and `id_to_token` methods of the models? Since the `AddedVocabulary` has the vocab, we don't need them anymore, no?
One small problem I see here is that not all models have the same type of "vocabulary": `BPE`, `WordPiece` and `WordLevel` all have `self.vocab` and `self.vocab_r`, but `Unigram` seems to use a `self.tokens_to_ids` of type `TokenMap` instead of `Vocab` -> maybe change that for `Unigram`?
2. Or should we give `AddedVocabulary` a mutable reference to the vocab so that the vocab can be changed directly? Then we wouldn't need an additional `added_tokens_map` in `AddedVocabulary`. Approach 1) sounds better to me, but I'm not 100% sure.
3. Or a completely different method would be to let the `TokenizerImpl` take more responsibility for `token_to_id`, so that the `TokenizerImpl` class actually checks whether the token is in `self.model`'s vocab and, if not, just passes the size of `self.model.vocab` to the `AddedVocabulary`'s `token_to_id` method (see the sketch right after this list). This approach actually seems the most logical to me, since both `AddedVocabulary` and `Model` are members of `TokenizerImpl`, so ideally `AddedVocabulary` should not handle any `Model` logic but just receive from `TokenizerImpl` the minimal information it needs from `model.vocab` (which is just the size of the vocab, IMO) for `token_to_id`.
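To make 3) a bit more concrete, here is a rough Python-style sketch of what I have in mind (again only pseudocode of the Rust internals; `begin_added_tokens`, `end_added_tokens` and `get_vocab_size` are placeholder names, not the actual API):

```python
class TokenizerImpl:
    def token_to_id(self, token):
        # The TokenizerImpl keeps all Model-specific logic to itself...
        token_id = self.model.token_to_id(token)
        if token_id is not None:
            return token_id
        # ...and only hands the AddedVocabulary the vocab size it needs.
        return self.added_vocabulary.token_to_id(token, self.model.get_vocab_size())

class AddedVocabulary:
    def token_to_id(self, token, vocab_size):
        # Tokens added at the beginning keep absolute ids; tokens added at
        # the end get ids relative to the current size of the model's vocab.
        if token in self.begin_added_tokens:
            return self.begin_added_tokens[token]
        if token in self.end_added_tokens:
            return vocab_size + self.end_added_tokens[token]
        return None
```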
Both 1) & 3) seem reasonable to me... I'd be super happy to hear about any misunderstanding on my side, and about your feedback @n1t0 :-)
Awesome! There's no rush so take the time you need!
I'll add a few pieces of information that may help with the decision:
- Every `Model` is different, and so we can't rely on their internals. The only thing we have to deal with all of them is the interface described in the `trait Model`. Also, any `Model` can be used independently of the `TokenizerImpl`, so it makes sense to keep all these methods.
- The `AddedVocabulary` exists because we can't really modify the vocabulary of the `Model` in many cases. For example with `BPE`, we can't add a new token because it would require us to add all the necessary merge operations too, but we can't do it. Same with `Unigram`. So the `AddedVocabulary` works by adding vocabulary on top of the `Model`, and it has the responsibility of choosing which one to use.
So I think approach 1) is probably the most likely to work, but we'll have to stick with what's available on trait Model.
One thing that is very important and might be tricky to handle (I haven't dug into this at all for now) is the serialization/deserialization of the TokenizerImpl and AddedVocabulary. We need all the existing tokenizers out in the wild to keep working, while probably having to add some info to indicate whether a token is at the beginning or the end of the vocabulary. (It will probably be necessary to add some tests there.) I'll be able to help on this side of course!
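Purely for illustration, deserialization could stay backward compatible by defaulting any new field when it is missing from old files. In the sketch below, the "position" key is hypothetical (it does not exist in the current tokenizer.json format); only the top-level "added_tokens" list is real:

```python
import json

def load_added_tokens(path):
    with open(path) as f:
        data = json.load(f)
    for entry in data.get("added_tokens", []):
        # Old tokenizer.json files carry no position information, so default
        # to "end" to preserve today's behaviour (tokens appended after the
        # model's vocabulary).
        position = entry.get("position", "end")
        yield entry["content"], entry["id"], position
```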
Hi @n1t0 @patrickvonplaten I just came across the third issue @n1t0 mentioned:
> We can't extract these added tokens during the pre-processing step of the training, which would be desirable to fix issues like #438
In my use case I want to replace all the numbers with an added token '[NUM]'. My code is as follows:
```python
from tokenizers import AddedToken, Regex, Tokenizer, normalizers
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

tokenizer = Tokenizer(Unigram())
NUM_token = AddedToken('[NUM]')
tokenizer.add_tokens([NUM_token])
# Normalization is applied in both training and encoding. For training see:
# https://github.com/huggingface/tokenizers/blob/adf90dcd722cbc20af930441988b317a19815878/tokenizers/src/tokenizer/mod.rs#L1028
tokenizer.normalizer = normalizers.Replace(Regex('[0-9]+'), '[NUM]')
trainer = UnigramTrainer()
# `iterator` yields the raw training text
tokenizer.train_from_iterator(iterator, trainer)
```
During encoding, `extract_and_normalize` of the `added_vocabulary` is called, which breaks the input sequence on added tokens so as to prevent tokenizing them; but during training no such thing takes place, and added tokens present in the corpus are used in model training.
The solution might be to ask `added_vocabulary` to break the input sequences on added tokens and then feed the pieces to the trainer.
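In the meantime, a user-level approximation of that idea might be to do the splitting by hand before the corpus reaches the trainer (this reuses `tokenizer` and `trainer` from the snippet above; `split_on_added_tokens` is just a helper I made up for illustration):

```python
import re

def split_on_added_tokens(lines, added_tokens):
    # Break every training sequence on the added tokens so the trainer never
    # sees them, mirroring what extract_and_normalize does at encoding time.
    pattern = re.compile("|".join(re.escape(t) for t in added_tokens))
    for line in lines:
        # The normalizer only runs inside the library during training, so the
        # number replacement is applied by hand here as well.
        line = re.sub(r"[0-9]+", "[NUM]", line)
        for fragment in pattern.split(line):
            if fragment.strip():
                yield fragment

corpus = ["price is 42 dollars", "room 101 is free"]
tokenizer.train_from_iterator(split_on_added_tokens(corpus, ["[NUM]"]), trainer=trainer)
```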
> The solution might be to ask `added_vocabulary` to break the input sequences on added tokens and then feed the pieces to the trainer.
I think you're spot on.
PRs are welcome on this, but as you might understand this is a very delicate change to operate, as backward compatibility is now critical for transformers to work properly. So this PR will take time no matter who takes it up.
(Also, we're short on time to dedicate to tokenizers unfortunately; we still have to release 13.0, which is going to happen as soon as I have time to dedicate to it.)