[Feature] Add ability to manually edit vocabulary (add/remove subwords)
For my use-case, I end up with a bunch of subwords in my vocabulary that never show up after encoding my training set. I'd like to be able to remove these subwords from my vocabulary.
Is it really important to be able to remove some tokens if you could just preprocess your training data so some characters are not going to be merged together after training the YTTM model?
I do want the characters to be merged together, as an intermediate step on the way towards a larger subword. I want to retain that larger subword but remove some of the intermediate ones.
Scenario: if "hippopotamus" is a common word in my documents, then YTTM will learn to include "hippopotamus" in my subword vocabulary. However, that means the vocabulary will also contain some number of intermediate subwords that were necessary to form "hippopotamus" but are very unlikely to ever appear in my encoded data (e.g., "potamus").
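To make the scenario concrete, here is a minimal sketch of a toy BPE trainer. It is not YTTM's implementation, and the corpus, `train_bpe`, and `merge_pair` are made up for illustration. It learns merges on a tiny word-count table, then lists the learned subwords that never appear in the final segmentation. The 20-merge budget is an assumption chosen so that every word in this toy corpus ends up fully merged.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Toy BPE: `words` maps word -> count. Returns (merges, segmented corpus)."""
    corpus = {tuple(w): c for w, c in words.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges, corpus

# Hypothetical toy corpus: one dominant long word plus two short ones.
words = {"hippopotamus": 50, "hip": 3, "pot": 2}
merges, segmented = train_bpe(words, num_merges=20)

learned = {a + b for a, b in merges}                   # subwords added by merges
used = {s for symbols in segmented for s in symbols}   # subwords seen after encoding
unused = sorted(learned - used)
print(unused)  # intermediate subwords that never appear in the encoded corpus
```

Every entry in `unused` occupies a vocabulary slot even though the encoded training data never contains it, which is the situation described above.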
We agree that this could be considered a downside of the BPE algorithm. However, we believe that such cases are not going to be very frequent.
Furthermore, removing some tokens from vocabulary would require significant changes in the algorithm which wouldn't pay off.
Well that's certainly disappointing to hear. Would you mind pointing me towards where in the codebase I'd need to dive in if I wanted to implement this myself?
Also, this could easily be implemented at training time. When combining subwords ABC and DEF to add subword ABCDEF, just compare the frequencies of ABCDEF, ABC, and DEF (which should already be computed as part of the BPE algorithm). Then if freq(ABCDEF) == freq(ABC), remove ABC from the existing subword vocabulary (in addition to adding ABCDEF).
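The check proposed above can be sketched as follows. This is an illustration only, not code from YTTM: `apply_merge`, `absorbed_parts`, and the example corpus are hypothetical. Instead of comparing frequencies directly, the sketch applies the merge and reports which of the two parts no longer occurs anywhere. For non-overlapping pairs that is the same condition as freq(ABCDEF) == freq(ABC).

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def absorbed_parts(corpus, pair):
    """Return the parts of `pair` that occur nowhere once `pair` is merged.

    `corpus` maps a segmented word (tuple of symbols) to its count.
    """
    remaining = Counter()
    for symbols, count in corpus.items():
        for s in apply_merge(symbols, pair):
            remaining[s] += count
    # dict.fromkeys de-duplicates pairs such as ("a", "a") while keeping order.
    return [part for part in dict.fromkeys(pair) if remaining[part] == 0]

# Hypothetical state just before the final merge: "hippo" also occurs on its
# own, "potamus" only ever occurs next to "hippo".
corpus = {("hippo", "potamus"): 50, ("hippo",): 4}
result = absorbed_parts(corpus, ("hippo", "potamus"))
print(result)  # ['potamus']
```

Whether a part flagged this way can actually be dropped still depends on how the encoder uses the stored merges.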
The issue is that you have to store ABC and DEF if you want to encode some text later.
Consider looking at the function
https://github.com/VKCOM/YouTokenToMe/blob/c2ab3c86c07918dd0f9ef1e0445e6c79f504a64a/youtokentome/cpp/bpe.cpp#L1528
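A minimal sketch of why the parts matter at encoding time, assuming an encoder that replays the learned merges in order. This is a simplification for illustration; the actual encoder in `bpe.cpp` may be organized differently, and `encode` and the merge list below are hypothetical.

```python
def encode(word, merges):
    """Encode one word by replaying merge rules in the order they were learned."""
    symbols = list(word)  # start from single characters
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge list that builds "abcdef" out of "abc" and "def".
merges = [("a", "b"), ("ab", "c"), ("d", "e"), ("de", "f"), ("abc", "def")]
print(encode("abcdef", merges))   # ['abcdef']

# Drop the rule that produces "abc": the final merge can no longer fire.
pruned = [m for m in merges if m[0] + m[1] != "abc"]
print(encode("abcdef", pruned))   # ['ab', 'c', 'def']
```

With this style of encoder, "abcdef" is only reachable through "abc" and "def", so removing either one from the stored rules also makes the larger subword unreachable on new text.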
Why is it necessary to store ABC and DEF? Those are the subwords that I want dropped from my vocabulary (that is, assuming that the only times I observe ABC or DEF are when they are put together as ABCDEF).