Ivan Belonogov
Hi! This behaviour isn't related to BPE-dropout. If the characters `'▁'` and `'h'` did not merge, then they did not occur together often enough. This means that the algorithm, instead of...
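To illustrate why an infrequent pair is unlikely to be merged, here is a minimal sketch of the greedy pair-counting step at the core of BPE. This is plain Python, not YTTM's actual implementation, and the toy corpus is made up:

```python
from collections import Counter

# Toy corpus already split into character-level symbols,
# with '▁' marking the beginning of a word.
words = [["▁", "h", "e", "l", "l", "o"],
         ["▁", "w", "o", "r", "l", "d"],
         ["▁", "h", "e", "l", "p"]]

# Count how often each adjacent pair of symbols occurs.
pair_counts = Counter()
for word in words:
    for left, right in zip(word, word[1:]):
        pair_counts[(left, right)] += 1

# BPE greedily merges the most frequent pair first, so a pair like
# ('▁', 'h') is only merged if it is frequent enough to win one of
# the limited number of merge rounds allowed by the vocabulary size.
print(pair_counts.most_common(3))
```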
Yes, I also meant this special token '▁'. (Edited the previous comment)
It is not obvious to me. In practice, for a reasonably large vocabulary, the special token '▁' is almost always merged with the first symbol.
YTTM is very similar to subword-nmt. The difference is the following:
- In YTTM you can specify the exact number of tokens in the output vocabulary.
- In subword-nmt you...
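For reference, here is a minimal YTTM usage sketch where the output vocabulary size is specified exactly; the file names are placeholders:

```python
import youtokentome as yttm

# Train a BPE model; the resulting vocabulary contains
# exactly vocab_size tokens.
yttm.BPE.train(data="train.txt", model="model.bpe", vocab_size=5000)

# Load the model and encode a sentence into token ids.
bpe = yttm.BPE(model="model.bpe")
print(bpe.encode(["hello world"], output_type=yttm.OutputType.ID))
```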
Hi, @TIXFeniks. Your suggestion looks reasonable, but I don't want to add one more option for disabling this type of split. Every new option makes the interface more cumbersome and decreases...
How much RAM do you have?
Yes, that looks reasonable.
Is it a common situation for datasets to have a significant percentage of non-printable characters? By significant I mean more than 0.1%. Otherwise they can easily be filtered out with the `coverage` option.
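As an illustration, the `coverage` argument of `yttm.BPE.train` can be lowered slightly so that the rarest characters (e.g. non-printable noise) are not given their own tokens. The paths and numbers below are placeholders:

```python
import youtokentome as yttm

# coverage < 1.0 means training only covers that fraction of character
# occurrences by frequency; the rarest characters (often noise such as
# non-printable ones) are treated as unknown instead of getting
# dedicated tokens.
yttm.BPE.train(
    data="train.txt",
    model="model.bpe",
    vocab_size=5000,
    coverage=0.999,  # ignore the rarest ~0.1% of character occurrences
)
```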
We don't support this feature right now. Maybe it will be added later. You can use the following workaround: just append this special token at the end of your training...
No, there is no easy way to do it. If the training data is so large that it does not fit into memory, then most likely you can subsample random...
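A minimal sketch of one way to subsample random lines from a training file that is too large for memory, using reservoir sampling; the file names and sample size are assumptions, not part of YTTM:

```python
import random

def sample_lines(path, k, seed=0):
    """Pick k random lines from a file without loading it all into memory."""
    rng = random.Random(seed)
    sample = []
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)
            else:
                # Reservoir sampling: keep each later line with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = line
    return sample

with open("train_subsample.txt", "w", encoding="utf-8") as out:
    out.writelines(sample_lines("full_corpus.txt", k=1_000_000))
```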