YouTokenToMe
Doesn't consider combining characters.
In several languages there are graphemes that consist of several characters, typically a base character followed by one or more combining characters. For example: a + ◌̈ = ä.
YouTokenToMe assumes that every character is a valid grapheme, so it can generate tokens that start with a combining character.
It would be beneficial to have an option, for both training and encoding, that pre-merges all combining characters into their base characters before running the actual BPE.
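To illustrate the kind of pre-merge step meant here, the following is a minimal sketch in Python using only the standard `unicodedata` module. It is not an existing YouTokenToMe option; the function name `merge_combining` is hypothetical. It assumes that "combining character" means a code point with a non-zero canonical combining class, which is only an approximation of full grapheme clustering (it does not handle, for example, zero-width joiner sequences).

```python
import unicodedata
from typing import List


def merge_combining(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it.

    Hypothetical helper, not part of YouTokenToMe. The text is first
    NFC-normalized so that pairs with a precomposed form (a + U+0308)
    collapse into a single code point; any remaining combining marks are
    attached to the preceding base character.
    """
    clusters: List[str] = []
    for ch in unicodedata.normalize("NFC", text):
        # combining() returns the canonical combining class; 0 means "not a
        # combining mark" under the assumption stated above.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            # A base character, or a stray mark at the very start of the text.
            clusters.append(ch)
    return clusters


# "a" + U+0308 has a precomposed form, "q" + U+0308 does not.
print(merge_combining("a\u0308"))    # one cluster: U+00E4
print(merge_combining("q\u0308x"))   # two clusters: "q" + U+0308, then "x"
```

With units like these, a BPE merge could never split a base character from its marks, so no token would start with a combining character.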