OrderedDict not needed, plus a question and a comment
A. @vthorsteinsson I see you added OrderedDict (and OrderedSet) in late 2019, when Python 3.6 was still around and dict insertion order was not yet a language guarantee.
If you only support 3.7 and higher, it seems you can simplify the code by using the built-in dict instead. I'm not sure whether it will be faster, but hopefully it will:
https://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set
The answer is no, but as of Python 3.7 you can use the simple dict from the Python standard library with just keys (and values as None) for the same purpose.
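To make that concrete, here is a minimal sketch of a plain dict standing in for an ordered set on Python 3.7+. The word list and variable names are made up for illustration and are not taken from the Tokenizer code.

```python
# A plain dict used as an insertion-ordered set (Python 3.7+).
# The sample data and names below are illustrative only.
words = ["að", "vera", "eða", "að", "vera", "ekki"]

# dict.fromkeys keeps the first occurrence of each key, in insertion order,
# with every value set to None.
seen = dict.fromkeys(words)

# Iterating a dict yields its keys, so this is the de-duplicated sequence.
unique_in_order = list(seen)

seen["til"] = None       # equivalent of set.add
seen.pop("eða", None)    # equivalent of set.discard (no error if missing)
has_vera = "vera" in seen  # membership test, same as for a set

print(unique_in_order)
print(list(seen))
```

Set algebra (union, intersection) is not available this way, so whether this can replace OrderedSet depends on which operations the code actually uses.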
https://github.com/mideind/Tokenizer/commit/946ffc7424d2b8be27fff6bd725ebae259fb2323
https://docs.python.org/3/library/collections.html
Ordered dictionaries are just like regular dictionaries but have some extra capabilities relating to ordering operations. They have become less important now that the built-in dict class gained the ability to remember insertion order (this new behavior became guaranteed in Python 3.7).
[make sure to read the rest there.]
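As a rough illustration of the "extra capabilities" the docs mention, the sketch below shows three things OrderedDict does that a plain dict does not. If the code relies on any of these, swapping in dict would change behavior. The keys and values are arbitrary.

```python
from collections import OrderedDict

a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}

# Plain dicts compare equal regardless of insertion order.
plain_equal = a == b

# Two OrderedDicts with the same items in a different order are NOT equal.
oa = OrderedDict(a)
ob = OrderedDict(b)
ordered_equal = oa == ob

# move_to_end exists only on OrderedDict.
oa.move_to_end("x")
keys_after_move = list(oa)

# popitem(last=False) pops from the front; dict.popitem takes no argument
# and always pops the most recently inserted item.
first_item = oa.popitem(last=False)

print(plain_equal, ordered_equal, keys_after_move, first_item)
```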
https://deepsource.com/blog/python-performance-three-easy-tips
When initializing a new dictionary, using {} is much more performant than calling the dict built-in.
https://stackoverflow.com/questions/18422995/why-is-ordereddict-10x-slower-than-dict-and-list
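The two performance claims above are easy to check locally. This is only a sketch: the key count and repetition numbers are arbitrary choices of mine, and the results will vary with Python version and machine. The linked Stack Overflow question dates from before OrderedDict got a C implementation (Python 3.5), so the gap on a current interpreter may well be smaller than 10x.

```python
import timeit
from collections import OrderedDict

N = 20_000               # statements per timing run (arbitrary)
KEYS = list(range(100))  # sample keys (arbitrary)

def bench(stmt):
    """Return the best of three timings, in seconds, for N runs of stmt."""
    namespace = {"OrderedDict": OrderedDict, "KEYS": KEYS}
    return min(timeit.repeat(stmt, globals=namespace, number=N, repeat=3))

results = {
    "{}": bench("{}"),
    "dict()": bench("dict()"),
    "dict.fromkeys(KEYS)": bench("dict.fromkeys(KEYS)"),
    "OrderedDict.fromkeys(KEYS)": bench("OrderedDict.fromkeys(KEYS)"),
}

for name, seconds in results.items():
    print(f"{name:30s} {seconds * 1e6 / N:8.3f} µs per call")
```

Measuring the tokenizer's own hot paths before and after the change would be more telling than a micro-benchmark like this one.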
B. I was looking up tokenization [list], i.e. BPE etc. for LLMs, and dropped in on your repo by accident. Such tokenizers were first made for English and are mostly optimized for it; Icelandic and German are an afterthought, if that, though Chinese has at least been worked on. I agree with Karpathy: I want tokenizers gone, at least in the long run. They are a solution, but also a problem for current LLMs. Do you do any work on such tokenizers or on LLMs?