jaime-m-p
Use flags for each Unicode category (`\p{N}`, `\p{L}`, `\p{Z}`, ...) instead of the `CODEPOINT_TYPE_*` definitions. Also include helper flags for common regex classes such as `\s`, `\d`, `\w` (only `\s` for now)... This...
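To illustrate the idea of per-category flags, here is a minimal sketch. All names (`cpt_flag`, `CPT_FLAG_*`, `cpt_flags_ascii`) are hypothetical and are not taken from this PR, and the lookup only covers a handful of ASCII code points rather than the real tables in `unicode-data.cpp`.

```cpp
#include <cstdint>

// Hypothetical layout: one bit per top-level Unicode category,
// plus one helper bit for the regex class \s.
enum cpt_flag : uint16_t {
    CPT_FLAG_NUMBER      = 1 << 0,  // \p{N}
    CPT_FLAG_LETTER      = 1 << 1,  // \p{L}
    CPT_FLAG_SEPARATOR   = 1 << 2,  // \p{Z}
    CPT_FLAG_MARK        = 1 << 3,  // \p{M}
    CPT_FLAG_PUNCTUATION = 1 << 4,  // \p{P}
    CPT_FLAG_SYMBOL      = 1 << 5,  // \p{S}
    CPT_FLAG_CONTROL     = 1 << 6,  // \p{C}
    CPT_FLAG_WHITESPACE  = 1 << 7,  // helper flag for \s
};

// Toy lookup (assumption: ASCII digits, letters and whitespace only).
// Returns 0 for code points this sketch does not classify.
inline uint16_t cpt_flags_ascii(uint32_t cpt) {
    uint16_t flags = 0;
    if (cpt >= '0' && cpt <= '9') {
        flags |= CPT_FLAG_NUMBER;
    }
    if ((cpt >= 'a' && cpt <= 'z') || (cpt >= 'A' && cpt <= 'Z')) {
        flags |= CPT_FLAG_LETTER;
    }
    if (cpt == ' ') {
        // U+0020 is a space separator (Zs) and matches \s
        flags |= CPT_FLAG_SEPARATOR | CPT_FLAG_WHITESPACE;
    }
    if (cpt == '\t' || cpt == '\n' || cpt == '\r' || cpt == '\f' || cpt == '\v') {
        // ASCII control characters (Cc) that also match \s
        flags |= CPT_FLAG_CONTROL | CPT_FLAG_WHITESPACE;
    }
    return flags;
}
```

With flags, a code point can belong to a category and a helper class at the same time (a tab is both control and whitespace), which a single `CODEPOINT_TYPE_*` value per code point cannot express directly.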
Add all Unicode [categories](https://www.compart.com/en/unicode/category) to `unicode-data.cpp`. Currently we are limited to the top-level categories: C, L, M, N, P, S, Z. This PR allows access to subcategories: Cn, Cc,...
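As a rough sketch of how subcategories relate to the top-level categories: each two-letter Unicode General_Category code starts with the letter of its parent category. The table name, its index order and the helper functions below are assumptions made for illustration and do not reflect the actual layout in `unicode-data.cpp`.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// The 30 two-letter Unicode General_Category codes.
// The index order here is hypothetical.
constexpr std::array<std::string_view, 30> UNICODE_SUBCATEGORIES = {
    "Cn", "Cc", "Cf", "Co", "Cs",              // Other
    "Ll", "Lm", "Lo", "Lt", "Lu",              // Letter
    "Mc", "Me", "Mn",                          // Mark
    "Nd", "Nl", "No",                          // Number
    "Pc", "Pd", "Pe", "Pf", "Pi", "Po", "Ps",  // Punctuation
    "Sc", "Sk", "Sm", "So",                    // Symbol
    "Zl", "Zp", "Zs",                          // Separator
};

// Returns the index of a subcategory code in the table, or -1 if unknown.
constexpr int subcategory_index(std::string_view name) {
    for (std::size_t i = 0; i < UNICODE_SUBCATEGORIES.size(); ++i) {
        if (UNICODE_SUBCATEGORIES[i] == name) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Returns the top-level category letter (C, L, M, N, P, S or Z) of a
// known subcategory code, or '\0' if the code is not in the table.
constexpr char top_level_category(std::string_view name) {
    return subcategory_index(name) < 0 ? '\0' : name[0];
}
```

For example, `top_level_category("Nd")` yields `'N'`, so a query for `\p{N}` can be answered from subcategory data while `\p{Nd}` stays distinguishable from `\p{Nl}` and `\p{No}`.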
More tokenizer fixes.

---

- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
  - [x] Low
  - [ ] Medium
  - [ ] High

---

Examples of...