jaime-m-p

Results: 8 comments of jaime-m-p

I have a Llama3 regex implementation. I did some tests, generating texts (randomly merging strings from tokenizer.json) and comparing the encodings to tiktoken's. The main idea is to first annotate all matched...
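A minimal sketch of the brute-force comparison idea described above (my illustration, not the actual test code): build random texts by merging vocabulary strings and check that two encoders produce identical token ids. `encode_impl` and `encode_reference` are hypothetical stand-ins for the tokenizer under test and the reference (e.g. tiktoken), and the small `vocab` list stands in for strings sampled from tokenizer.json.

```cpp
#include <cstdio>
#include <random>
#include <string>
#include <vector>

// Placeholder encoders so the sketch compiles and runs: byte values as token ids.
// In the real test these would be the implementation under test and the reference.
std::vector<int> encode_impl(const std::string & text) {
    return std::vector<int>(text.begin(), text.end());
}
std::vector<int> encode_reference(const std::string & text) {
    return std::vector<int>(text.begin(), text.end());
}

int main() {
    // Stand-in vocabulary; the real test samples strings from tokenizer.json.
    const std::vector<std::string> vocab = { "Hello", " world", "\n", "  ", "123", "don't" };

    std::mt19937 rng(1234);
    std::uniform_int_distribution<size_t> pick(0, vocab.size() - 1);

    for (int t = 0; t < 1000; ++t) {
        std::string text;
        for (int i = 0; i < 16; ++i) {
            text += vocab[pick(rng)];  // randomly merge vocabulary strings
        }
        if (encode_impl(text) != encode_reference(text)) {
            std::printf("mismatch for: '%s'\n", text.c_str());
            return 1;
        }
    }
    std::printf("all random tests passed\n");
    return 0;
}
```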

> Looks like the tokenizer tests are failing on Windows for some reason:
>
> https://github.com/ggerganov/llama.cpp/actions/runs/9096294810/job/25001393493?pr=7245#step:12:2583

I cannot debug this locally; it is possible to skip all but...

The problem is the stack size limit on Windows. According to the MSVC [/STACK](https://learn.microsoft.com/en-us/cpp/build/reference/stack-stack-allocations?view=msvc-170) documentation: *For ARM64, x86, and x64 machines, the default stack size is 1 MB.* `sizeof(std::array)`...
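A minimal sketch of the failure mode (hypothetical sizes, not the actual test code): a large `std::array` declared as a local variable lives on the stack and can exceed the 1 MB default limit on Windows, while moving the same array to the heap keeps only a pointer on the stack.

```cpp
#include <array>
#include <cstdio>
#include <memory>

void stack_version() {
    // ~2 MB of stack usage: overflows the default 1 MB MSVC stack limit.
    std::array<unsigned char, 2 * 1024 * 1024> buf{};
    std::printf("stack buffer size: %zu\n", buf.size());
}

void heap_version() {
    // Same data, but heap-allocated: only the pointer lives on the stack.
    auto buf = std::make_unique<std::array<unsigned char, 2 * 1024 * 1024>>();
    std::printf("heap buffer size: %zu\n", buf->size());
}

int main() {
    heap_version();
    // stack_version();  // would crash with a stack overflow under the default /STACK size
    return 0;
}
```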

I think I'm done here. Now I have the base to fix the tokenizers. The brute-force test found failing cases while testing more models (even the llama-3 custom regex is failing).

I'm actually trying to fix similar issues. Let me check and see if I can fix them.

@JhonDan1999 The Phi-3 tokenizer removes all whitespace (spaces, new lines, tabs, etc.) after these special tokens. See the `rstrip` attributes in `./models/tokenizers/phi-3/tokenizer.json`:

```json
{
  "content": "",
  "lstrip": false,
  "rstrip": true
},
{...
```
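A minimal sketch of how an `rstrip` flag on a special token can swallow the whitespace that follows it during tokenization (my simplification, not llama.cpp's or HF's implementation; `<|assistant|>` is just an example special token, and `match_special` is a hypothetical helper):

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// If `text` starts with `special` at `pos`, return the end of the match;
// with rstrip enabled, also consume any whitespace right after the token.
size_t match_special(const std::string & text, size_t pos,
                     const std::string & special, bool rstrip) {
    if (text.compare(pos, special.size(), special) != 0) {
        return pos;  // no match
    }
    size_t end = pos + special.size();
    if (rstrip) {
        while (end < text.size() && std::isspace((unsigned char) text[end])) {
            end++;  // spaces, tabs, new lines after the token are absorbed
        }
    }
    return end;
}

int main() {
    const std::string text = "<|assistant|>\n Hello";
    const size_t end = match_special(text, 0, "<|assistant|>", /*rstrip=*/true);
    // The remaining text starts directly at "Hello": the "\n " was stripped.
    std::printf("remaining: '%s'\n", text.substr(end).c_str());
    return 0;
}
```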

> > TODO: Implement unicode regex collapse trick for all subcategories.
>
> Do you expect any problems with this?

More problems than I thought:
- Need +29 *collapse codepoints* for...
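For reference, a minimal sketch of the collapse idea being discussed (my simplification, not the actual unicode.cpp code): since `std::regex` cannot match Unicode categories like `\p{L}`, each codepoint is first collapsed to a single representative character for its category, the regex runs over the collapsed text, and the match boundaries are then applied back to the original text. The category mapping and the regex below are made-up placeholders.

```cpp
#include <cstdio>
#include <cwctype>
#include <regex>
#include <string>

// Hypothetical category collapse: letters -> 'A', digits -> '0',
// whitespace -> ' ', everything else -> '.'.
wchar_t collapse(wchar_t cp) {
    if (std::iswalpha(cp)) return L'A';
    if (std::iswdigit(cp)) return L'0';
    if (std::iswspace(cp)) return L' ';
    return L'.';
}

int main() {
    const std::wstring text = L"hello world 42!!";

    std::wstring collapsed;
    for (wchar_t cp : text) {
        collapsed += collapse(cp);
    }

    // Regex over collapsed categories, standing in for \p{L}+ / \p{N}+ / etc.
    const std::wregex re(L"A+|0+| +|\\.+");

    // Split the *original* text using the offsets found on the collapsed text.
    for (auto it = std::wsregex_iterator(collapsed.begin(), collapsed.end(), re);
         it != std::wsregex_iterator(); ++it) {
        const std::wstring piece = text.substr(it->position(), it->length());
        std::printf("piece: '%ls'\n", piece.c_str());
    }
    return 0;
}
```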

I tested (a subset of the brute-force tests) all available BPE models, including `tekken`. Same results as before this PR. I also tested the original `tekken` regex and it seems correct too. The...