jaime-m-p

Results: 8 comments of jaime-m-p

I have a Llama3 regex implementation. I did some tests, generating texts (randomly merging strings from tokenizer.json) and comparing the encodings to tiktoken's. The main idea is to first annotate all matched...
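A minimal sketch of the brute-force comparison idea described above (my illustration, not the actual test code): build random texts by merging vocabulary strings and check that two encoders produce identical token ids. `encode_impl` and `encode_reference` are hypothetical stand-ins for the tokenizer under test and the reference (e.g. tiktoken), and the small `vocab` list stands in for strings sampled from tokenizer.json.

```cpp
#include <cstdio>
#include <random>
#include <string>
#include <vector>

// Placeholder encoders so the sketch compiles and runs: byte values as token ids.
// In the real test these would be the implementation under test and the reference.
std::vector<int> encode_impl(const std::string & text) {
    return std::vector<int>(text.begin(), text.end());
}
std::vector<int> encode_reference(const std::string & text) {
    return std::vector<int>(text.begin(), text.end());
}

int main() {
    // Stand-in vocabulary; the real test samples strings from tokenizer.json.
    const std::vector<std::string> vocab = { "Hello", " world", "\n", "  ", "123", "don't" };

    std::mt19937 rng(1234);
    std::uniform_int_distribution<size_t> pick(0, vocab.size() - 1);

    for (int t = 0; t < 1000; ++t) {
        std::string text;
        for (int i = 0; i < 16; ++i) {
            text += vocab[pick(rng)];  // randomly merge vocabulary strings
        }
        if (encode_impl(text) != encode_reference(text)) {
            std::printf("mismatch for: '%s'\n", text.c_str());
            return 1;
        }
    }
    std::printf("all random tests passed\n");
    return 0;
}
```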

> Looks like the tokenizer tests are failing on Windows for some reason:
>
> https://github.com/ggerganov/llama.cpp/actions/runs/9096294810/job/25001393493?pr=7245#step:12:2583

I cannot debug this locally; it is possible to skip all but...

The problem is the stack size limit on Windows. According to the MSVC [/STACK](https://learn.microsoft.com/en-us/cpp/build/reference/stack-stack-allocations?view=msvc-170) documentation: *For ARM64, x86, and x64 machines, the default stack size is 1 MB.* `sizeof(std::array)`...
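A minimal sketch of the failure mode (hypothetical sizes, not the actual test code): a large `std::array` declared as a local variable lives on the stack and can exceed the 1 MB default limit on Windows, while moving the same array to the heap keeps only a pointer on the stack.

```cpp
#include <array>
#include <cstdio>
#include <memory>

void stack_version() {
    // ~2 MB of stack usage: overflows the default 1 MB MSVC stack limit.
    std::array<unsigned char, 2 * 1024 * 1024> buf{};
    std::printf("stack buffer size: %zu\n", buf.size());
}

void heap_version() {
    // Same data, but heap-allocated: only the pointer lives on the stack.
    auto buf = std::make_unique<std::array<unsigned char, 2 * 1024 * 1024>>();
    std::printf("heap buffer size: %zu\n", buf->size());
}

int main() {
    heap_version();
    // stack_version();  // would crash with a stack overflow under the default /STACK size
    return 0;
}
```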

I think I'm done here. Now I have the base to fix the tokenizers. The brute-force test found failing cases while testing more models (even the llama-3 custom regex is failing).

I'm actually trying to fix similar issues. Let me check and see if I can fix them.

@JhonDan1999 The Phi-3 tokenizer removes all whitespace (spaces, new lines, tabs, etc.) after these special tokens. See the `rstrip` attributes in `./models/tokenizers/phi-3/tokenizer.json`:

```json
{
  "content": "",
  "lstrip": false,
  "rstrip": true
},
{...
```
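A minimal sketch of how an `rstrip` flag on a special token can swallow the whitespace that follows it during tokenization (my simplification, not llama.cpp's or HF's implementation; `<|assistant|>` is just an example special token, and `match_special` is a hypothetical helper):

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// If `text` starts with `special` at `pos`, return the end of the match;
// with rstrip enabled, also consume any whitespace right after the token.
size_t match_special(const std::string & text, size_t pos,
                     const std::string & special, bool rstrip) {
    if (text.compare(pos, special.size(), special) != 0) {
        return pos;  // no match
    }
    size_t end = pos + special.size();
    if (rstrip) {
        while (end < text.size() && std::isspace((unsigned char) text[end])) {
            end++;  // spaces, tabs, new lines after the token are absorbed
        }
    }
    return end;
}

int main() {
    const std::string text = "<|assistant|>\n Hello";
    const size_t end = match_special(text, 0, "<|assistant|>", /*rstrip=*/true);
    // The remaining text starts directly at "Hello": the "\n " was stripped.
    std::printf("remaining: '%s'\n", text.substr(end).c_str());
    return 0;
}
```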

> > TODO: Implement unicode regex collapse trick for all subcategories.
>
> Do you expect any problems with this?

More problems than I thought:
- Need +29 *collapse codepoints* for...
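For reference, a minimal sketch of the collapse idea being discussed (my simplification, not the actual unicode.cpp code): since `std::regex` cannot match Unicode categories like `\p{L}`, each codepoint is first collapsed to a single representative character for its category, the regex runs over the collapsed text, and the match boundaries are then applied back to the original text. The category mapping and the regex below are made-up placeholders.

```cpp
#include <cstdio>
#include <cwctype>
#include <regex>
#include <string>

// Hypothetical category collapse: letters -> 'A', digits -> '0',
// whitespace -> ' ', everything else -> '.'.
wchar_t collapse(wchar_t cp) {
    if (std::iswalpha(cp)) return L'A';
    if (std::iswdigit(cp)) return L'0';
    if (std::iswspace(cp)) return L' ';
    return L'.';
}

int main() {
    const std::wstring text = L"hello world 42!!";

    std::wstring collapsed;
    for (wchar_t cp : text) {
        collapsed += collapse(cp);
    }

    // Regex over collapsed categories, standing in for \p{L}+ / \p{N}+ / etc.
    const std::wregex re(L"A+|0+| +|\\.+");

    // Split the *original* text using the offsets found on the collapsed text.
    for (auto it = std::wsregex_iterator(collapsed.begin(), collapsed.end(), re);
         it != std::wsregex_iterator(); ++it) {
        const std::wstring piece = text.substr(it->position(), it->length());
        std::printf("piece: '%ls'\n", piece.c_str());
    }
    return 0;
}
```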

I tested (a subset of the brute-force tests) all available BPE models, including `tekken`. Same results as before this PR. I also tested the original `tekken` regex and it seems correct too. The...