
Use external hf_tokenizer in llama runner

jackzhxng opened this pull request 10 months ago • 5 comments

Summary

Use the HuggingFace tokenizer from https://github.com/pytorch-labs/tokenizers in the Llama runner.
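
For illustration, here is a minimal sketch of what loading and using an HF tokenizer through that library could look like on the C++ side. The header path, class name, method signatures, and Result-style return type are assumptions modeled on the pytorch-labs/tokenizers repo, not taken from this PR:

// Sketch only: assumes tokenizers::HFTokenizer from pytorch-labs/tokenizers.
#include <pytorch/tokenizers/hf_tokenizer.h>  // assumed header path
#include <cstdio>

int main() {
  tokenizers::HFTokenizer tok;
  // Load a HuggingFace-format tokenizer.json (error enum assumed).
  if (tok.load("tokenizer.json") != tokenizers::Error::Ok) {
    std::fprintf(stderr, "failed to load tokenizer.json\n");
    return 1;
  }
  // encode(text, n_bos, n_eos): how many BOS/EOS tokens to add
  // (signature assumed; returns a Result-style wrapper).
  auto ids = tok.encode("Once upon a time", /*bos=*/1, /*eos=*/0);
  if (ids.ok()) {
    std::printf("prompt tokens: %zu\n", ids.get().size());
  }
  return 0;
}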

Results on Qwen2.5, with extension/llm/tokenizers checked out at the branch from https://github.com/pytorch-labs/tokenizers/pull/50:

Once upon a time,  there was a little girl named Lily. She was very happy. She had a big garden in the back of her house. She planted many flowers in it. They were red, yellow and blue. They were very pretty. Lily loved them very much. One day, she was watering them. Suddenly, she heard a noise. It was a noise in the tree. She looked up. There was a big bird in the tree. It was eating one of Lily's flowers. Lily was very angry. She ran to the tree. "Hello!" she said to the bird. "What are you doing in my
I 00:00:08.624959 executorch:runner.cpp:294] RSS after finishing text generation: 2147.121094 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":4,"generated_tokens":123,"model_load_start_ms":1744936315023,"model_load_end_ms":1744936318524,"inference_start_ms":1744936318524,"inference_end_ms":1744936323646,"prompt_eval_end_ms":1744936318580,"first_token_ms":1744936318580,"aggregate_sampling_time_ms":274877907025,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:08.625019 executorch:stats.h:106]       Prompt Tokens: 4    Generated Tokens: 123
I 00:00:08.625021 executorch:stats.h:112]       Model Load Time:                3.501000 (seconds)
I 00:00:08.625023 executorch:stats.h:119]       Total inference time:           5.122000 (seconds)               Rate:  24.014057 (tokens/second)
I 00:00:08.625033 executorch:stats.h:129]               Prompt evaluation:      0.056000 (seconds)               Rate:  71.428571 (tokens/second)
I 00:00:08.625038 executorch:stats.h:138]               Generated 123 tokens:   5.066000 (seconds)               Rate:  24.279510 (tokens/second)
I 00:00:08.625045 executorch:stats.h:149]       Time to first generated token:  0.056000 (seconds)
I 00:00:08.625047 executorch:stats.h:155]       Sampling time over 127 tokens:  274877907.025000 (seconds)

Test plan

Build the llama runner locally (note the inclusion of -DSUPPORT_REGEX_LOOKAHEAD=ON, which builds in support for the lookahead regex patterns that some HuggingFace pre-tokenizers use):

cmake -DPYTHON_EXECUTABLE=python \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DSUPPORT_REGEX_LOOKAHEAD=ON \
    -Bcmake-out/examples/models/llama \
    examples/models/llama

cmake --build cmake-out/examples/models/llama -j16 --config Release

Run on Qwen2.5:

cmake-out/examples/models/llama/llama_main --model_path=qwen2_5.pte --tokenizer_path ~/hf/models--Qwen--Qwen2.5-1.5B/snapshots/8faed761d45a263340a0528343f099c05c9a4323/tokenizer.json --prompt="Once upon a time" --temperature 0

jackzhxng avatar Mar 10 '25 23:03 jackzhxng

Helpful Links

See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9112

Note: Links to docs will display an error until the docs builds have been completed.

You can merge normally! (1 Unrelated Failure)

As of commit 4ff5d8b67720f86b363b62bf15b4e6ad0926fbca with merge base c5dd4767eb59707e906199f12e61f2109cf04004:

FLAKY - The following job failed but was likely due to flakiness present on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Mar 10 '25 23:03 pytorch-bot[bot]

@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Apr 18 '25 00:04 facebook-github-bot

Getting this error when running the llm runner with an HF tokenizer:

failed to open encoder file: ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
E tokenizers:tiktoken.cpp:92] failed to open encoder file: ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
E tokenizers:llama2c_tokenizer.cpp:49] couldn't load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
I 00:00:01.377955 executorch:runner.cpp:121] Failed to load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json as a Tiktoken artifact, trying BPE tokenizer
E tokenizers:llama2c_tokenizer.cpp:49] couldn't load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
E 00:00:01.377960 executorch:runner.cpp:129] Tokenizer error: 4
E 00:00:01.377962 executorch:runner.cpp:129] Failed to load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json as a llama2.c tokenizer artifact

I also tried a hack with tokenizer.model for llama3.2-1b; it failed as well.

guangy10 avatar Apr 30 '25 23:04 guangy10

> Getting this error when running the llm runner with an HF tokenizer:
>
> failed to open encoder file: ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
>
> I also tried a hack with tokenizer.model for llama3.2-1b; it failed as well.

Yeah, the existing logic tries to deserialize the artifact as a Tiktoken tokenizer and then falls back to the BPE tokenizer. We need some logic to use the HF tokenizer, along the lines of the sketch below.
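
// Illustrative sketch of a try-in-order loader; all class names and
// signatures are assumptions modeled on the pytorch-labs/tokenizers
// headers, and the real runner.cpp logic may differ.
#include <pytorch/tokenizers/hf_tokenizer.h>
#include <pytorch/tokenizers/tiktoken.h>
#include <pytorch/tokenizers/llama2c_tokenizer.h>
#include <memory>
#include <string>

std::unique_ptr<tokenizers::Tokenizer> load_tokenizer(const std::string& path) {
  // Try a HuggingFace tokenizer.json first, since neither of the two
  // existing loaders understands that format.
  auto hf = std::make_unique<tokenizers::HFTokenizer>();
  if (hf->load(path) == tokenizers::Error::Ok) return hf;

  // Fall back to a Tiktoken artifact...
  auto tt = std::make_unique<tokenizers::Tiktoken>();
  if (tt->load(path) == tokenizers::Error::Ok) return tt;

  // ...and finally to a llama2.c BPE artifact.
  auto bpe = std::make_unique<tokenizers::Llama2cTokenizer>();
  if (bpe->load(path) == tokenizers::Error::Ok) return bpe;

  return nullptr;  // caller logs "Tokenizer error", as in the output above
}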

larryliu0820 avatar Apr 30 '25 23:04 larryliu0820

@guangy10 https://github.com/pytorch/executorch/pull/10326 should allow an arbitrary tokenizer to be passed into the runner.
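
In other words, roughly this shape, where the caller constructs the tokenizer and injects it rather than passing a file path. The Runner constructor and generate() signatures here are hypothetical; see the linked PR for the actual interface:

// Hypothetical usage after #10326 (plus the runner header from
// examples/models/llama); tokenizer injection instead of a path.
#include <pytorch/tokenizers/hf_tokenizer.h>
#include <memory>

int main() {
  auto tok = std::make_unique<tokenizers::HFTokenizer>();
  if (tok->load("tokenizer.json") != tokenizers::Error::Ok) return 1;
  example::Runner runner("qwen2_5.pte", std::move(tok));  // assumed signature
  runner.generate("Once upon a time");                    // assumed API
  return 0;
}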

larryliu0820 avatar Apr 30 '25 23:04 larryliu0820

@guangy10 can you try building the runner with -DSUPPORT_REGEX_LOOKAHEAD=ON?

jackzhxng avatar May 01 '25 05:05 jackzhxng

> -DSUPPORT_REGEX_LOOKAHEAD=ON

I was just using the build command from your test plan; I believe that flag is already set there.

guangy10 avatar May 01 '25 18:05 guangy10