Use external hf_tokenizer in llama runner
Summary
Use the Hugging Face tokenizer from https://github.com/pytorch-labs/tokenizers in the Llama runner.
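For reference, here is a minimal sketch of how loading a tokenizer.json through the external library could look; the header, namespace, and method names are assumptions based on the pytorch-labs/tokenizers layout rather than code from this PR:

```cpp
// Sketch: load a HF tokenizer.json via pytorch-labs/tokenizers and encode a
// prompt. Names (HFTokenizer, Error::Ok, encode) are assumed from that repo.
#include <pytorch/tokenizers/hf_tokenizer.h>

#include <iostream>
#include <memory>
#include <string>

int main() {
  auto tokenizer = std::make_unique<tokenizers::HFTokenizer>();

  // Hypothetical local path to a HF snapshot's tokenizer.json.
  const std::string path = "tokenizer.json";
  if (tokenizer->load(path) != tokenizers::Error::Ok) {
    std::cerr << "failed to load " << path << std::endl;
    return 1;
  }

  // Encode the prompt with BOS but no EOS, mirroring how the runner
  // tokenizes the user prompt before prefill.
  auto encoded = tokenizer->encode("Once upon a time", /*bos=*/1, /*eos=*/0);
  if (encoded.ok()) {
    std::cout << "token count: " << encoded.get().size() << std::endl;
  }
  return 0;
}
```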
Results on Qwen2.5 with extension/llm/tokenizers checked out to https://github.com/pytorch-labs/tokenizers/pull/50:
Once upon a time, there was a little girl named Lily. She was very happy. She had a big garden in the back of her house. She planted many flowers in it. They were red, yellow and blue. They were very pretty. Lily loved them very much. One day, she was watering them. Suddenly, she heard a noise. It was a noise in the tree. She looked up. There was a big bird in the tree. It was eating one of Lily's flowers. Lily was very angry. She ran to the tree. "Hello!" she said to the bird. "What are you doing in my
```
I 00:00:08.624959 executorch:runner.cpp:294] RSS after finishing text generation: 2147.121094 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":4,"generated_tokens":123,"model_load_start_ms":1744936315023,"model_load_end_ms":1744936318524,"inference_start_ms":1744936318524,"inference_end_ms":1744936323646,"prompt_eval_end_ms":1744936318580,"first_token_ms":1744936318580,"aggregate_sampling_time_ms":274877907025,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:08.625019 executorch:stats.h:106] Prompt Tokens: 4 Generated Tokens: 123
I 00:00:08.625021 executorch:stats.h:112] Model Load Time: 3.501000 (seconds)
I 00:00:08.625023 executorch:stats.h:119] Total inference time: 5.122000 (seconds) Rate: 24.014057 (tokens/second)
I 00:00:08.625033 executorch:stats.h:129] Prompt evaluation: 0.056000 (seconds) Rate: 71.428571 (tokens/second)
I 00:00:08.625038 executorch:stats.h:138] Generated 123 tokens: 5.066000 (seconds) Rate: 24.279510 (tokens/second)
I 00:00:08.625045 executorch:stats.h:149] Time to first generated token: 0.056000 (seconds)
I 00:00:08.625047 executorch:stats.h:155] Sampling time over 127 tokens: 274877907.025000 (seconds)
```
Test plan
Build the llama runner locally (note the inclusion of -DSUPPORT_REGEX_LOOKAHEAD=ON):
```bash
cmake -DPYTHON_EXECUTABLE=python \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DSUPPORT_REGEX_LOOKAHEAD=ON \
    -Bcmake-out/examples/models/llama \
    examples/models/llama

cmake --build cmake-out/examples/models/llama -j16 --config Release
```
Run on Qwen2.5:
```bash
cmake-out/examples/models/llama/llama_main \
    --model_path=qwen2_5.pte \
    --tokenizer_path ~/hf/models--Qwen--Qwen2.5-1.5B/snapshots/8faed761d45a263340a0528343f099c05c9a4323/tokenizer.json \
    --prompt="Once upon a time" \
    --temperature 0
```
@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Getting this error when running the llm runner with a HF tokenizer:
```
failed to open encoder file: ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
E tokenizers:tiktoken.cpp:92] failed to open encoder file: ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
E tokenizers:llama2c_tokenizer.cpp:49] couldn't load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
I 00:00:01.377955 executorch:runner.cpp:121] Failed to load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json as a Tiktoken artifact, trying BPE tokenizer
E tokenizers:llama2c_tokenizer.cpp:49] couldn't load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
E 00:00:01.377960 executorch:runner.cpp:129] Tokenizer error: 4
E 00:00:01.377962 executorch:runner.cpp:129] Failed to load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json as a llama2.c tokenizer artifact
```
I also tried a hack with tokenizer.model for llama3.2-1b, and it failed as well.
Yeah, the existing logic tries to deserialize the artifact as a Tiktoken tokenizer and then falls back to the BPE tokenizer. We need some logic to use the HF tokenizer.
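As a rough sketch of what that could look like (class and method names assumed from pytorch-labs/tokenizers; the actual runner wiring in this PR may differ), trying the HF tokenizer first and keeping the existing fallbacks:

```cpp
// Sketch: tokenizer fallback chain with the HF tokenizer tried first, then
// Tiktoken, then the llama2.c BPE tokenizer. Assumed names from
// pytorch-labs/tokenizers; not the exact code in the runner.
#include <pytorch/tokenizers/hf_tokenizer.h>
#include <pytorch/tokenizers/llama2c_tokenizer.h>
#include <pytorch/tokenizers/tiktoken.h>

#include <memory>
#include <string>

std::unique_ptr<tokenizers::Tokenizer> load_tokenizer(
    const std::string& tokenizer_path) {
  // 1. Hugging Face tokenizer.json.
  auto hf = std::make_unique<tokenizers::HFTokenizer>();
  if (hf->load(tokenizer_path) == tokenizers::Error::Ok) {
    return hf;
  }
  // 2. Tiktoken artifact (e.g. Llama 3 tokenizer.model).
  auto tiktoken = std::make_unique<tokenizers::Tiktoken>();
  if (tiktoken->load(tokenizer_path) == tokenizers::Error::Ok) {
    return tiktoken;
  }
  // 3. llama2.c BPE artifact as the last resort.
  auto bpe = std::make_unique<tokenizers::Llama2cTokenizer>();
  if (bpe->load(tokenizer_path) == tokenizers::Error::Ok) {
    return bpe;
  }
  return nullptr;
}
```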
@guangy10 https://github.com/pytorch/executorch/pull/10326 should allow an arbitrary tokenizer to be passed into the runner.
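To illustrate the idea (not the actual API of that PR), a runner interface that accepts a caller-constructed tokenizer might look roughly like this; TextLLMRunner and its constructor signature are hypothetical:

```cpp
// Hypothetical runner interface that takes an already-loaded tokenizer, so
// the caller decides which tokenizer implementation to use. Illustration only.
#include <pytorch/tokenizers/hf_tokenizer.h>

#include <memory>
#include <string>
#include <utility>

class TextLLMRunner {  // hypothetical name, not the PR's actual class
 public:
  TextLLMRunner(std::string model_path,
                std::unique_ptr<tokenizers::Tokenizer> tokenizer)
      : model_path_(std::move(model_path)),
        tokenizer_(std::move(tokenizer)) {}

 private:
  std::string model_path_;
  std::unique_ptr<tokenizers::Tokenizer> tokenizer_;
};

int main() {
  auto tok = std::make_unique<tokenizers::HFTokenizer>();
  tok->load("tokenizer.json");  // hypothetical local path
  TextLLMRunner runner("qwen2_5.pte", std::move(tok));
  return 0;
}
```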
@guangy10 can you try building the runner with -DSUPPORT_REGEX_LOOKAHEAD=ON?
I was just using the build command in your test plan; that flag is set there, I believe.