Use external hf_tokenizer in llama runner
Summary
Use the Hugging Face tokenizer from https://github.com/pytorch-labs/tokenizers in the Llama runner.
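For reference, here is a minimal sketch of how loading a tokenizer.json through the external library could look; the header, namespace, and method names are assumptions based on the pytorch-labs/tokenizers layout rather than code from this PR:

```cpp
// Sketch: load a HF tokenizer.json via pytorch-labs/tokenizers and encode a
// prompt. Names (HFTokenizer, Error::Ok, encode) are assumed from that repo.
#include <pytorch/tokenizers/hf_tokenizer.h>

#include <iostream>
#include <memory>
#include <string>

int main() {
  auto tokenizer = std::make_unique<tokenizers::HFTokenizer>();

  // Hypothetical local path to a HF snapshot's tokenizer.json.
  const std::string path = "tokenizer.json";
  if (tokenizer->load(path) != tokenizers::Error::Ok) {
    std::cerr << "failed to load " << path << std::endl;
    return 1;
  }

  // Encode the prompt with BOS but no EOS, mirroring how the runner
  // tokenizes the user prompt before prefill.
  auto encoded = tokenizer->encode("Once upon a time", /*bos=*/1, /*eos=*/0);
  if (encoded.ok()) {
    std::cout << "token count: " << encoded.get().size() << std::endl;
  }
  return 0;
}
```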
Results on Qwen2.5 with extension/llm/tokenizers checked out to https://github.com/pytorch-labs/tokenizers/pull/50:
Once upon a time, there was a little girl named Lily. She was very happy. She had a big garden in the back of her house. She planted many flowers in it. They were red, yellow and blue. They were very pretty. Lily loved them very much. One day, she was watering them. Suddenly, she heard a noise. It was a noise in the tree. She looked up. There was a big bird in the tree. It was eating one of Lily's flowers. Lily was very angry. She ran to the tree. "Hello!" she said to the bird. "What are you doing in my
```
I 00:00:08.624959 executorch:runner.cpp:294] RSS after finishing text generation: 2147.121094 MiB (0 if unsupported)
PyTorchObserver {"prompt_tokens":4,"generated_tokens":123,"model_load_start_ms":1744936315023,"model_load_end_ms":1744936318524,"inference_start_ms":1744936318524,"inference_end_ms":1744936323646,"prompt_eval_end_ms":1744936318580,"first_token_ms":1744936318580,"aggregate_sampling_time_ms":274877907025,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:08.625019 executorch:stats.h:106] Prompt Tokens: 4 Generated Tokens: 123
I 00:00:08.625021 executorch:stats.h:112] Model Load Time: 3.501000 (seconds)
I 00:00:08.625023 executorch:stats.h:119] Total inference time: 5.122000 (seconds) Rate: 24.014057 (tokens/second)
I 00:00:08.625033 executorch:stats.h:129] Prompt evaluation: 0.056000 (seconds) Rate: 71.428571 (tokens/second)
I 00:00:08.625038 executorch:stats.h:138] Generated 123 tokens: 5.066000 (seconds) Rate: 24.279510 (tokens/second)
I 00:00:08.625045 executorch:stats.h:149] Time to first generated token: 0.056000 (seconds)
I 00:00:08.625047 executorch:stats.h:155] Sampling time over 127 tokens: 274877907.025000 (seconds)
```
Test plan
Build the llama runner locally (note the inclusion of -DSUPPORT_REGEX_LOOKAHEAD=ON):
```bash
cmake -DPYTHON_EXECUTABLE=python \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DSUPPORT_REGEX_LOOKAHEAD=ON \
    -Bcmake-out/examples/models/llama \
    examples/models/llama

cmake --build cmake-out/examples/models/llama -j16 --config Release
```
Run on Qwen2.5:
```bash
cmake-out/examples/models/llama/llama_main \
    --model_path=qwen2_5.pte \
    --tokenizer_path ~/hf/models--Qwen--Qwen2.5-1.5B/snapshots/8faed761d45a263340a0528343f099c05c9a4323/tokenizer.json \
    --prompt="Once upon a time" \
    --temperature 0
```
@jackzhxng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Getting this error when running the llm runner with a HF tokenizer:
```
failed to open encoder file: ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
E tokenizers:tiktoken.cpp:92] failed to open encoder file: ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
E tokenizers:llama2c_tokenizer.cpp:49] couldn't load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
I 00:00:01.377955 executorch:runner.cpp:121] Failed to load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json as a Tiktoken artifact, trying BPE tokenizer
E tokenizers:llama2c_tokenizer.cpp:49] couldn't load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json
E 00:00:01.377960 executorch:runner.cpp:129] Tokenizer error: 4
E 00:00:01.377962 executorch:runner.cpp:129] Failed to load ~/.cache/huggingface/hub/models--google--gemma-3-1b-it/snapshots/dcc83ea841ab6100d6b47a070329e1ba4cf78752/tokenizer.json as a llama2.c tokenizer artifact
```
I also tried a hack with tokenizer.model for llama3.2-1b, and it failed as well.
Yeah, the existing logic tries to deserialize the artifact as a Tiktoken tokenizer and then falls back to the BPE tokenizer. We need some logic to use the HF tokenizer.
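As a rough sketch of what that could look like (class and method names assumed from pytorch-labs/tokenizers; the actual runner wiring in this PR may differ), trying the HF tokenizer first and keeping the existing fallbacks:

```cpp
// Sketch: tokenizer fallback chain with the HF tokenizer tried first, then
// Tiktoken, then the llama2.c BPE tokenizer. Assumed names from
// pytorch-labs/tokenizers; not the exact code in the runner.
#include <pytorch/tokenizers/hf_tokenizer.h>
#include <pytorch/tokenizers/llama2c_tokenizer.h>
#include <pytorch/tokenizers/tiktoken.h>

#include <memory>
#include <string>

std::unique_ptr<tokenizers::Tokenizer> load_tokenizer(
    const std::string& tokenizer_path) {
  // 1. Hugging Face tokenizer.json.
  auto hf = std::make_unique<tokenizers::HFTokenizer>();
  if (hf->load(tokenizer_path) == tokenizers::Error::Ok) {
    return hf;
  }
  // 2. Tiktoken artifact (e.g. Llama 3 tokenizer.model).
  auto tiktoken = std::make_unique<tokenizers::Tiktoken>();
  if (tiktoken->load(tokenizer_path) == tokenizers::Error::Ok) {
    return tiktoken;
  }
  // 3. llama2.c BPE artifact as the last resort.
  auto bpe = std::make_unique<tokenizers::Llama2cTokenizer>();
  if (bpe->load(tokenizer_path) == tokenizers::Error::Ok) {
    return bpe;
  }
  return nullptr;
}
```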
@guangy10 https://github.com/pytorch/executorch/pull/10326 should allow an arbitrary tokenizer to be passed into the runner.
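To illustrate the idea (not the actual API of that PR), a runner interface that accepts a caller-constructed tokenizer might look roughly like this; TextLLMRunner and its constructor signature are hypothetical:

```cpp
// Hypothetical runner interface that takes an already-loaded tokenizer, so
// the caller decides which tokenizer implementation to use. Illustration only.
#include <pytorch/tokenizers/hf_tokenizer.h>

#include <memory>
#include <string>
#include <utility>

class TextLLMRunner {  // hypothetical name, not the PR's actual class
 public:
  TextLLMRunner(std::string model_path,
                std::unique_ptr<tokenizers::Tokenizer> tokenizer)
      : model_path_(std::move(model_path)),
        tokenizer_(std::move(tokenizer)) {}

 private:
  std::string model_path_;
  std::unique_ptr<tokenizers::Tokenizer> tokenizer_;
};

int main() {
  auto tok = std::make_unique<tokenizers::HFTokenizer>();
  tok->load("tokenizer.json");  // hypothetical local path
  TextLLMRunner runner("qwen2_5.pte", std::move(tok));
  return 0;
}
```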
@guangy10 can you try building the runner with -DSUPPORT_REGEX_LOOKAHEAD=ON?
I was just using the build command in your test plan; that flag is set there, I believe.