NotImplementedError: Vlm do not work with prefix caching yet
System Info
Hello,
- Model: gemma3 (google/gemma-3-27b-it)
- TGI version: 3.2.0
- GPU: 1 x H100 80GB
- OS: Ubuntu 24
- Cloud: DigitalOcean
- All TGI parameters at their defaults
logs:
2025-03-13T13:55:10.739163Z INFO text_generation_launcher: Args {
model_id: "google/gemma-3-27b-it",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: None,
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "dd0b9cf5c3f3",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
2025-03-13T13:55:12.077667Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2025-03-13T13:55:12.096265Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4096
2025-03-13T13:55:12.096276Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-03-13T13:55:12.096368Z INFO download: text_generation_launcher: Starting check and download process for google/gemma-3-27b-it
2025-03-13T13:55:15.225392Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-03-13T13:55:15.716334Z INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-3-27b-it
2025-03-13T13:55:15.716547Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-03-13T13:55:18.850149Z INFO text_generation_launcher: Using prefix caching = True
2025-03-13T13:55:18.850171Z INFO text_generation_launcher: Using Attention = flashinfer
2025-03-13T13:55:24.517462Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
File "/usr/src/.venv/bin/text-generation-server", line 10, in
File "/usr/src/server/text_generation_server/server.py", line 268, in serve_inner model = get_model_with_lora_adapters( File "/usr/src/server/text_generation_server/models/init.py", line 1690, in get_model_with_lora_adapters model = get_model( File "/usr/src/server/text_generation_server/models/init.py", line 1159, in get_model return VlmCausalLM( File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 352, in init raise NotImplementedError("Vlm do not work with prefix caching yet") NotImplementedError: Vlm do not work with prefix caching yet 2025-03-13T13:55:25.641160Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2025-03-13 13:55:17.001 | INFO | text_generation_server.utils.import_utils:torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd(cast_inputs=torch.float16)
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
@custom_fwd
/usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
@custom_bwd
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /usr/src/server/text_generation_server/cli.py:119 in serve │
│ │
│ 116 │ │ raise RuntimeError( │
│ 117 │ │ │ "Only 1 can be set between dtype and quantize, as they │
│ 118 │ │ ) │
│ ❱ 119 │ server.serve( │
│ 120 │ │ model_id, │
│ 121 │ │ lora_adapters, │
│ 122 │ │ revision, │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ dtype = None │ │
│ │ json_output = True │ │
│ │ kv_cache_dtype = None │ │
│ │ logger_level = 'INFO' │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'google/gemma-3-27b-it' │ │
│ │ otlp_endpoint = None │ │
│ │ otlp_service_name = 'text-generation-inference.router' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ server = <module 'text_generation_server.server' from │ │
│ │ '/usr/src/server/text_generation_server/server.py'> │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰──────────────────────────────────────────────────────────────────────────╯ │
│ │
│ /usr/src/server/text_generation_server/server.py:315 in serve │
│ │
│ 312 │ │ while signal_handler.KEEP_PROCESSING: │
│ 313 │ │ │ await asyncio.sleep(0.5) │
│ 314 │ │
│ ❱ 315 │ asyncio.run( │
│ 316 │ │ serve_inner( │
│ 317 │ │ │ model_id, │
│ 318 │ │ │ lora_adapters, │
│ │
│ ╭─────────────────────────── locals ───────────────────────────╮ │
│ │ dtype = None │ │
│ │ kv_cache_dtype = None │ │
│ │ lora_adapters = [] │ │
│ │ max_input_tokens = None │ │
│ │ model_id = 'google/gemma-3-27b-it' │ │
│ │ quantize = None │ │
│ │ revision = None │ │
│ │ sharded = False │ │
│ │ speculate = None │ │
│ │ trust_remote_code = False │ │
│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │
│ ╰──────────────────────────────────────────────────────────────╯ │
│ │
│ /root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11 │
│ /asyncio/runners.py:190 in run │
│ │
│ 187 │ │ │ "asyncio.run() cannot be called from a running event loop" │
│ 188 │ │
│ 189 │ with Runner(debug=debug) as runner: │
│ ❱ 190 │ │ return runner.run(main) │
│ 191 │
│ 192 │
│ 193 def _cancel_all_tasks(loop): │
│ │
│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │
│ │ debug = None │ │
│ │ main = <coroutine object serve.
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
docker run -d --name tgi \
  --gpus all \
  -e MODEL_ID=google/gemma-3-27b-it \
  -e HF_HOME=/HF_CACHE \
  -p 127.0.0.1:8080:8080 \
  -v "/home/huggningface_cache/:/HF_CACHE" \
  ghcr.io/huggingface/text-generation-inference:3.2.0
Expected behavior
The server should just start.
Why doesn't it work?
Simply set PREFIX_CACHE=0 in the environment. Why isn't this documented anywhere? 🤯
I have a similar issue:
NotImplementedError: Vlm do not work with prefix caching yet rank=0
2025-03-14T13:03:29.652543Z ERROR text_generation_launcher: Shard 0 failed to start
2025-03-14T13:03:29.652573Z INFO text_generation_launcher: Shutting down shards
Even passing PREFIX_CACHE=0 via docker env doesn't help.
> Simply set PREFIX_CACHE=0 in the environment. Why isn't this documented anywhere? 🤯
Because prefix caching should be disabled automatically for VLMs. Could you give more information on how you are starting TGI, since I cannot reproduce this locally:
$ text-generation-launcher --model-id google/gemma-3-27b-it --num-shard 4
[...]
2025-03-17T09:48:59.871425Z INFO text_generation_launcher: Disabling prefix caching because of VLM model
2025-03-17T09:48:59.871451Z INFO text_generation_launcher: Forcing attention to 'flashdecoding' because head dim is not supported by flashinfer, also disabling prefix caching
2025-03-17T09:48:59.871455Z INFO text_generation_launcher: Using attention flashdecoding - Prefix caching 0
[...]
> Could you give more information on how you are starting TGI
In the project I'm using Docker Compose, but for reproduction I tested directly with docker run. I'm not sure there is a difference between running through Docker and the TGI launcher itself, but this is not working 👇
docker run -d --name tgi \
  --gpus all \
  -e MODEL_ID=google/gemma-3-27b-it \
  -e HF_HOME=/HF_CACHE \
  -p 127.0.0.1:8080:8080 \
  -v "/home/huggningface_cache/:/HF_CACHE" \
  ghcr.io/huggingface/text-generation-inference:3.2.0
> Because prefix caching should be disabled automatically for VLMs. Could you give more information on how you are starting TGI, since I cannot reproduce this locally: [...]
Thanks, I am also starting it similarly. I've never specified anything regarding prefix caching. By default it seems this is disabled already, so I am a bit puzzled how it's enabled and why it isn't getting disabled when I pass PREFIX_CACHE=0 via -e into the Docker container.
How I launch:
docker run \
--name "${NAME}" \
--gpus all \
--shm-size 8g \
-p $PORT:80 \
-e HUGGING_FACE_HUB_TOKEN=... \
-e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
-e HF_HUB_OFFLINE=0 \
-e TRUST_REMOTE_CODE=true \
-v $volume:/data \
--detach \
ghcr.io/huggingface/text-generation-inference:3.2.0 \
--model-id $model_id \
$revision_flag \
--sharded $SHARDED \
$quantize_flag \
--num-shard $num_shard \
--cuda-memory-fraction=$cuda_fraction \
$rope_flag \
--max-input-length=$MAX_TOKEN_LENGTH \
--max-total-tokens=$MAX_TOTAL_TOKENS \
> Simply set PREFIX_CACHE=0 in the environment. Why isn't this documented anywhere? 🤯
Hi guys, try passing PREFIX_CACHING=0 instead of PREFIX_CACHE=0.
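For example, adapting the reproduction command from earlier in this thread (same image tag, model id, and paths as above), the variable can be passed to the container with -e:
docker run -d --name tgi \
  --gpus all \
  -e MODEL_ID=google/gemma-3-27b-it \
  -e HF_HOME=/HF_CACHE \
  -e PREFIX_CACHING=0 \
  -p 127.0.0.1:8080:8080 \
  -v "/home/huggningface_cache/:/HF_CACHE" \
  ghcr.io/huggingface/text-generation-inference:3.2.0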
Thanks @EgorSWEB - that worked!
I can see there is logic here to set the value dynamically, but I think it fails to recognize that the model is a VLM and so never automatically sets the value to 0.
https://github.com/huggingface/text-generation-inference/blob/e497bc09f6107baae4f06d6d31fc18730d0970c3/server/text_generation_server/models/globals.py#L10
It seems the reason TGI is failing to recognize the model as a VLM is an issue that occurs when HF_HUB_OFFLINE=1 is set.
I think there's a mismatch between how the launcher and router try to resolve the config in offline mode. For example, I think the reason the launcher sets prefix caching to true is that:
- get_config() by default tries to use the ApiBuilder which reaches out to the Hub, and since there's no token env var, the request fails: Err(RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/google/gemma-3-4b-it/resolve/main/config.json])))
- This then sets the Config here to None (when really it should raise an error)
- Since config is None, the resolve_attention function skips the if config.vision_config.is_some() check completely, and thus sets attention to "flashinfer" and prefix_caching to "true" (sketched below)
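To illustrate, here is a minimal Python sketch of the failure mode described above. It is not the actual TGI code (the real resolve_attention lives in the Rust launcher) and the names are simplified, but it shows how a None config bypasses the VLM check:
# Illustrative sketch only -- not the actual TGI code. If the model config
# cannot be fetched (e.g. HF_HUB_OFFLINE=1 and the 401 above), config is None
# and the VLM branch is never taken, so prefix caching stays enabled.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    vision_config: Optional[dict] = None  # present for VLMs such as gemma-3


def resolve_attention(config: Optional[Config]) -> tuple[str, bool]:
    attention = "flashinfer"
    prefix_caching = True
    if config is not None and config.vision_config is not None:
        # VLM detected: prefix caching is unsupported, fall back to flashdecoding.
        attention = "flashdecoding"
        prefix_caching = False
    return attention, prefix_caching


print(resolve_attention(None))                      # ('flashinfer', True)  <- buggy offline path
print(resolve_attention(Config(vision_config={})))  # ('flashdecoding', False) <- expected VLM path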
It seems this is somehow related to this other issue.
In the meantime, manually setting PREFIX_CACHING=0 should work. cc. @danieldk
Thanks @andrewrreed, perfect observation! It makes sense. I have started using the models successfully by setting PREFIX_CACHING=0 for now. Appreciate the help.