
NotImplementedError: Vlm do not work with prefix caching yet

Open AndriiBihun opened this issue 1 year ago • 10 comments

System Info

Hello,

Model: gemma3
TGI version: 3.2.0
Graphics card: 1 x H100 80GB
OS: Ubuntu 24
Cloud: DigitalOcean

All TGI parameters are left at their defaults.

Logs:

2025-03-13T13:55:10.739163Z INFO text_generation_launcher: Args { model_id: "google/gemma-3-27b-it", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, kv_cache_dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "dd0b9cf5c3f3", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], api_key: None, watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, usage_stats: On, payload_limit: 2000000, enable_prefill_logprobs: false, }
2025-03-13T13:55:12.077667Z INFO text_generation_launcher: Using attention flashinfer - Prefix caching true
2025-03-13T13:55:12.096265Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4096
2025-03-13T13:55:12.096276Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-03-13T13:55:12.096368Z INFO download: text_generation_launcher: Starting check and download process for google/gemma-3-27b-it
2025-03-13T13:55:15.225392Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-03-13T13:55:15.716334Z INFO download: text_generation_launcher: Successfully downloaded weights for google/gemma-3-27b-it
2025-03-13T13:55:15.716547Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-03-13T13:55:18.850149Z INFO text_generation_launcher: Using prefix caching = True
2025-03-13T13:55:18.850171Z INFO text_generation_launcher: Using Attention = flashinfer
2025-03-13T13:55:24.517462Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/usr/src/.venv/bin/text-generation-server", line 10, in <module>
    sys.exit(app())
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 323, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 743, in main
    return _main(
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/src/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/usr/src/.venv/lib/python3.11/site-packages/typer/main.py", line 698, in wrapper
    return callback(**use_params)
  File "/usr/src/server/text_generation_server/cli.py", line 119, in serve
    server.serve(
  File "/usr/src/server/text_generation_server/server.py", line 315, in serve
    asyncio.run(
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
    return runner.run(main)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
    self.run_forever()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
    self._run_once()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
    handle._run()
  File "/root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11/asyncio/events.py", line 84, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/src/server/text_generation_server/server.py", line 268, in serve_inner
    model = get_model_with_lora_adapters(
  File "/usr/src/server/text_generation_server/models/__init__.py", line 1690, in get_model_with_lora_adapters
    model = get_model(
  File "/usr/src/server/text_generation_server/models/__init__.py", line 1159, in get_model
    return VlmCausalLM(
  File "/usr/src/server/text_generation_server/models/vlm_causal_lm.py", line 352, in __init__
    raise NotImplementedError("Vlm do not work with prefix caching yet")
NotImplementedError: Vlm do not work with prefix caching yet
2025-03-13T13:55:25.641160Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2025-03-13 13:55:17.001 | INFO | text_generation_server.utils.import_utils::80 - Detected system cuda /usr/src/server/text_generation_server/layers/gptq/triton.py:242: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead. @custom_fwd(cast_inputs=torch.float16) /usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:158: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead. @custom_fwd /usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/selective_scan_interface.py:231: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead. @custom_bwd /usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:507: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead. @custom_fwd /usr/src/.venv/lib/python3.11/site-packages/mamba_ssm/ops/triton/layernorm.py:566: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead. @custom_bwd ╭───────────────────── Traceback (most recent call last) ──────────────────────╮ │ /usr/src/server/text_generation_server/cli.py:119 in serve │ │ │ │ 116 │ │ raise RuntimeError( │ │ 117 │ │ │ "Only 1 can be set between dtype and quantize, as they │ │ 118 │ │ ) │ │ ❱ 119 │ server.serve( │ │ 120 │ │ model_id, │ │ 121 │ │ lora_adapters, │ │ 122 │ │ revision, │ │ │ │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ │ │ dtype = None │ │ │ │ json_output = True │ │ │ │ kv_cache_dtype = None │ │ │ │ logger_level = 'INFO' │ │ │ │ lora_adapters = [] │ │ │ │ max_input_tokens = None │ │ │ │ model_id = 'google/gemma-3-27b-it' │ │ │ │ otlp_endpoint = None │ │ │ │ otlp_service_name = 'text-generation-inference.router' │ │ │ │ quantize = None │ │ │ │ revision = None │ │ │ │ server = <module 'text_generation_server.server' from │ │ │ │ '/usr/src/server/text_generation_server/server.py'> │ │ │ │ sharded = False │ │ │ │ speculate = None │ │ │ │ trust_remote_code = False │ │ │ │ uds_path = PosixPath('/tmp/text-generation-server') │ │ │ ╰──────────────────────────────────────────────────────────────────────────╯ │ │ │ │ /usr/src/server/text_generation_server/server.py:315 in serve │ │ │ │ 312 │ │ while signal_handler.KEEP_PROCESSING: │ │ 313 │ │ │ await asyncio.sleep(0.5) │ │ 314 │ │ │ ❱ 315 │ asyncio.run( │ │ 316 │ │ serve_inner( │ │ 317 │ │ │ model_id, │ │ 318 │ │ │ lora_adapters, │ │ │ │ ╭─────────────────────────── locals ───────────────────────────╮ │ │ │ dtype = None │ │ │ │ kv_cache_dtype = None │ │ │ │ lora_adapters = [] │ │ │ │ max_input_tokens = None │ │ │ │ model_id = 'google/gemma-3-27b-it' │ │ │ │ quantize = None │ │ │ │ revision = None │ │ │ │ sharded = False │ │ │ │ speculate = None │ │ │ │ trust_remote_code = False │ │ │ │ uds_path = PosixPath('/tmp/text-generation-server') │ │ │ ╰──────────────────────────────────────────────────────────────╯ │ │ │ │ /root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11 │ │ /asyncio/runners.py:190 in run │ │ │ │ 187 │ │ │ "asyncio.run() cannot be called from a running event loop" │ │ 188 │ │ │ 189 │ with Runner(debug=debug) as runner: │ │ ❱ 190 │ │ return runner.run(main) │ │ 191 │ │ 192 │ │ 193 def _cancel_all_tasks(loop): │ │ │ │ 
╭───────────────────────────────── locals ─────────────────────────────────╮ │ │ │ debug = None │ │ │ │ main = <coroutine object serve..serve_inner at 0x7fd5592317e0> │ │ │ │ runner = <asyncio.runners.Runner object at 0x7fd9057aead0> │ │ │ ╰──────────────────────────────────────────────────────────────────────────╯ │ │ │ │ /root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11 │ │ /asyncio/runners.py:118 in run │ │ │ │ 115 │ │ │ │ 116 │ │ self._interrupt_count = 0 │ │ 117 │ │ try: │ │ ❱ 118 │ │ │ return self._loop.run_until_complete(task) │ │ 119 │ │ except exceptions.CancelledError: │ │ 120 │ │ │ if self._interrupt_count > 0: │ │ 121 │ │ │ │ uncancel = getattr(task, "uncancel", None) │ │ │ │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ │ │ context = <_contextvars.Context object at 0x7fd55b506c40> │ │ │ │ coro = <coroutine object serve..serve_inner at │ │ │ │ 0x7fd5592317e0> │ │ │ │ self = <asyncio.runners.Runner object at 0x7fd9057aead0> │ │ │ │ sigint_handler = functools.partial(<bound method Runner._on_sigint of │ │ │ │ <asyncio.runners.Runner object at 0x7fd9057aead0>>, │ │ │ │ main_task=<Task finished name='Task-1' │ │ │ │ coro=<serve..serve_inner() done, defined at │ │ │ │ /usr/src/server/text_generation_server/server.py:244> │ │ │ │ exception=NotImplementedError('Vlm do not work with │ │ │ │ prefix caching yet')>) │ │ │ │ task = <Task finished name='Task-1' │ │ │ │ coro=<serve..serve_inner() done, defined at │ │ │ │ /usr/src/server/text_generation_server/server.py:244> │ │ │ │ exception=NotImplementedError('Vlm do not work with │ │ │ │ prefix caching yet')> │ │ │ ╰──────────────────────────────────────────────────────────────────────────╯ │ │ │ │ /root/.local/share/uv/python/cpython-3.11.11-linux-x86_64-gnu/lib/python3.11 │ │ /asyncio/base_events.py:654 in run_until_complete │ │ │ │ 651 │ │ if not future.done(): │ │ 652 │ │ │ raise RuntimeError('Event loop stopped before Future comp │ │ 653 │ │ │ │ ❱ 654 │ │ return future.result() │ │ 655 │ │ │ 656 │ def stop(self): │ │ 657 │ │ """Stop running the event loop. 
│ │ │ │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ │ │ future = <Task finished name='Task-1' │ │ │ │ coro=<serve..serve_inner() done, defined at │ │ │ │ /usr/src/server/text_generation_server/server.py:244> │ │ │ │ exception=NotImplementedError('Vlm do not work with prefix │ │ │ │ caching yet')> │ │ │ │ new_task = False │ │ │ │ self = <_UnixSelectorEventLoop running=False closed=True │ │ │ │ debug=False> │ │ │ ╰──────────────────────────────────────────────────────────────────────────╯ │ │ │ │ /usr/src/server/text_generation_server/server.py:268 in serve_inner │ │ │ │ 265 │ │ │ server_urls = [local_url] │ │ 266 │ │ │ │ 267 │ │ try: │ │ ❱ 268 │ │ │ model = get_model_with_lora_adapters( │ │ 269 │ │ │ │ model_id, │ │ 270 │ │ │ │ lora_adapters, │ │ 271 │ │ │ │ revision, │ │ │ │ ╭──────────────────────────── locals ─────────────────────────────╮ │ │ │ adapter_to_index = {} │ │ │ │ dtype = None │ │ │ │ kv_cache_dtype = None │ │ │ │ local_url = 'unix:///tmp/text-generation-server-0' │ │ │ │ lora_adapters = [] │ │ │ │ max_input_tokens = None │ │ │ │ model_id = 'google/gemma-3-27b-it' │ │ │ │ quantize = None │ │ │ │ revision = None │ │ │ │ server_urls = ['unix:///tmp/text-generation-server-0'] │ │ │ │ sharded = False │ │ │ │ speculate = None │ │ │ │ trust_remote_code = False │ │ │ │ uds_path = PosixPath('/tmp/text-generation-server') │ │ │ │ unix_socket_template = 'unix://{}-{}' │ │ │ ╰─────────────────────────────────────────────────────────────────╯ │ │ │ │ /usr/src/server/text_generation_server/models/init.py:1690 in │ │ get_model_with_lora_adapters │ │ │ │ 1687 │ adapter_to_index: Dict[str, int], │ │ 1688 ): │ │ 1689 │ lora_adapter_ids = [adapter.id for adapter in lora_adapters] │ │ ❱ 1690 │ model = get_model( │ │ 1691 │ │ model_id, │ │ 1692 │ │ lora_adapter_ids, │ │ 1693 │ │ revision, │ │ │ │ ╭────────────────── locals ───────────────────╮ │ │ │ adapter_to_index = {} │ │ │ │ dtype = None │ │ │ │ kv_cache_dtype = None │ │ │ │ lora_adapter_ids = [] │ │ │ │ lora_adapters = [] │ │ │ │ max_input_tokens = None │ │ │ │ model_id = 'google/gemma-3-27b-it' │ │ │ │ quantize = None │ │ │ │ revision = None │ │ │ │ sharded = False │ │ │ │ speculate = None │ │ │ │ trust_remote_code = False │ │ │ ╰─────────────────────────────────────────────╯ │ │ │ │ /usr/src/server/text_generation_server/models/init.py:1159 in get_model │ │ │ │ 1156 │ elif model_type == GEMMA3: │ │ 1157 │ │ if FLASH_ATTENTION: │ │ 1158 │ │ │ # TODO: Use VlmCausalLM when image support is added. 
│ │ ❱ 1159 │ │ │ return VlmCausalLM( │ │ 1160 │ │ │ │ model_id=model_id, │ │ 1161 │ │ │ │ model_class=Gemma3ForConditionalGeneration, │ │ 1162 │ │ │ │ revision=revision, │ │ │ │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ │ │ _ = {} │ │ │ │ compressed_tensors_config = None │ │ │ │ config_dict = { │ │ │ │ │ 'architectures': [ │ │ │ │ │ │ 'Gemma3ForConditionalGeneration' │ │ │ │ │ ], │ │ │ │ │ 'boi_token_index': 255999, │ │ │ │ │ 'eoi_token_index': 256000, │ │ │ │ │ 'eos_token_id': [1, 106], │ │ │ │ │ 'image_token_index': 262144, │ │ │ │ │ 'initializer_range': 0.02, │ │ │ │ │ 'mm_tokens_per_image': 256, │ │ │ │ │ 'model_type': 'gemma3', │ │ │ │ │ 'text_config': { │ │ │ │ │ │ 'head_dim': 128, │ │ │ │ │ │ 'hidden_size': 5376, │ │ │ │ │ │ 'intermediate_size': 21504, │ │ │ │ │ │ 'model_type': 'gemma3_text', │ │ │ │ │ │ 'num_attention_heads': 32, │ │ │ │ │ │ 'num_hidden_layers': 62, │ │ │ │ │ │ 'num_key_value_heads': 16, │ │ │ │ │ │ 'query_pre_attn_scalar': 168, │ │ │ │ │ │ 'rope_scaling': { │ │ │ │ │ │ │ 'factor': 8.0, │ │ │ │ │ │ │ 'rope_type': 'linear' │ │ │ │ │ │ }, │ │ │ │ │ │ 'sliding_window': 1024 │ │ │ │ │ }, │ │ │ │ │ 'torch_dtype': 'bfloat16', │ │ │ │ │ ... +3 │ │ │ │ } │ │ │ │ dtype = None │ │ │ │ kv_cache_dtype = None │ │ │ │ kv_cache_scheme = None │ │ │ │ lora_adapter_ids = [] │ │ │ │ max_input_tokens = None │ │ │ │ method = 'n-gram' │ │ │ │ model_id = 'google/gemma-3-27b-it' │ │ │ │ model_type = 'gemma3' │ │ │ │ needs_sliding_window = False │ │ │ │ quantization_config = None │ │ │ │ quantize = None │ │ │ │ revision = None │ │ │ │ sharded = False │ │ │ │ sliding_window = -1 │ │ │ │ speculate = 0 │ │ │ │ speculator = None │ │ │ │ trust_remote_code = False │ │ │ │ use_sliding_window = False │ │ │ ╰──────────────────────────────────────────────────────────────────────────╯ │ │ │ │ /usr/src/server/text_generation_server/models/vlm_causal_lm.py:352 in │ │ init │ │ │ │ 349 │ │ **kwargs, │ │ 350 │ ): │ │ 351 │ │ if PREFIX_CACHING: │ │ ❱ 352 │ │ │ raise NotImplementedError("Vlm do not work with prefix cac │ │ 353 │ │ if processor_kwargs is None: │ │ 354 │ │ │ processor_kwargs = {} │ │ 355 │ │ self.processor = processor_class.from_pretrained( │ │ │ │ ╭───────────────────────────────── locals ─────────────────────────────────╮ │ │ │ kwargs = { │ │ │ │ │ 'model_class': <class │ │ │ │ 'text_generation_server.models.custom_modeling.flas… │ │ │ │ │ 'quantize': None, │ │ │ │ │ 'speculator': None, │ │ │ │ │ 'dtype': None, │ │ │ │ │ 'kv_cache_dtype': None, │ │ │ │ │ 'config_class': <class │ │ │ │ 'text_generation_server.models.custom_modeling.gemm… │ │ │ │ │ 'default_dtype': torch.bfloat16, │ │ │ │ │ 'lora_adapter_ids': [] │ │ │ │ } │ │ │ │ model_id = 'google/gemma-3-27b-it' │ │ │ │ processor_kwargs = None │ │ │ │ revision = None │ │ │ │ self = <text_generation_server.models.vlm_causal_lm.VlmCau… │ │ │ │ object at 0x7fd5c0665b10> │ │ │ │ trust_remote_code = False │ │ │ ╰──────────────────────────────────────────────────────────────────────────╯ │ ╰──────────────────────────────────────────────────────────────────────────────╯ NotImplementedError: Vlm do not work with prefix caching yet rank=0 2025-03-13T13:55:25.723052Z ERROR text_generation_launcher: Shard 0 failed to start 2025-03-13T13:55:25.723070Z INFO text_generation_launcher: Shutting down shards Error: ShardCannotStart

Information

  • [x] Docker
  • [ ] The CLI directly

Tasks

  • [x] An officially supported command
  • [ ] My own modifications

Reproduction

docker run -d --name tgi \
  --gpus all \
  -e MODEL_ID=google/gemma-3-27b-it \
  -e HF_HOME=/HF_CACHE \
  -p 127.0.0.1:8080:8080 \
  -v "/home/huggningface_cache/:/HF_CACHE" \
  ghcr.io/huggingface/text-generation-inference:3.2.0

Expected behavior

The server should just start.

AndriiBihun avatar Mar 13 '25 14:03 AndriiBihun

Why doesn't it work?

AndriiBihun avatar Mar 13 '25 14:03 AndriiBihun

Simply set PREFIX_CACHE=0 in the env. Why isn't this documented anywhere? 🤯

AndriiBihun avatar Mar 13 '25 22:03 AndriiBihun

I have a similar issue:

NotImplementedError: Vlm do not work with prefix caching yet rank=0
2025-03-14T13:03:29.652543Z ERROR text_generation_launcher: Shard 0 failed to start
2025-03-14T13:03:29.652573Z  INFO text_generation_launcher: Shutting down shards

Even passing PREFIX_CACHE=0 via docker env doesn't help.

maziyarpanahi avatar Mar 14 '25 13:03 maziyarpanahi

> Simply set PREFIX_CACHE=0 in the env. Why isn't this documented anywhere? 🤯

Because prefix caching should be disabled automatically for VLMs. Could you give more information on how you are starting TGI? I cannot reproduce this locally:

$ text-generation-launcher --model-id google/gemma-3-27b-it --num-shard 4
[...]
2025-03-17T09:48:59.871425Z  INFO text_generation_launcher: Disabling prefix caching because of VLM model
2025-03-17T09:48:59.871451Z  INFO text_generation_launcher: Forcing attention to 'flashdecoding' because head dim is not supported by flashinfer, also disabling prefix caching
2025-03-17T09:48:59.871455Z  INFO text_generation_launcher: Using attention flashdecoding - Prefix caching 0
[...]

danieldk avatar Mar 17 '25 09:03 danieldk

> Could you give more information on how you are starting TGI?

In the project I'm using Docker Compose, but for reproduction I tested directly with docker run. I'm not sure whether there is a difference between running via Docker and the TGI launcher itself, but this is not working 👇

docker run -d --name tgi \
  --gpus all \
  -e MODEL_ID=google/gemma-3-27b-it \
  -e HF_HOME=/HF_CACHE \
  -p 127.0.0.1:8080:8080 \
  -v "/home/huggningface_cache/:/HF_CACHE" \
  ghcr.io/huggingface/text-generation-inference:3.2.0

AndriiBihun avatar Mar 17 '25 10:03 AndriiBihun

> Simply set PREFIX_CACHE=0 in the env. Why isn't this documented anywhere? 🤯
>
> Because prefix caching should be disabled automatically for VLMs. Could you give more information on how you are starting TGI? I cannot reproduce this locally:
>
> $ text-generation-launcher --model-id google/gemma-3-27b-it --num-shard 4
> [...]
> 2025-03-17T09:48:59.871425Z  INFO text_generation_launcher: Disabling prefix caching because of VLM model
> 2025-03-17T09:48:59.871451Z  INFO text_generation_launcher: Forcing attention to 'flashdecoding' because head dim is not supported by flashinfer, also disabling prefix caching
> 2025-03-17T09:48:59.871455Z  INFO text_generation_launcher: Using attention flashdecoding - Prefix caching 0
> [...]

Thanks, I am also starting it similarly. I've never specified anything regarding prefix caching. By default it seems this should already be disabled, so I am a bit puzzled about how it gets enabled and why it isn't being disabled when I pass PREFIX_CACHE=0 via -e into the Docker container.

How I launch:

docker run \
  --name "${NAME}" \
  --gpus all \
  --shm-size 8g \
  -p $PORT:80 \
  -e HUGGING_FACE_HUB_TOKEN=... \
  -e CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
  -e HF_HUB_OFFLINE=0 \
  -e TRUST_REMOTE_CODE=true \
  -v $volume:/data \
  --detach \
  ghcr.io/huggingface/text-generation-inference:3.2.0 \
  --model-id $model_id \
  $revision_flag \
  --sharded $SHARDED \
  $quantize_flag \
  --num-shard $num_shard \
  --cuda-memory-fraction=$cuda_fraction \
  $rope_flag \
  --max-input-length=$MAX_TOKEN_LENGTH \
  --max-total-tokens=$MAX_TOTAL_TOKENS \

maziyarpanahi avatar Mar 17 '25 14:03 maziyarpanahi

Hi guys, try passing PREFIX_CACHING=0 instead of PREFIX_CACHE=0.

EgorSWEB avatar Mar 18 '25 13:03 EgorSWEB

> Hi guys, try passing PREFIX_CACHING=0 instead of PREFIX_CACHE=0.

Thanks @EgorSWEB - that worked!

I can see there is logic here to set the value dynamically, but I think it fails to recognize that the model is a VLM, so it never automatically sets the value to 0.

https://github.com/huggingface/text-generation-inference/blob/e497bc09f6107baae4f06d6d31fc18730d0970c3/server/text_generation_server/models/globals.py#L10
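
For reference, that line boils down to an env-driven boolean flag. A minimal sketch of that pattern (illustrative only, not the actual TGI code; the helper name and the default here are assumptions) shows why a misspelled PREFIX_CACHE variable is silently ignored:

import os

def _env_flag(name: str, default: bool) -> bool:
    # Hypothetical helper: read a boolean flag from the environment.
    # A variable with any other name (e.g. PREFIX_CACHE) is never consulted.
    value = os.environ.get(name)
    if value is None:
        return default
    return value.lower() in {"1", "true"}

# Illustrative default; in TGI the effective value is decided by the launcher and passed down.
PREFIX_CACHING = _env_flag("PREFIX_CACHING", default=True)

So unless the variable is literally named PREFIX_CACHING, the override never reaches the server.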

maziyarpanahi avatar Mar 18 '25 15:03 maziyarpanahi

It seems the reason TGI is failing to recognize the model as a VLM is an issue that occurs when HF_HUB_OFFLINE=1 is set.

I think there's a mismatch between how the launcher and the router try to resolve the config in offline mode. For example, I think the reason the launcher ends up setting prefix caching to true is that:

  • get_config() by default tries to use the ApiBuilder which reaches out to the Hub, and since there's no token env var, the request fails: Err(RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/google/gemma-3-4b-it/resolve/main/config.json])))
  • This then sets the Config here to None (when really it should raise an error)
  • Since the config is None, the resolve_attention function skips the if config.vision_config.is_some() check completely, and thus sets attention to "flashinfer" and prefix_caching to "true" (see the sketch below)
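
The launcher side is Rust, but the fall-through is easy to sketch in Python (the names and defaults below are made up for illustration; this is not the launcher's actual code):

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class VisionConfig:
    pass

@dataclass
class Config:
    vision_config: Optional[VisionConfig] = None

def resolve_attention(config: Optional[Config]) -> Tuple[str, bool]:
    # Defaults used when nothing forces a downgrade.
    attention, prefix_caching = "flashinfer", True
    if config is not None and config.vision_config is not None:
        # VLM detected: prefix caching is not supported yet.
        attention, prefix_caching = "flashdecoding", False
    return attention, prefix_caching

# Offline/401 case: config resolution fails and yields None,
# so the VLM branch is never taken and the defaults win.
print(resolve_attention(None))                                  # ('flashinfer', True)
print(resolve_attention(Config(vision_config=VisionConfig())))  # ('flashdecoding', False)

The server then detects the VLM correctly at model load time and raises the NotImplementedError above.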

It seems this is somehow related to this other issue.

In the meantime, manually setting PREFIX_CACHING=0 should work. cc. @danieldk

andrewrreed avatar Mar 24 '25 13:03 andrewrreed

Thanks @andrewrreed, perfect observation! It makes sense. I have started using the models successfully by setting PREFIX_CACHING=0 for now. Appreciate the help.

maziyarpanahi avatar Mar 24 '25 14:03 maziyarpanahi