TGI latest-intel-cpu image doesn't work with some models
After updating the TGI version to ghcr.io/huggingface/text-generation-inference:latest-intel-cpu, the CodeGen test failed with the following 2 models:
ise-uiuc/Magicoder-S-DS-6.7B
m-a-p/OpenCodeInterpreter-DS-6.7B
The latter is mentioned in the README file of CodeGen: https://github.com/opea-project/GenAIExamples/tree/main/CodeGen
The default model (meta-llama/CodeLlama-7b-hf) specified by docker-compose runs fine.
What is the issue you are facing? Can you please post the error log from Docker here?
I'm using helm install to test: https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/common/tgi
Using a command like this:
helm install tgi tgi --set LLM_MODEL_ID=ise-uiuc/Magicoder-S-DS-6.7B
Error message/pod logs:
{"timestamp":"2024-08-19T05:38:39.361300Z","level":"INFO","fields":{"message":"Args {\n model_id: "ise-uiuc/Magicoder-S-DS-6.7B",\n revision: None,\n validation_workers: 2,\n sharded: None,\n num_shard: None,\n quantize: None,\n speculate: None,\n dtype: None,\n trust_remote_code: false,\n max_concurrent_requests: 128,\n max_best_of: 2,\n max_stop_sequences: 4,\n max_top_n_tokens: 5,\n max_input_tokens: None,\n max_input_length: None,\n max_total_tokens: None,\n waiting_served_ratio: 0.3,\n max_batch_prefill_tokens: None,\n max_batch_total_tokens: None,\n max_waiting_tokens: 20,\n max_batch_size: None,\n cuda_graphs: None,\n hostname: "tgi-874bfcffc-c4wst",\n port: 2080,\n shard_uds_path: "/tmp/text-generation-server",\n master_addr: "localhost",\n master_port: 29500,\n huggingface_hub_cache: Some(\n "/data",\n ),\n weights_cache_override: None,\n disable_custom_kernels: false,\n cuda_memory_fraction: 1.0,\n rope_scaling: None,\n rope_factor: None,\n json_output: true,\n otlp_endpoint: None,\n otlp_service_name: "text-generation-inference.router",\n cors_allow_origin: [],\n api_key: None,\n watermark_gamma: None,\n watermark_delta: None,\n ngrok: false,\n ngrok_authtoken: None,\n ngrok_edge: None,\n tokenizer_config_path: None,\n disable_grammar_support: false,\n env: false,\n max_client_batch_size: 4,\n lora_adapters: None,\n usage_stats: On,\n}"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361458Z","level":"INFO","fields":{"message":"Token file not found "/tmp/.cache/huggingface/token"","log.target":"hf_hub","log.module_path":"hf_hub","log.file":"/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs","log.line":55},"target":"hf_hub"}
{"timestamp":"2024-08-19T05:38:39.361623Z","level":"INFO","fields":{"message":"Model supports up to 16384 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using --max-batch-prefill-tokens=16434 --max-total-tokens=16384 --max-input-tokens=16383."},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361636Z","level":"INFO","fields":{"message":"Default max_input_tokens to 4095"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361640Z","level":"INFO","fields":{"message":"Default max_total_tokens to 4096"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361643Z","level":"INFO","fields":{"message":"Default max_batch_prefill_tokens to 4145"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361648Z","level":"INFO","fields":{"message":"Using default cuda graphs [1, 2, 4, 8, 16, 32]"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:39.361854Z","level":"INFO","fields":{"message":"Starting check and download process for ise-uiuc/Magicoder-S-DS-6.7B"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2024-08-19T05:38:42.469115Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download."},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:43.169166Z","level":"INFO","fields":{"message":"Successfully downloaded weights for ise-uiuc/Magicoder-S-DS-6.7B"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2024-08-19T05:38:43.169575Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-08-19T05:38:46.051416Z","level":"WARN","fields":{"message":"FBGEMM fp8 kernels are not installed."},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:46.070139Z","level":"INFO","fields":{"message":"Using Attention = False"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:46.070193Z","level":"INFO","fields":{"message":"Using Attention = paged"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:46.123324Z","level":"WARN","fields":{"message":"Could not import Mamba: No module named 'mamba_ssm'"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:46.294082Z","level":"INFO","fields":{"message":"affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47}, membind = {0}"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:46.662238Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File "/opt/conda/bin/text-generation-server", line 8, in TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.\n warnings.warn(\n2024-08-19 05:38:45.737 | INFO | text_generation_server.utils.import_utils:dtype and quantize, as they │\n│ 108 │ │ ) │\n│ ❱ 109 │ server.serve( │\n│ 110 │ │ model_id, │\n│ 111 │ │ lora_adapters, │\n│ 112 │ │ revision, │\n│ │\n│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │\n│ │ dtype = None │ │\n│ │ json_output = True │ │\n│ │ logger_level = 'INFO' │ │\n│ │ lora_adapters = [] │ │\n│ │ max_input_tokens = 4095 │ │\n│ │ model_id = 'ise-uiuc/Magicoder-S-DS-6.7B' │ │\n│ │ otlp_endpoint = None │ │\n│ │ otlp_service_name = 'text-generation-inference.router' │ │\n│ │ quantize = None │ │\n│ │ revision = None │ │\n│ │ server = <module 'text_generation_server.server' from │ │\n│ │ '/opt/conda/lib/python3.10/site-packages/text_gener… │ │\n│ │ setup_tracing = <function setup_tracing at 0x7f9f4843c9d0> │ │\n│ │ sharded = False │ │\n│ │ speculate = None │ │\n│ │ trust_remote_code = False │ │\n│ │ uds_path = PosixPath('/tmp/text-generation-server') │ │\n│ ╰──────────────────────────────────────────────────────────────────────────╯ │\n│ │\n│ /opt/conda/lib/python3.10/site-packages/text_generation_server/server.py:274 │\n│ in serve │\n│ │\n│ 271 │ │ while signal_handler.KEEP_PROCESSING: │\n│ 272 │ │ │ await asyncio.sleep(0.5) │\n│ 273 │ │\n│ ❱ 274 │ asyncio.run( │\n│ 275 │ │ serve_inner( │\n│ 276 │ │ │ model_id, │\n│ 277 │ │ │ lora_adapters, │\n│ │\n│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │\n│ │ dtype = None │ │\n│ │ lora_adapters = [] │ │\n│ │ max_input_tokens = 4095 │ │\n│ │ model_id = 'ise-uiuc/Magicoder-S-DS-6.7B' │ │\n│ │ quantize = None │ │\n│ │ revision = None │ │\n│ │ serve_inner = <function serve.ret │\n│ 770 │ │ config_dict[\"attn_implementation\"] = kwargs.pop(\"attn_impleme │\n│ 771 │ │ │\n│ ❱ 772 │ │ config = cls(**config_dict) │\n│ 773 │ │ │\n│ 774 │ │ if hasattr(config, \"pruned_heads\"): │\n│ 775 │ │ │ config.pruned_heads = {int(key): value for key, value in │\n│ │\n│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │\n│ │ cls = <class │ │\n│ │ 'transformers.models.llama.configuration_llama.L… │ │\n│ │ config_dict = { │ │\n│ │ │ '_name_or_path': │ │\n│ │ 'ise-uiuc/Magicoder-S-DS-6.7B', │ │\n│ │ │ 'architectures': ['LlamaForCausalLM'], │ │\n│ │ │ 'attention_bias': False, │ │\n│ │ │ 'attention_dropout': 0.0, │ │\n│ │ │ 'bos_token_id': 32013, │ │\n│ │ │ 'eos_token_id': 32014, │ │\n│ │ │ 'hidden_act': 'silu', │ │\n│ │ │ 'hidden_size': 4096, │ │\n│ │ │ 'initializer_range': 0.02, │ │\n│ │ │ 'intermediate_size': 11008, │ │\n│ │ │ ... 
+16 │ │\n│ │ } │ │\n│ │ kwargs = {'name_or_path': 'ise-uiuc/Magicoder-S-DS-6.7B'} │ │\n│ │ return_unused_kwargs = False │ │\n│ ╰──────────────────────────────────────────────────────────────────────────╯ │\n│ │\n│ /opt/conda/lib/python3.10/site-packages/transformers/models/llama/configurat │\n│ ion_llama.py:192 in __init__ │\n│ │\n│ 189 │ │ self.mlp_bias = mlp_bias │\n│ 190 │ │ │\n│ 191 │ │ # Validate the correctness of rotary position embeddings param │\n│ ❱ 192 │ │ rope_config_validation(self) │\n│ 193 │ │ │\n│ 194 │ │ super().__init__( │\n│ 195 │ │ │ pad_token_id=pad_token_id, │\n│ │\n│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │\n│ │ attention_bias = False │ │\n│ │ attention_dropout = 0.0 │ │\n│ │ bos_token_id = 32013 │ │\n│ │ eos_token_id = 32014 │ │\n│ │ hidden_act = 'silu' │ │\n│ │ hidden_size = 4096 │ │\n│ │ initializer_range = 0.02 │ │\n│ │ intermediate_size = 11008 │ │\n│ │ kwargs = { │ │\n│ │ │ '_name_or_path': │ │\n│ │ 'ise-uiuc/Magicoder-S-DS-6.7B', │ │\n│ │ │ 'architectures': ['LlamaForCausalLM'], │ │\n│ │ │ 'model_type': 'llama', │ │\n│ │ │ 'torch_dtype': 'float32', │ │\n│ │ │ 'transformers_version': '4.36.0.dev0', │ │\n│ │ │ '_commit_hash': │ │\n│ │ 'b3ed7cb1578a3643ceaf2ebf996a3d8e85f75d8f', │ │\n│ │ │ 'attn_implementation': None │ │\n│ │ } │ │\n│ │ max_position_embeddings = 16384 │ │\n│ │ mlp_bias = False │ │\n│ │ num_attention_heads = 32 │ │\n│ │ num_hidden_layers = 32 │ │\n│ │ num_key_value_heads = 32 │ │\n│ │ pad_token_id = None │ │\n│ │ pretraining_tp = 1 │ │\n│ │ rms_norm_eps = 1e-06 │ │\n│ │ rope_scaling = {'factor': 4.0, 'type': 'linear'} │ │\n│ │ rope_theta = 100000 │ │\n│ │ self = LlamaConfig { │ │\n│ │ \"attention_bias\": false, │ │\n│ │ \"attention_dropout\": 0.0, │ │\n│ │ \"hidden_act\": \"silu\", │ │\n│ │ \"hidden_size\": 4096, │ │\n│ │ \"initializer_range\": 0.02, │ │\n│ │ \"intermediate_size\": 11008, │ │\n│ │ \"max_position_embeddings\": 16384, │ │\n│ │ \"mlp_bias\": false, │ │\n│ │ \"model_type\": \"llama\", │ │\n│ │ \"num_attention_heads\": 32, │ │\n│ │ \"num_hidden_layers\": 32, │ │\n│ │ \"num_key_value_heads\": 32, │ │\n│ │ \"pretraining_tp\": 1, │ │\n│ │ \"rms_norm_eps\": 1e-06, │ │\n│ │ \"rope_scaling\": { │ │\n│ │ │ \"factor\": 4.0, │ │\n│ │ │ \"type\": \"linear\" │ │\n│ │ }, │ │\n│ │ \"rope_theta\": 100000, │ │\n│ │ \"transformers_version\": \"4.43.1\", │ │\n│ │ \"use_cache\": true, │ │\n│ │ \"vocab_size\": 32256 │ │\n│ │ } │ │\n│ │ tie_word_embeddings = False │ │\n│ │ use_cache = True │ │\n│ │ vocab_size = 32256 │ │\n│ ╰──────────────────────────────────────────────────────────────────────────╯ │\n│ │\n│ /opt/conda/lib/python3.10/site-packages/transformers/modeling_rope_utils.py: │\n│ 546 in rope_config_validation │\n│ │\n│ 543 │ │\n│ 544 │ validation_fn = ROPE_VALIDATION_FUNCTIONS.get(rope_type) │\n│ 545 │ if validation_fn is not None: │\n│ ❱ 546 │ │ validation_fn(config) │\n│ 547 │ else: │\n│ 548 │ │ raise ValueError( │\n│ 549 │ │ │ f\"Missing validation function mapping in ROPE_VALIDATION │\n│ │\n│ ╭───────────────────────────────── locals ─────────────────────────────────╮ │\n│ │ config = LlamaConfig { │ │\n│ │ "attention_bias": false, │ │\n│ │ "attention_dropout": 0.0, │ │\n│ │ "hidden_act": "silu", │ │\n│ │ "hidden_size": 4096, │ │\n│ │ "initializer_range": 0.02, │ │\n│ │ "intermediate_size": 11008, │ │\n│ │ "max_position_embeddings": 16384, │ │\n│ │ "mlp_bias": false, │ │\n│ │ "model_type": "llama", │ │\n│ │ "num_attention_heads": 32, │ │\n│ │ "num_hidden_layers": 32, │ │\n│ │ "num_key_value_heads": 32, │ 
│\n│ │ "pretraining_tp": 1, │ │\n│ │ "rms_norm_eps": 1e-06, │ │\n│ │ "rope_scaling": { │ │\n│ │ │ "factor": 4.0, │ │\n│ │ │ "type": "linear" │ │\n│ │ }, │ │\n│ │ "rope_theta": 100000, │ │\n│ │ "transformers_version": "4.43.1", │ │\n│ │ "use_cache": true, │ │\n│ │ "vocab_size": 32256 │ │\n│ │ } │ │\n│ │ possible_rope_types = { │ │\n│ │ │ 'longrope', │ │\n│ │ │ 'yarn', │ │\n│ │ │ 'default', │ │\n│ │ │ 'llama3', │ │\n│ │ │ 'linear', │ │\n│ │ │ 'dynamic' │ │\n│ │ } │ │\n│ │ rope_scaling = {'factor': 4.0, 'type': 'linear'} │ │\n│ │ rope_type = 'linear' │ │\n│ │ validation_fn = <function _validate_linear_scaling_rope_parameters │ │\n│ │ at 0x7f9f4821f250> │ │\n│ ╰──────────────────────────────────────────────────────────────────────────╯ │\n│ │\n│ /opt/conda/lib/python3.10/site-packages/transformers/modeling_rope_utils.py: │\n│ 379 in _validate_linear_scaling_rope_parameters │\n│ │\n│ 376 │\n│ 377 def _validate_linear_scaling_rope_parameters(config: PretrainedConfig) │\n│ 378 │ rope_scaling = config.rope_scaling │\n│ ❱ 379 │ rope_type = rope_scaling["rope_type"] │\n│ 380 │ required_keys = {"rope_type", "factor"} │\n│ 381 │ received_keys = set(rope_scaling.keys()) │\n│ 382 │ _check_received_keys(rope_type, received_keys, required_keys) │\n│ │\n│ ╭────────────────────── locals ──────────────────────╮ │\n│ │ config = LlamaConfig { │ │\n│ │ "attention_bias": false, │ │\n│ │ "attention_dropout": 0.0, │ │\n│ │ "hidden_act": "silu", │ │\n│ │ "hidden_size": 4096, │ │\n│ │ "initializer_range": 0.02, │ │\n│ │ "intermediate_size": 11008, │ │\n│ │ "max_position_embeddings": 16384, │ │\n│ │ "mlp_bias": false, │ │\n│ │ "model_type": "llama", │ │\n│ │ "num_attention_heads": 32, │ │\n│ │ "num_hidden_layers": 32, │ │\n│ │ "num_key_value_heads": 32, │ │\n│ │ "pretraining_tp": 1, │ │\n│ │ "rms_norm_eps": 1e-06, │ │\n│ │ "rope_scaling": { │ │\n│ │ │ "factor": 4.0, │ │\n│ │ │ "type": "linear" │ │\n│ │ }, │ │\n│ │ "rope_theta": 100000, │ │\n│ │ "transformers_version": "4.43.1", │ │\n│ │ "use_cache": true, │ │\n│ │ "vocab_size": 32256 │ │\n│ │ } │ │\n│ │ rope_scaling = {'factor': 4.0, 'type': 'linear'} │ │\n│ ╰────────────────────────────────────────────────────╯ │\n╰──────────────────────────────────────────────────────────────────────────────╯\nKeyError: 'rope_type'"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-08-19T05:38:48.374957Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-19T05:38:48.375003Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
Error: ShardCannotStart
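For reference, the failure reduces to a rope_scaling schema mismatch: the Magicoder config.json carries the legacy layout {"type": "linear", "factor": 4.0} (it was saved with transformers 4.36.0.dev0), while the rope validation in the bundled transformers 4.43.1 looks up a "rope_type" key and hits the KeyError above. Below is a minimal Python sketch of that mismatch, with a hypothetical normalization shim; it is an illustration only, not the actual transformers/TGI code.

```python
# Minimal reproduction sketch of the KeyError: 'rope_type' seen in the shard log.
# The dict mirrors the rope_scaling value dumped in the traceback; the
# validate/normalize helpers below are illustrative only, not transformers APIs.

legacy_rope_scaling = {"type": "linear", "factor": 4.0}  # as shipped in the model's config.json


def validate_linear_scaling(rope_scaling: dict) -> None:
    """Mimics the lookup that fails in modeling_rope_utils.py."""
    rope_type = rope_scaling["rope_type"]  # KeyError with the legacy schema
    if rope_type != "linear" or "factor" not in rope_scaling:
        raise ValueError(f"unexpected rope_scaling: {rope_scaling}")


def normalize_rope_scaling(rope_scaling: dict) -> dict:
    """Hypothetical shim: map the legacy 'type' key onto the newer 'rope_type' key."""
    if "rope_type" not in rope_scaling and "type" in rope_scaling:
        return {**rope_scaling, "rope_type": rope_scaling["type"]}
    return rope_scaling


try:
    validate_linear_scaling(legacy_rope_scaling)
except KeyError as err:
    print(f"reproduces the shard failure: KeyError {err}")  # KeyError 'rope_type'

validate_linear_scaling(normalize_rope_scaling(legacy_rope_scaling))  # passes after normalization
```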
latest-intel-cpu is now mentioned only in a GitHub workflow script:
GenAIExamples$ git grep latest-intel-cpu
.github/workflows/scripts/update_images_tag.sh:dict["ghcr.io/huggingface/text-generation-inference"]="docker://ghcr.io/huggingface/text-generation-inference:latest-intel-cpu"
=> can this be closed?
At some specific point in time, the latest version failed. There is no such issue with the current 'latest' version.
Closing this one.
https://github.com/opea-project/GenAIExamples/blob/eb245fd085bd252f77ea922aa953928c7f98014e/.github/workflows/scripts/update_images_tag.sh#L12C9-L12C17
If it is the Hugging Face repo image, the tag should be based on a commit SHA, e.g. "sha-abcdefg-intel-cpu" (commit SHA); otherwise it should be a concrete version like "2.3.1" (the latest release tag).
If we run into this issue again, we can update to the latest stable version/tag.
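To make that convention concrete, here is a small, hypothetical Python sketch (not taken from update_images_tag.sh, which is a shell script) of deriving a pinned image reference from either a commit SHA or a release version instead of the floating latest-intel-cpu tag:

```python
# Hypothetical helper illustrating the tag-pinning convention discussed above.
# The image name is real, but the "sha-<short-sha>-intel-cpu" pattern and the
# 2.3.1 version are examples taken from this thread, not verified tags.

TGI_IMAGE = "ghcr.io/huggingface/text-generation-inference"


def pinned_reference(commit_sha: str | None = None, version: str | None = None) -> str:
    """Build an image reference pinned to a commit SHA tag or a release version tag."""
    if commit_sha:
        return f"{TGI_IMAGE}:sha-{commit_sha[:7]}-intel-cpu"
    if version:
        return f"{TGI_IMAGE}:{version}"
    raise ValueError("pin to a commit SHA or a release version, not to 'latest'")


print(pinned_reference(commit_sha="abcdefg0123456"))  # ...:sha-abcdefg-intel-cpu
print(pinned_reference(version="2.3.1"))              # ...:2.3.1
```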