[model support] please support mamba-codestral-7B-v0.1
https://mistral.ai/news/codestral-mamba/
You can deploy Codestral Mamba using the mistral-inference SDK, which relies on the reference implementations from Mamba’s GitHub repository. The model can also be deployed through TensorRT-LLM. For local inference, keep an eye out for support in llama.cpp. You may download the raw weights from HuggingFace.
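For reference, a minimal sketch of pulling the raw weights locally with `huggingface_hub`. The repo id `mistralai/Mamba-Codestral-7B-v0.1` and the gated-access note are assumptions; adjust to whatever the Hub actually lists:

```python
# Hedged sketch: download the raw Codestral Mamba weights from the Hub.
# The repo id is an assumption; the repo may be gated, in which case run
# `huggingface-cli login` first.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="mistralai/Mamba-Codestral-7B-v0.1",
    local_dir="./mamba-codestral-7B-v0.1",
)
print(f"weights downloaded to {ckpt_dir}")
```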
Unfortunately, the TensorRT-LLM path doesn't work:
File "/home/jet/github/TensorRT-LLM/examples/mamba/convert_checkpoint.py", line 302, in main hf_config, mamba_version = load_config_hf(args.model_dir) File "/home/jet/github/TensorRT-LLM/examples/mamba/convert_checkpoint.py", line 260, in load_config_hf config = json.load(open(resolved_archive_file)) TypeError: expected str, bytes or os.PathLike object, not NoneType
It already supports it. Use the mamba conv1d plugin.
Now we can support the Mamba2 model with the HF Mamba2 config format: https://huggingface.co/state-spaces/mamba2-2.7b/blob/main/config.json. For mamba-codestral-7B-v0.1, you can create a new config.json from the existing params.json, make it match the HF Mamba2 config format, and also rename the tensors in the Codestral checkpoint to align with the HF Mamba2 checkpoints. Then it will work.
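For illustration, a hedged sketch of that conversion. The params.json key names, file names, and the rename rule below are assumptions, not the exact mapping the official fix uses; double-check every value against your local params.json and the reference config linked above.

```python
# Hedged sketch only: build an HF-Mamba2-style config.json from Codestral's
# params.json and rename the checkpoint tensors. Key names, file names, and
# the rename rule are assumptions -- compare against
# https://huggingface.co/state-spaces/mamba2-2.7b/blob/main/config.json and
# your local files before trusting them.
import json
from safetensors.torch import load_file, save_file

model_dir = "./mamba-codestral-7B-v0.1"
with open(f"{model_dir}/params.json") as f:
    params = json.load(f)

hf_config = {
    "model_type": "mamba2",
    "hidden_size": params["dim"],             # assumed params.json key
    "num_hidden_layers": params["n_layers"],  # assumed params.json key
    "vocab_size": params["vocab_size"],
    "n_groups": params.get("n_groups", 8),
    "state_size": 128,                        # double-check against params.json
    "conv_kernel": 4,
    "expand": 2,
    "head_dim": 64,
    "rms_norm": True,
    "residual_in_fp32": True,
    "use_bias": False,
    "use_conv_bias": True,
    "tie_word_embeddings": False,
}
with open(f"{model_dir}/config.json", "w") as f:
    json.dump(hf_config, f, indent=2)

# Rename tensors to the HF Mamba2 naming scheme ("backbone.layers.<i>.mixer...").
# The single replace below is a placeholder; derive the real mapping by diffing
# the Codestral state dict against a state-spaces/mamba2 checkpoint.
state_dict = load_file(f"{model_dir}/consolidated.safetensors")  # file name assumed
renamed = {k.replace("model.", "backbone.", 1): v for k, v in state_dict.items()}
save_file(renamed, f"{model_dir}/model.safetensors")
```

With those two files in place, the directory should look close enough to an HF Mamba2 checkpoint for examples/mamba/convert_checkpoint.py to consume, but treat this as a starting point rather than the exact recipe.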
We will have a fix to directly support mamba-codestral-7B-v0.1 checkpoint soon.
We added a mamba-codestral-7B-v0.1 example in today's update. Please refer to https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mamba and give it a try.
cannot install tensorrt_llm==0.12.0.dev2024072301
You need to reinstall tensorrt_llm.
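If pip claims success but the build still fails, it is worth confirming which build the active environment actually imports. A quick sanity check (the expected version string is just the one mentioned above):

```python
# Verify that the active env has the dev build the mamba example expects;
# the version string below is the one mentioned earlier in this thread.
import tensorrt_llm
print(tensorrt_llm.__version__)  # expect 0.12.0.dev2024072301
```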
Convert is OK now, but trtllm-build failed:
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layer_types = ['recurrent']
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rms_norm = True
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.residual_in_fp32 = True
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.pad_vocab_size_multiple = 1
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rnn_hidden_size = 8192
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rnn_conv_dim_size = 10240
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.state_size = 128
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.conv_kernel = 4
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.use_bias = False
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.mamba_version = Mamba2
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rnn_head_size = 64
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.ngroups = 8
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.chunk_size = 256
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.ssm_rmsnorm = True
[07/25/2024-14:40:30] [TRT-LLM] [I] Compute capability: (8, 9)
[07/25/2024-14:40:30] [TRT-LLM] [I] SM count: 128
[07/25/2024-14:40:30] [TRT-LLM] [I] SM clock: 3120 MHz
[07/25/2024-14:40:30] [TRT-LLM] [I] int4 TFLOPS: 817
[07/25/2024-14:40:30] [TRT-LLM] [I] int8 TFLOPS: 408
[07/25/2024-14:40:30] [TRT-LLM] [I] fp8 TFLOPS: 408
[07/25/2024-14:40:30] [TRT-LLM] [I] float16 TFLOPS: 204
[07/25/2024-14:40:30] [TRT-LLM] [I] bfloat16 TFLOPS: 204
[07/25/2024-14:40:30] [TRT-LLM] [I] float32 TFLOPS: 102
[07/25/2024-14:40:30] [TRT-LLM] [I] Total Memory: 23 GiB
[07/25/2024-14:40:30] [TRT-LLM] [I] Memory clock: 10501 MHz
[07/25/2024-14:40:30] [TRT-LLM] [I] Memory bus width: 384
[07/25/2024-14:40:30] [TRT-LLM] [I] Memory bandwidth: 1008 GB/s
[07/25/2024-14:40:30] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[07/25/2024-14:40:30] [TRT-LLM] [I] PCIe link width: 16
[07/25/2024-14:40:30] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
Traceback (most recent call last):
File "/home/jet/miniforge3/envs/tensorrt-llm/bin/trtllm-build", line 8, in
File "/home/jet/miniforge3/envs/tensorrt-llm/lib/python3.10/site-packages/tensorrt_llm/plugin/plugin.py", line 79, in prop
field_value = getattr(self, storage_name)
AttributeError: 'PluginConfig' object has no attribute '_streamingllm'. Did you mean: '_streamingllm'?
I cannot reproduce this error. Can you share your command?
https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mamba
Sorry, I started a new Python env and it works now. Thanks for that, I will close the issue.
Are there plans to support tp>1 @lfr-0531?
Coming soon.
How can I deploy a Mamba-based classifier, i.e. a PyTorch model that uses Mamba as the backbone but replaces the final layer? Is it enough to modify convert_checkpoint.py, or will I need to modify something else? Also, when I run the Triton server I can't seem to find a working config.pbtxt. Any pointers on which parameters Mamba requires? (I found I have to specify both gpt_model_path and engine_dir, which seems strange.)