[model support] please support mamba-codestral-7B-v0.1
https://mistral.ai/news/codestral-mamba/
You can deploy Codestral Mamba using the mistral-inference SDK, which relies on the reference implementations from Mamba’s GitHub repository. The model can also be deployed through TensorRT-LLM. For local inference, keep an eye out for support in llama.cpp. You may download the raw weights from HuggingFace.
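For reference, a minimal sketch of pulling the raw weights locally with `huggingface_hub`. The repo id `mistralai/Mamba-Codestral-7B-v0.1` and the gated-access note are assumptions; adjust to whatever the Hub actually lists:

```python
# Hedged sketch: download the raw Codestral Mamba weights from the Hub.
# The repo id is an assumption; the repo may be gated, in which case run
# `huggingface-cli login` first.
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="mistralai/Mamba-Codestral-7B-v0.1",
    local_dir="./mamba-codestral-7B-v0.1",
)
print(f"weights downloaded to {ckpt_dir}")
```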
Unfortunately, the TensorRT-LLM path doesn't work:
File "/home/jet/github/TensorRT-LLM/examples/mamba/convert_checkpoint.py", line 302, in main hf_config, mamba_version = load_config_hf(args.model_dir) File "/home/jet/github/TensorRT-LLM/examples/mamba/convert_checkpoint.py", line 260, in load_config_hf config = json.load(open(resolved_archive_file)) TypeError: expected str, bytes or os.PathLike object, not NoneType
It already supports it. Use the mamba conv1d plugin.
Now we can support the Mamba2 model with the HF Mamba2 config format: https://huggingface.co/state-spaces/mamba2-2.7b/blob/main/config.json. For mamba-codestral-7B-v0.1, you can create a new config.json from the existing params.json, make it match the HF Mamba2 config format, and also rename the tensors in the Codestral checkpoint to align with the HF Mamba2 checkpoints. Then it will work.
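For illustration, a hedged sketch of that conversion. The params.json key names, file names, and the rename rule below are assumptions, not the exact mapping the official fix uses; double-check every value against your local params.json and the reference config linked above.

```python
# Hedged sketch only: build an HF-Mamba2-style config.json from Codestral's
# params.json and rename the checkpoint tensors. Key names, file names, and
# the rename rule are assumptions -- compare against
# https://huggingface.co/state-spaces/mamba2-2.7b/blob/main/config.json and
# your local files before trusting them.
import json
from safetensors.torch import load_file, save_file

model_dir = "./mamba-codestral-7B-v0.1"
with open(f"{model_dir}/params.json") as f:
    params = json.load(f)

hf_config = {
    "model_type": "mamba2",
    "hidden_size": params["dim"],             # assumed params.json key
    "num_hidden_layers": params["n_layers"],  # assumed params.json key
    "vocab_size": params["vocab_size"],
    "n_groups": params.get("n_groups", 8),
    "state_size": 128,                        # double-check against params.json
    "conv_kernel": 4,
    "expand": 2,
    "head_dim": 64,
    "rms_norm": True,
    "residual_in_fp32": True,
    "use_bias": False,
    "use_conv_bias": True,
    "tie_word_embeddings": False,
}
with open(f"{model_dir}/config.json", "w") as f:
    json.dump(hf_config, f, indent=2)

# Rename tensors to the HF Mamba2 naming scheme ("backbone.layers.<i>.mixer...").
# The single replace below is a placeholder; derive the real mapping by diffing
# the Codestral state dict against a state-spaces/mamba2 checkpoint.
state_dict = load_file(f"{model_dir}/consolidated.safetensors")  # file name assumed
renamed = {k.replace("model.", "backbone.", 1): v for k, v in state_dict.items()}
save_file(renamed, f"{model_dir}/model.safetensors")
```

With those two files in place, the directory should look close enough to an HF Mamba2 checkpoint for examples/mamba/convert_checkpoint.py to consume, but treat this as a starting point rather than the exact recipe.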
We will have a fix to directly support mamba-codestral-7B-v0.1 checkpoint soon.
We added a mamba-codestral-7B-v0.1 example in today's update. Please refer to https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mamba and give it a try.
cannot install tensorrt_llm==0.12.0.dev2024072301
You need to reinstall tensorrt_llm.
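If pip claims success but the build still fails, it is worth confirming which build the active environment actually imports. A quick sanity check (the expected version string is just the one mentioned above):

```python
# Verify that the active env has the dev build the mamba example expects;
# the version string below is the one mentioned earlier in this thread.
import tensorrt_llm
print(tensorrt_llm.__version__)  # expect 0.12.0.dev2024072301
```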
Convert is OK now, but trtllm-build failed:
[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024072301
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.layer_types = ['recurrent']
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rms_norm = True
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.residual_in_fp32 = True
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.pad_vocab_size_multiple = 1
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rnn_hidden_size = 8192
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rnn_conv_dim_size = 10240
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.state_size = 128
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.conv_kernel = 4
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.use_bias = False
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.mamba_version = Mamba2
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rnn_head_size = 64
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.ngroups = 8
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.chunk_size = 256
[07/25/2024-14:40:30] [TRT-LLM] [W] Implicitly setting PretrainedConfig.ssm_rmsnorm = True
[07/25/2024-14:40:30] [TRT-LLM] [I] Compute capability: (8, 9)
[07/25/2024-14:40:30] [TRT-LLM] [I] SM count: 128
[07/25/2024-14:40:30] [TRT-LLM] [I] SM clock: 3120 MHz
[07/25/2024-14:40:30] [TRT-LLM] [I] int4 TFLOPS: 817
[07/25/2024-14:40:30] [TRT-LLM] [I] int8 TFLOPS: 408
[07/25/2024-14:40:30] [TRT-LLM] [I] fp8 TFLOPS: 408
[07/25/2024-14:40:30] [TRT-LLM] [I] float16 TFLOPS: 204
[07/25/2024-14:40:30] [TRT-LLM] [I] bfloat16 TFLOPS: 204
[07/25/2024-14:40:30] [TRT-LLM] [I] float32 TFLOPS: 102
[07/25/2024-14:40:30] [TRT-LLM] [I] Total Memory: 23 GiB
[07/25/2024-14:40:30] [TRT-LLM] [I] Memory clock: 10501 MHz
[07/25/2024-14:40:30] [TRT-LLM] [I] Memory bus width: 384
[07/25/2024-14:40:30] [TRT-LLM] [I] Memory bandwidth: 1008 GB/s
[07/25/2024-14:40:30] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[07/25/2024-14:40:30] [TRT-LLM] [I] PCIe link width: 16
[07/25/2024-14:40:30] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
Traceback (most recent call last):
File "/home/jet/miniforge3/envs/tensorrt-llm/bin/trtllm-build", line 8, in
File "/home/jet/miniforge3/envs/tensorrt-llm/lib/python3.10/site-packages/tensorrt_llm/plugin/plugin.py", line 79, in prop
field_value = getattr(self, storage_name)
AttributeError: 'PluginConfig' object has no attribute '_streamingllm'. Did you mean: '_streamingllm'?
I cannot reproduce this error. Can you share your command?
https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/mamba
Sorry, I started a new Python env and it works now. Thanks for that, I will close the issue.
Are there plans to support tp>1 @lfr-0531?
Coming soon.
How can I deploy a Mamba-based classifier, i.e. a PyTorch model that uses Mamba as the backbone but replaces the final layer? Is it enough to modify convert_checkpoint.py, or will I need to modify something else? Also, when I run the Triton server I can't seem to find a working config.pbtxt. Any pointers on which parameters Mamba requires? (I found I have to specify both gpt_model_path and engine_dir, which seems strange.)