
How to build the Mistral model using BF16

Open plt12138 opened this issue 1 year ago • 7 comments

I want to build the Mistral model using AWQ and BF16.

python3 ../quantization/quantize.py --model_dir dolphin-2.6-mistral-7b-sft-yhy   \ 
        --dtype bfloat16   --qformat int4_awq    --awq_block_size 128  \
        --output_dir ./quantized_int4-awq-bf16      --calib_size 32

At first I used this command to build:

trtllm-build --checkpoint_dir ./quantized_int4-awq-bf16  --output_dir ./trt_engines/int4_AWQ/1-gpu/  \
       --gemm_plugin bfloat16  --gpt_attention_plugin bfloat16  \
       --max_batch_size 8  --max_input_len 16384 --max_output_len 4096

I get this error:

...
[03/12/2024-07:19:07] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/31/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[03/12/2024-07:19:07] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/ln_f/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[03/12/2024-07:19:07] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[03/12/2024-07:19:07] [TRT] [W] Unused Input: position_ids
[03/12/2024-07:19:07] [TRT] [E] 4: [network.cpp::validate::3516] Error Code 4: Internal Error (fp16 precision has been set for a layer or layer output, but fp16 is not configured in the builder)
[03/12/2024-07:19:07] [TRT-LLM] [E] Engine building failed, please check the error log.
[03/12/2024-07:19:07] [TRT-LLM] [I] Serializing engine to ./trt_engines/int4_AWQ/1-gpu-bf16/rank0.engine...
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 497, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 420, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 399, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/engine.py", line 60, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/engine.py", line 18, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'

Then I tried adding --strongly_typed, but it still failed:

trtllm-build --checkpoint_dir ./quantized_int4-awq-bf16/  --output_dir ./trt_engines/int4_AWQ/1-gpu-bf16/ \
        --gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --max_batch_size 8  \
        --max_input_len 16384 --max_output_len 4096  \
        --strongly_typed

[TensorRT-LLM] TensorRT-LLM version: 0.8.0[03/12/2024-04:06:34] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set lookup_plugin to None.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set lora_plugin to None.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set context_fmha to True.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set paged_kv_cache to True.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set remove_input_padding to True.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set multi_block_mode to False.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set enable_xqa to True.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set tokens_per_block to 128.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[03/12/2024-04:06:34] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[03/12/2024-04:06:34] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len. 
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[03/12/2024-04:07:00] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 4061, GPU 262 (MiB)
[03/12/2024-04:07:02] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1799, GPU +312, now: CPU 5996, GPU 574 (MiB)
[03/12/2024-04:07:02] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to bfloat16.
[03/12/2024-04:07:02] [TRT-LLM] [I] Set nccl_plugin to None.
[03/12/2024-04:07:02] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[03/12/2024-04:07:02] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[03/12/2024-04:07:02] [TRT] [W] Unused Input: position_ids
[03/12/2024-04:07:02] [TRT] [W] Detected layernorm nodes in FP16.
[03/12/2024-04:07:02] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[03/12/2024-04:07:02] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[03/12/2024-04:07:02] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6024, GPU 596 (MiB)
[03/12/2024-04:07:02] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 6026, GPU 606 (MiB)
[03/12/2024-04:07:02] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[03/12/2024-04:07:02] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[03/12/2024-04:07:02] [TRT] [E] 9: LLaMAForCausalLM/transformer/layers/0/attention/qkv/PLUGIN_V2_WeightOnlyGroupwiseQuantMatmul_0: could not find any supported formats consistent with input/output data types
[03/12/2024-04:07:02] [TRT] [E] 9: [pluginV2Builder.cpp::reportPluginError::24] Error Code 9: Internal Error (LLaMAForCausalLM/transformer/layers/0/attention/qkv/PLUGIN_V2_WeightOnlyGroupwiseQuantMatmul_0: could not find any supported formats consistent with input/output data types)
[03/12/2024-04:07:02] [TRT-LLM] [E] Engine building failed, please check the error log.
[03/12/2024-04:07:03] [TRT-LLM] [I] Serializing engine to ./trt_engines/int4_AWQ/1-gpu-bf16/rank0.engine...
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 497, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 420, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 399, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/engine.py", line 60, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/engine.py", line 18, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'

plt12138 avatar Mar 12 '24 07:03 plt12138

TensorRT-LLM: v0.8.0

plt12138 avatar Mar 12 '24 09:03 plt12138

I have found that the inference speed of FP16 Mistral is not very fast. I am using an H100 machine, and the speed is far below my expectations. What inference speed are you seeing on your system?

BaiMoHan avatar Mar 13 '24 09:03 BaiMoHan

I have found that the inference speed of FP16 Mistral is not very fast. I am using an H100 machine, and the speed is far below my expectations. What inference speed are you seeing on your system?

I tested it on a single 3090. Following the example in the llama README.md, Mistral FP16 outputs about 50 tokens/s. Many options affect performance, such as --max_batch_size, --max_input_len, --max_output_len, and the quantization mode: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.8.0/examples/llama/README.md#mistral-v01 The official performance data is here: https://github.com/NVIDIA/TensorRT-LLM/blob/v0.8.0/docs/source/performance.md
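For reference, a rough sketch of how I measured tokens/s, following the Mistral example in that README (the tokenizer and engine paths are placeholders for my local setup, and the --max_attention_window_size value is what the README suggests for Mistral's sliding-window attention, if I recall correctly):

python3 ../run.py --max_output_len 50 \
        --tokenizer_dir ./dolphin-2.6-mistral-7b-sft-yhy \
        --engine_dir ./trt_engines/fp16/1-gpu/ \
        --max_attention_window_size 4096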

plt12138 avatar Mar 14 '24 03:03 plt12138

Hi @plt12138, this is a known bug in the v0.8.0 release. It has been fixed on the recent main branch. Could you please try it?
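For example, one way to pick up the main-branch fix without building from source is the pre-release dev wheel; the command below is taken from the installation docs, but please verify the index URL for your environment:

pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com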

nekorobov avatar Mar 14 '24 22:03 nekorobov

Hi @plt12138, this is a known bug in the v0.8.0 release. It has been fixed on the recent main branch. Could you please try it?

TensorRT-LLM: 0.9.0.dev2024031900. Confirmed: with only --strongly_typed it still failed with the same error. However, when using both --gpt_attention_plugin bfloat16 and --strongly_typed at the same time, it worked.
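For anyone else hitting this: under the newer version, the same build command as earlier in the thread, with both flags, is roughly what worked (paths and length limits copied from the commands above; adjust to your setup):

trtllm-build --checkpoint_dir ./quantized_int4-awq-bf16/  --output_dir ./trt_engines/int4_AWQ/1-gpu-bf16/ \
        --gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --max_batch_size 8  \
        --max_input_len 16384 --max_output_len 4096  \
        --strongly_typed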

jaywongs avatar Mar 26 '24 02:03 jaywongs

Hello, would you mind spending some time testing the length_penalty parameter? In my case, length_penalty has no effect with Mistral. I'm not sure if the bug is in my code or in TensorRT.🥺🥺🥺

BaiMoHan avatar Mar 26 '24 07:03 BaiMoHan

Hello, would you mind spending some time testing the length_penalty parameter? In my case, length_penalty has no effect with Mistral. I'm not sure if the bug is in my code or in TensorRT.🥺🥺🥺

Sorry, I am not using Mistral, but Code Llama 70B. There may be some differences between these two models in TensorRT.

jaywongs avatar Mar 26 '24 08:03 jaywongs

Hi @plt12138, do you still have any further issues or questions? If not, we'll close this soon.

nv-guomingz avatar Nov 14 '24 07:11 nv-guomingz