plt12138
I hit the same problem with the v0.8.0 tag. Setup: 4× RTX 4090, Mixtral 8x7B int4.
`--gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --strongly_typed` does not help: https://github.com/NVIDIA/TensorRT-LLM/issues/1273
I am not sure whether the problem is in the parameters I use when building the engine, or whether benchmark.py has not been updated for this version. Also see https://github.com/triton-inference-server/tensorrtllm_backend/issues/330
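For reference, this is a minimal sketch of the build invocation I mean (checkpoint and output paths are placeholders; the flags are the ones quoted above, as accepted by `trtllm-build` in v0.8.0):

```bash
# Sketch only: directory paths are placeholders for an int4 Mixtral checkpoint.
trtllm-build \
    --checkpoint_dir ./mixtral-8x7b-int4-ckpt \
    --output_dir ./mixtral-8x7b-engine \
    --gemm_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --strongly_typed
```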
> There are two different ways of running models: python and cpp. `run.py` decides between the two here:
>
> https://github.com/NVIDIA/TensorRT-LLM/blob/728cc0044bb76d1fafbcaa720c403e8de4f81906/examples/run.py#L393
>
> The python way seems to be very...
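Paraphrasing the dispatch at the linked line (not verbatim; the flag name is `--use_py_session` in the v0.8.0 examples), the selection looks roughly like:

```python
# Paraphrase of run.py's runner selection: the C++ session is used
# unless --use_py_session is passed on the command line.
from tensorrt_llm.runtime import ModelRunner, ModelRunnerCpp

use_py_session = False  # run.py derives this from the --use_py_session flag
runner_cls = ModelRunner if use_py_session else ModelRunnerCpp
runner = runner_cls.from_dir(engine_dir="./mixtral-8x7b-engine")  # placeholder path
```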
> Please set "tensorrt_llm_model_name" to "tensorrt_llm". You do not need to touch tensorrt_llm_draft_model_name unless you are interested in speculative decoding.

Yes, the issue is resolved. Thanks.
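For anyone who lands here with the same error: if I remember correctly, the parameter lives in the `tensorrt_llm_bls` model's `config.pbtxt` in the backend repo. A sketch of the stanza, assuming the default model names:

```
parameters: {
  key: "tensorrt_llm_model_name"
  value: { string_value: "tensorrt_llm" }
}
```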
Same error.
Triton: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
tensorrtllm_backend: v0.8.0
TensorRT-LLM: v0.8.0
Mixtral-8x7b
> I have found that the inference speed of FP16 Mistral is not very fast. I am using an H100 machine, and its speed is far below expectations. How is...