plt12138

9 comments of plt12138

I hit the same problem with the v0.8.0 tag. Setup: 4x RTX 4090, Mixtral 8x7B int4.

Building with `--gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --strongly_typed` did not help: https://github.com/NVIDIA/TensorRT-LLM/issues/1273
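For context, this is roughly the build invocation in question; a minimal sketch, assuming a `trtllm-build`-based flow where the int4 checkpoint has already been converted (the `--checkpoint_dir` and `--output_dir` paths are placeholders, not the exact ones I used):

```bash
# Sketch only: the checkpoint/output paths are placeholders for an already
# converted Mixtral 8x7B int4 checkpoint; the plugin flags match the ones above.
trtllm-build \
    --checkpoint_dir ./mixtral-8x7b-int4-ckpt \
    --output_dir ./mixtral-8x7b-int4-engine \
    --gemm_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --strongly_typed
```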

I am not sure whether the problem is in the parameters I used when building the engine, or whether benchmark.py has not been updated for this version. Also see https://github.com/triton-inference-server/tensorrtllm_backend/issues/330

> There are two different ways of running models: python and cpp. `run.py` decides between the two here:
>
> https://github.com/NVIDIA/TensorRT-LLM/blob/728cc0044bb76d1fafbcaa720c403e8de4f81906/examples/run.py#L393
>
> The python way seems to be very...

> please set "tensorrt_llm_model_name" to "tensorrt_llm"; you do not need to touch tensorrt_llm_draft_model_name, unless you are interested in speculative decoding

Yes, the issue is resolved. Thanks.
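For anyone landing on this later, a minimal sketch of setting that parameter with the `fill_template.py` helper shipped in tensorrtllm_backend (the model-repository path is a placeholder, and the BLS model's `config.pbtxt` is assumed to follow the standard template):

```bash
# Sketch: point the BLS model at the engine served under the name
# "tensorrt_llm". triton_model_repo/ is a placeholder path to your
# Triton model repository.
python3 tools/fill_template.py -i \
    triton_model_repo/tensorrt_llm_bls/config.pbtxt \
    tensorrt_llm_model_name:tensorrt_llm
```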

Same error here.

- Triton: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
- tensorrtllm_backend: v0.8.0
- Model: Mixtral-8x7b

> I have found that the inference speed of FP16 Mistral is not very fast. I am using an H100 machine, and its speed is far below expectations. How is...