plt12138
I hit the same problem with the v0.8.0 tag. Setup: 4× RTX 4090, Mixtral 8x7B int4.
`--gemm_plugin bfloat16 --gpt_attention_plugin bfloat16 --strongly_typed` does not help: https://github.com/NVIDIA/TensorRT-LLM/issues/1273
I am not sure whether the problem is in the parameters I use when building the engine, or whether benchmark.py has not been updated for this version. Also see https://github.com/triton-inference-server/tensorrtllm_backend/issues/330
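For reference, this is a minimal sketch of the build invocation I mean (checkpoint and output paths are placeholders; the flags are the ones quoted above, as accepted by `trtllm-build` in v0.8.0):

```bash
# Sketch only: directory paths are placeholders for an int4 Mixtral checkpoint.
trtllm-build \
    --checkpoint_dir ./mixtral-8x7b-int4-ckpt \
    --output_dir ./mixtral-8x7b-engine \
    --gemm_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --strongly_typed
```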
> There are two different ways of running models: python and cpp. `run.py` decides between the two here:
>
> https://github.com/NVIDIA/TensorRT-LLM/blob/728cc0044bb76d1fafbcaa720c403e8de4f81906/examples/run.py#L393
>
> The python way seems to be very...
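Paraphrasing the dispatch at the linked line (not verbatim; the flag name is `--use_py_session` in the v0.8.0 examples), the selection looks roughly like:

```python
# Paraphrase of run.py's runner selection: the C++ session is used
# unless --use_py_session is passed on the command line.
from tensorrt_llm.runtime import ModelRunner, ModelRunnerCpp

use_py_session = False  # run.py derives this from the --use_py_session flag
runner_cls = ModelRunner if use_py_session else ModelRunnerCpp
runner = runner_cls.from_dir(engine_dir="./mixtral-8x7b-engine")  # placeholder path
```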
> Please set "tensorrt_llm_model_name" to "tensorrt_llm". You do not need to touch tensorrt_llm_draft_model_name unless you are interested in speculative decoding.

Yes, the issue is resolved. Thanks.
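For anyone who lands here with the same error: if I remember correctly, the parameter lives in the `tensorrt_llm_bls` model's `config.pbtxt` in the backend repo. A sketch of the stanza, assuming the default model names:

```
parameters: {
  key: "tensorrt_llm_model_name"
  value: { string_value: "tensorrt_llm" }
}
```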
Same error.
Triton: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
tensorrtllm_backend: v0.8.0
TensorRT-LLM: v0.8.0
Mixtral-8x7b
> I have found that the inference speed of FP16 Mistral is not very fast. I am using an H100 machine, and its speed is far below expectations. How is...