bleedingfight
@byshiue I have seen a significant decrease in the accuracy of the output results of TRT on my test set. I would like to know how you have determined that...
@byshiue Thank you very much for your reply, and I'm sorry for the delay. In the past few days I have been trying to provide a Docker image and a minimal reproduction...
@byshiue I modified temperature=1e-6 as you suggested, but I found that every inference after the first one that produces output fails: `self.tokenizer.batch_decode(output_ids[0, :, input_lengths[0]:])` raises an error:...
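The slicing in that call can be sketched in plain Python. This is a minimal, hypothetical stand-in (`DummyTokenizer` is not a real API; shapes follow the TRT-LLM example runner, where `output_ids` is indexed as `[batch][beam][token]` and `input_lengths` holds each sample's prompt length):

```python
# Hedged sketch: trim the prompt tokens before decoding, so only the
# generated continuation is passed to batch_decode.
class DummyTokenizer:
    """Stand-in for a real tokenizer: maps ids back to placeholder strings."""
    def batch_decode(self, ids):
        return [" ".join(f"tok{i}" for i in row) for row in ids]

tokenizer = DummyTokenizer()

input_lengths = [3]                 # prompt length for sample 0
output_ids = [[list(range(10))]]    # [batch=1][beam=1][total_len=10]

# Keep only tokens generated after the prompt for sample 0, all beams:
generated = [beam[input_lengths[0]:] for beam in output_ids[0]]
print(tokenizer.batch_decode(generated))  # ['tok3 tok4 tok5 tok6 tok7 tok8 tok9']
```

If `input_lengths[0]` exceeds the actual sequence length (e.g. when generation produced nothing), the slice is empty, which is one common way this call ends up erroring or decoding garbage.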
@byshiue [here](https://drive.google.com/file/d/1sVB2PmawwY9810s24pWGCaztAUyFHluf/view?usp=sharing)
@byshiue The readme.md is in the zip. I just start a webserver for llava-trt to process the result.
@Barry-Delaney What I want to know is how to construct this dataset. My training data is image-text pairs, but I only quantize the LLM, so theoretically I should only...
> I also face the same issue. The reason may be as follows: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/llm_ptq/README.md#model-support-list

Thanks for your reply. I have already used vLLM with AutoAWQ.
> [@bleedingfight](https://github.com/bleedingfight) , thank you for the update. Just to confirm my understanding: after using vLLM with AutoAWQ and the same models, you’re no longer seeing the issue you reported...
@QiJune My model is a multimodal model, which is slightly different from a pure LLM. The difference is that the input to the LLM is not `input_ids` but a relatively long `input_embeds`....
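The difference described above can be sketched with illustrative shapes (this is a hedged example, not the actual model's code; the embedding size and sequence lengths are made up): instead of token ids, the LLM receives a `[batch, seq, hidden]` tensor where visual features are spliced in ahead of the text embeddings.

```python
# Hedged sketch: building input_embeds for a multimodal model by
# concatenating visual-encoder features with text token embeddings.
import torch

hidden = 8
embed = torch.nn.Embedding(100, hidden)   # text embedding table
text_ids = torch.tensor([[5, 7, 9]])      # [batch=1, text_len=3]
image_feats = torch.randn(1, 4, hidden)   # visual encoder output, 4 "image tokens"

text_embeds = embed(text_ids)             # [1, 3, hidden]
input_embeds = torch.cat([image_feats, text_embeds], dim=1)
print(input_embeds.shape)                 # torch.Size([1, 7, 8])
```

This is why the effective sequence is "relatively long": the image contributes extra positions that never existed as token ids, so any API that assumes an `input_ids` entry point needs a separate path for embeddings.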
@BarrinXu How can I replace it?

```cpp
void Buffer::low_latency_query_mask_buffer(const torch::Tensor& mask_status) {
#ifndef DISABLE_NVSHMEM
    EP_HOST_ASSERT(mask_buffer_ptr != nullptr and "Shrink mode must be enabled");
    EP_HOST_ASSERT(mask_status.numel() == num_ranks && mask_status.scalar_type() == torch::kInt32);
    internode_ll::query_mask_buffer(...
```