bleedingfight
@byshiue I have seen a significant decrease in the accuracy of the output results of TRT on my test set. I would like to know how you have determined that...
@byshiue Thank you very much for your reply, and I'm sorry for the delay. In the past few days I have been trying to provide a Docker image and a minimal reproduction...
@byshiue I modified temperature=1e-6 as you suggested, but I found that every inference after the first one that produces output fails: `self.tokenizer.batch_decode(output_ids[0, :, input_lengths[0]:])` raises an error:...
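The slicing in that call can be sketched in plain Python. This is a minimal, hypothetical stand-in (`DummyTokenizer` is not a real API; shapes follow the TRT-LLM example runner, where `output_ids` is indexed as `[batch][beam][token]` and `input_lengths` holds each sample's prompt length):

```python
# Hedged sketch: trim the prompt tokens before decoding, so only the
# generated continuation is passed to batch_decode.
class DummyTokenizer:
    """Stand-in for a real tokenizer: maps ids back to placeholder strings."""
    def batch_decode(self, ids):
        return [" ".join(f"tok{i}" for i in row) for row in ids]

tokenizer = DummyTokenizer()

input_lengths = [3]                 # prompt length for sample 0
output_ids = [[list(range(10))]]    # [batch=1][beam=1][total_len=10]

# Keep only tokens generated after the prompt for sample 0, all beams:
generated = [beam[input_lengths[0]:] for beam in output_ids[0]]
print(tokenizer.batch_decode(generated))  # ['tok3 tok4 tok5 tok6 tok7 tok8 tok9']
```

If `input_lengths[0]` exceeds the actual sequence length (e.g. when generation produced nothing), the slice is empty, which is one common way this call ends up erroring or decoding garbage.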
@byshiue [here](https://drive.google.com/file/d/1sVB2PmawwY9810s24pWGCaztAUyFHluf/view?usp=sharing)
@byshiue The readme.md is in the zip. I just start a webserver for llava-trt to process the result.
@Barry-Delaney What I want to know is how to construct this dataset. My training data is image-text pairs, but I only quantize the LLM, so theoretically I should only...
> I also face the same issue. The reason may be as follows: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/llm_ptq/README.md#model-support-list

Thanks for your reply. I have already used vLLM with AutoAWQ.
> [@bleedingfight](https://github.com/bleedingfight) , thank you for the update. Just to confirm my understanding: after using vLLM with AutoAWQ and the same models, you’re no longer seeing the issue you reported...
@QiJune My model is a multimodal model, which is slightly different from a pure LLM. The difference is that the input to the LLM is not `input_ids` but a relatively long `input_embeds`....
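The difference described above can be sketched with illustrative shapes (this is a hedged example, not the actual model's code; the embedding size and sequence lengths are made up): instead of token ids, the LLM receives a `[batch, seq, hidden]` tensor where visual features are spliced in ahead of the text embeddings.

```python
# Hedged sketch: building input_embeds for a multimodal model by
# concatenating visual-encoder features with text token embeddings.
import torch

hidden = 8
embed = torch.nn.Embedding(100, hidden)   # text embedding table
text_ids = torch.tensor([[5, 7, 9]])      # [batch=1, text_len=3]
image_feats = torch.randn(1, 4, hidden)   # visual encoder output, 4 "image tokens"

text_embeds = embed(text_ids)             # [1, 3, hidden]
input_embeds = torch.cat([image_feats, text_embeds], dim=1)
print(input_embeds.shape)                 # torch.Size([1, 7, 8])
```

This is why the effective sequence is "relatively long": the image contributes extra positions that never existed as token ids, so any API that assumes an `input_ids` entry point needs a separate path for embeddings.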
@BarrinXu How can I replace it?

```cpp
void Buffer::low_latency_query_mask_buffer(const torch::Tensor& mask_status) {
#ifndef DISABLE_NVSHMEM
    EP_HOST_ASSERT(mask_buffer_ptr != nullptr and "Shrink mode must be enabled");
    EP_HOST_ASSERT(mask_status.numel() == num_ranks && mask_status.scalar_type() == torch::kInt32);
    internode_ll::query_mask_buffer(...
```