youki sada
@byshiue Thank you for your reply. For generating the 1st output token, the Tensor Core (TC) utilization of LLaMA with int4 weight quantization is lower than that of fp16 and also lower than that of general CNN models. I assume...
> what's the meaning of "utilized weight reuse"?

I meant that the computational intensity is high in first-token inference. Thus, I assume the DRAM access of the int4 first-token inference should be reduced...
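To make the weight-reuse point concrete, here is a rough back-of-the-envelope model (hypothetical layer sizes, not measured from LLaMA) of the arithmetic intensity of one GEMM: in prefill, many tokens reuse the same weight matrix, so weight bytes are amortized and int4 weights cut DRAM traffic; in decode (one token), the GEMM is memory-bound either way:

```python
def arithmetic_intensity(m, k, n, act_bytes=2.0, w_bytes=2.0):
    """FLOPs per byte of DRAM traffic for an (m x k) @ (k x n) GEMM.

    Assumes activations are read/written once (fp16 by default) and the
    weight matrix is streamed from DRAM once. This is a simplification
    that ignores caching effects.
    """
    flops = 2.0 * m * k * n                       # multiply-accumulate
    traffic = (m * k + m * n) * act_bytes + k * n * w_bytes
    return flops / traffic

# Hypothetical hidden size 4096; 512-token prefill vs. 1-token decode.
prefill_fp16 = arithmetic_intensity(512, 4096, 4096)
prefill_int4 = arithmetic_intensity(512, 4096, 4096, w_bytes=0.5)
decode_fp16 = arithmetic_intensity(1, 4096, 4096)
```

Under these assumptions the prefill GEMM has an intensity of several hundred FLOPs/byte (compute-bound), while decode is near 1 FLOP/byte (memory-bound), and int4 weights raise prefill intensity only modestly because weight traffic is already a small fraction of the total there.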
I also request this support. Adding the following entry to `MODEL_MAP` didn't work:

```python
'Qwen2_5_VLForConditionalGeneration': QWenForCausalLM,
```

https://github.com/NVIDIA/TensorRT-LLM/blob/258c7540c03517def55d9a5aadfa9288af474e1b/tensorrt_llm/models/__init__.py#L179

```
[TensorRT-LLM] TensorRT-LLM version: 0.17.0
~~~ omit ~~~
[03/03/2025-04:54:10] [TRT-LLM] [I] Set nccl_plugin...
```