Dong-Yong Lee
Checklist: * [ ] I've included steps to reproduce the bug. * [ ] I've included the version of Argo Rollouts. **Describe the bug** We are building a canary deployment...
This PR implements the feature of generating text from embedding input (popularly known as inputs_embeds). This is related to https://github.com/vllm-project/vllm/issues/369 and https://github.com/vllm-project/vllm/issues/416. More to do - [x] Enhance test codes...
Hello TensorRT-LLM team, I have been testing the performance of int8_kv_cache combined with weight_only (int8) on the llama-2-7b model (tested with TensorRT-LLM release v0.7.1). The node contains 2 t4...