Ricardo Lu
Can vLLM achieve performance comparable to FasterTransformer on the inference side? Just curious about the detailed optimizations you've done and the goals you want to achieve. BTW, vLLM really accelerates...
Adapted from https://github.com/lm-sys/FastChat/blob/v0.2.14/fastchat/serve/openai_api_server.py. Tested on vicuna-7b-v1.3 and WizardCoder.
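For reference, a minimal sketch of exercising such an OpenAI-compatible endpoint with plain `requests`; the base URL, port, and model name are placeholders, not values taken from the report above:

```python
import requests

# Placeholder endpoint/model name; point this at wherever the adapted
# openai_api_server.py is actually serving.
BASE_URL = "http://localhost:8000/v1"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "vicuna-7b-v1.3",
        "messages": [{"role": "user", "content": "Write a hello-world in Python."}],
        "temperature": 0.7,
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```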
Right now vLLM allocates 90% of GPU memory on each accessible GPU card, but when the server is launched with an AWQ model, the behavior becomes unpredictable. I run an AWQ-format...
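The excerpt is truncated, but for context: vLLM exposes both the memory fraction and the quantization mode as engine arguments, so lowering the fraction is one way to probe whether the AWQ path misbehaves under memory pressure. A minimal sketch, where the checkpoint path and the 0.5 fraction are placeholders:

```python
from vllm import LLM, SamplingParams

# gpu_memory_utilization defaults to 0.9 (the "90% of each GPU" behavior
# mentioned above). The model path below is a placeholder for a local
# AWQ-quantized checkpoint.
llm = LLM(
    model="/path/to/awq-model",
    quantization="awq",
    gpu_memory_utilization=0.5,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```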
### Motivation In Code Llama's deployment tutorial, the quantization chapter remains to be done; when will this feature be finished? ### Related resources _No response_ ### Additional context _No response_
## ❓ General Questions Every DecodeStep() calls [SampleTokenFromLogits()](https://github.com/mlc-ai/mlc-llm/blob/3d25d9da762aab7cd89bfffb8b310f515b2ddabb/cpp/llm_chat.cc#L1208) to sample from the logits, and it reads the generation config each time, which may become a bottleneck on devices with a weak CPU...
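To illustrate the concern with a generic Python sketch (not mlc-llm's actual C++ code path): re-parsing the generation config inside the per-token loop adds a fixed CPU cost to every decode step, whereas hoisting it out of the loop pays that cost once.

```python
import json
import time

# Hypothetical per-request generation config, parsed either per step or once.
GEN_CONFIG_JSON = '{"temperature": 0.7, "top_p": 0.95, "repetition_penalty": 1.1}'

def decode_naive(num_steps: int) -> None:
    for _ in range(num_steps):
        cfg = json.loads(GEN_CONFIG_JSON)   # parsed on every decode step
        _ = cfg["temperature"]              # ... sampling would happen here

def decode_cached(num_steps: int) -> None:
    cfg = json.loads(GEN_CONFIG_JSON)       # parsed once, outside the loop
    for _ in range(num_steps):
        _ = cfg["temperature"]

for fn in (decode_naive, decode_cached):
    start = time.perf_counter()
    fn(10_000)
    print(f"{fn.__name__}: {time.perf_counter() - start:.4f}s")
```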
**Description** When running inference with `response = await client.infer()`, it takes a long time for the Triton server to release the output....
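The excerpt is truncated, but a timing sketch along these lines can help separate the `infer()` round-trip from the output materialization. This is a sketch assuming the async gRPC client from `tritonclient`; the model name, tensor names, shapes, and dtype are placeholders:

```python
import asyncio
import time

import numpy as np
import tritonclient.grpc.aio as grpcclient

async def main() -> None:
    # Placeholder model/tensor names and shapes; substitute the real ones.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    data = np.random.rand(1, 16).astype(np.float32)

    inputs = [grpcclient.InferInput("INPUT0", list(data.shape), "FP32")]
    inputs[0].set_data_from_numpy(data)
    outputs = [grpcclient.InferRequestedOutput("OUTPUT0")]

    t0 = time.perf_counter()
    response = await client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
    t1 = time.perf_counter()
    result = response.as_numpy("OUTPUT0")   # time spent materializing the output
    t2 = time.perf_counter()

    print(f"infer: {t1 - t0:.3f}s, output fetch: {t2 - t1:.3f}s, shape={result.shape}")
    await client.close()

asyncio.run(main())
```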
## 🐛 Bug ## To Reproduce Steps to reproduce the behavior: I compiled my model with the following commands: ```shell mlc_llm convert_weight /home/tsbj/rubik_v0.0.0.25/ --quantization q4f16_1 -o mtk-weights/rubik_v0.0.0.25 mlc_llm gen_config /home/tsbj/rubik_v0.0.0.25/ --quantization...
### Search before asking - [x] I have searched the jetson-containers [issues](https://github.com/dusty-nv/jetson-containers/issues) and found no similar feature requests. ### Question Hi, thanks for your amazing work. I have noticed that...
### Search before asking - [x] I have searched the jetson-containers [issues](https://github.com/dusty-nv/jetson-containers/issues) and found no similar feature requests. ### jetson-containers Component _No response_ ### Bug TensorRT-LLM v0.12.0-jetson requires diffusers>=0.27.0, however...
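A quick way to check the installed version against that requirement; a small sketch that assumes `diffusers` and `packaging` are importable in the container's Python environment:

```python
# Compare the installed diffusers version against the >=0.27.0 requirement
# mentioned above.
from importlib.metadata import version

from packaging.version import Version

installed = Version(version("diffusers"))
required = Version("0.27.0")
status = "satisfied" if installed >= required else "NOT satisfied"
print(f"diffusers {installed} installed; requirement >=0.27.0 {status}")
```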