Ricardo Lu
Can vLLM achieve performance comparable to FasterTransformer on the inference side? Just curious about the detailed optimizations you've done and the goals you want to achieve. BTW, vLLM really accelerates...
Adapted from https://github.com/lm-sys/FastChat/blob/v0.2.14/fastchat/serve/openai_api_server.py. Tested on vicuna-7b-v1.3 and WizardCoder.
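For reference, a minimal sketch of exercising such an OpenAI-compatible endpoint with plain `requests`; the base URL, port, and model name are placeholders, not values taken from the report above:

```python
import requests

# Placeholder endpoint/model name; point this at wherever the adapted
# openai_api_server.py is actually serving.
BASE_URL = "http://localhost:8000/v1"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": "vicuna-7b-v1.3",
        "messages": [{"role": "user", "content": "Write a hello-world in Python."}],
        "temperature": 0.7,
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```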
Right now vLLM allocates 90% of GPU memory on each accessible GPU card, but when the server is launched with an AWQ model, the behavior becomes unpredictable. I run an AWQ-format...
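The excerpt is truncated, but for context: vLLM exposes both the memory fraction and the quantization mode as engine arguments, so lowering the fraction is one way to probe whether the AWQ path misbehaves under memory pressure. A minimal sketch, where the checkpoint path and the 0.5 fraction are placeholders:

```python
from vllm import LLM, SamplingParams

# gpu_memory_utilization defaults to 0.9 (the "90% of each GPU" behavior
# mentioned above). The model path below is a placeholder for a local
# AWQ-quantized checkpoint.
llm = LLM(
    model="/path/to/awq-model",
    quantization="awq",
    gpu_memory_utilization=0.5,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```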
### Motivation In Code Llama's deployment tutorial, the quantization chapter remains to be done; when will this feature be finished? ### Related resources _No response_ ### Additional context _No response_
## ❓ General Questions Every DecodeStep() calls [SampleTokenFromLogits()](https://github.com/mlc-ai/mlc-llm/blob/3d25d9da762aab7cd89bfffb8b310f515b2ddabb/cpp/llm_chat.cc#L1208) to sample from the logits, and it reads the generation config each time, which may become a bottleneck on devices with a weak CPU...
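To illustrate the concern with a generic Python sketch (not mlc-llm's actual C++ code path): re-parsing the generation config inside the per-token loop adds a fixed CPU cost to every decode step, whereas hoisting it out of the loop pays that cost once.

```python
import json
import time

# Hypothetical per-request generation config, parsed either per step or once.
GEN_CONFIG_JSON = '{"temperature": 0.7, "top_p": 0.95, "repetition_penalty": 1.1}'

def decode_naive(num_steps: int) -> None:
    for _ in range(num_steps):
        cfg = json.loads(GEN_CONFIG_JSON)   # parsed on every decode step
        _ = cfg["temperature"]              # ... sampling would happen here

def decode_cached(num_steps: int) -> None:
    cfg = json.loads(GEN_CONFIG_JSON)       # parsed once, outside the loop
    for _ in range(num_steps):
        _ = cfg["temperature"]

for fn in (decode_naive, decode_cached):
    start = time.perf_counter()
    fn(10_000)
    print(f"{fn.__name__}: {time.perf_counter() - start:.4f}s")
```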
**Description** When running inference with `response = await client.infer()`, it takes a long time for the Triton server to release the output....
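The excerpt is truncated, but a timing sketch along these lines can help separate the `infer()` round-trip from the output materialization. This is a sketch assuming the async gRPC client from `tritonclient`; the model name, tensor names, shapes, and dtype are placeholders:

```python
import asyncio
import time

import numpy as np
import tritonclient.grpc.aio as grpcclient

async def main() -> None:
    # Placeholder model/tensor names and shapes; substitute the real ones.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    data = np.random.rand(1, 16).astype(np.float32)

    inputs = [grpcclient.InferInput("INPUT0", list(data.shape), "FP32")]
    inputs[0].set_data_from_numpy(data)
    outputs = [grpcclient.InferRequestedOutput("OUTPUT0")]

    t0 = time.perf_counter()
    response = await client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
    t1 = time.perf_counter()
    result = response.as_numpy("OUTPUT0")   # time spent materializing the output
    t2 = time.perf_counter()

    print(f"infer: {t1 - t0:.3f}s, output fetch: {t2 - t1:.3f}s, shape={result.shape}")
    await client.close()

asyncio.run(main())
```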
## 🐛 Bug ## To Reproduce Steps to reproduce the behavior: I compiled my model with the following commands: ```shell mlc_llm convert_weight /home/tsbj/rubik_v0.0.0.25/ --quantization q4f16_1 -o mtk-weights/rubik_v0.0.0.25 mlc_llm gen_config /home/tsbj/rubik_v0.0.0.25/ --quantization...
### Search before asking - [x] I have searched the jetson-containers [issues](https://github.com/dusty-nv/jetson-containers/issues) and found no similar feature requests. ### Question Hi, thanks for your amazing work. I have noticed that...
### Search before asking - [x] I have searched the jetson-containers [issues](https://github.com/dusty-nv/jetson-containers/issues) and found no similar feature requests. ### jetson-containers Component _No response_ ### Bug TensorRT-LLM v0.12.0-jetson requires diffusers>=0.27.0, however...
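A quick way to check the installed version against that requirement; a small sketch that assumes `diffusers` and `packaging` are importable in the container's Python environment:

```python
# Compare the installed diffusers version against the >=0.27.0 requirement
# mentioned above.
from importlib.metadata import version

from packaging.version import Version

installed = Version(version("diffusers"))
required = Version("0.27.0")
status = "satisfied" if installed >= required else "NOT satisfied"
print(f"diffusers {installed} installed; requirement >=0.27.0 {status}")
```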