Yineng Zhang

452 comments by Yineng Zhang

In TCP, the mechanism that uses the sliding-window algorithm is flow control, not congestion control.
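
A minimal sketch of the distinction, using simplified, hypothetical variables (not taken from the original comment): the sender is throttled by min(cwnd, rwnd); the rwnd term is the window advertised by the receiver to protect its own buffer, which is flow control, while cwnd reacts to network congestion.

```python
# Simplified model of a TCP sender's send limit (illustrative only).
# Flow control: rwnd is advertised by the receiver to protect its buffer.
# Congestion control: cwnd is maintained by the sender from network feedback.

def bytes_allowed_in_flight(cwnd: int, rwnd: int) -> int:
    """Effective send window is the minimum of the two mechanisms."""
    return min(cwnd, rwnd)

# The receiver shrinks rwnd as its buffer fills; that is sliding-window flow control.
receiver_buffer_size = 64 * 1024
unread_bytes = 48 * 1024
rwnd = receiver_buffer_size - unread_bytes  # 16 KiB advertised window

cwnd = 128 * 1024  # congestion window (grown/shrunk by slow start, AIMD, etc.)

print(bytes_allowed_in_flight(cwnd, rwnd))  # 16384, limited by flow control here
```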

> Hi TVM genius. I have the same issue. macOS version: macOS Monterey 12.4. TVM commit: main branch 8341e33d0. `mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Debug .. &&...

@Hzfengsy @YuchenJin Hi TVM genius, when will Relax be merged into the upstream TVM main branch?

Hi vLLM genius @WoosukKwon @zhuohan123 This is the latest progress from our team on quantization support for vLLM; we have done something similar to https://github.com/vllm-project/vllm/pull/1032 before. At that time, we...

Hi @guocuimi Thanks for your outstanding work. In addition to the performance comparison with vLLM, if possible, please consider adding [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [LMDeploy](https://github.com/InternLM/lmdeploy), [RTP-LLM](https://github.com/alibaba/rtp-llm), and [TGI](https://github.com/huggingface/text-generation-inference). And maybe we could use [vLLM...
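
A rough sketch of how such a cross-framework comparison could be driven, assuming each engine is launched separately behind an OpenAI-compatible `/v1/completions` endpoint; the URLs, ports, model name, and prompt workload below are placeholders, not details from the discussion.

```python
import time
import requests

# Hypothetical endpoints; each serving framework is assumed to be started beforehand.
ENDPOINTS = {
    "vLLM": "http://localhost:8000/v1/completions",
    "TensorRT-LLM": "http://localhost:8001/v1/completions",
    "LMDeploy": "http://localhost:8002/v1/completions",
}

PROMPTS = ["Explain the KV cache in one sentence."] * 32  # placeholder workload

def requests_per_second(url: str) -> float:
    """Naive sequential measurement; a real benchmark would use concurrency and real datasets."""
    start = time.time()
    for prompt in PROMPTS:
        requests.post(
            url,
            json={"model": "placeholder-model", "prompt": prompt, "max_tokens": 128},
            timeout=300,
        )
    return len(PROMPTS) / (time.time() - start)

for name, url in ENDPOINTS.items():
    print(f"{name}: {requests_per_second(url):.2f} req/s")
```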

Hi @guocuimi Could you use GitHub Actions to release the Python package? Please consider supporting CUDA 11.8 and CUDA 12.2, which would make it more convenient for users. At...

Hi @LiuXiaoxuanPKU Great work! After switching to the new backend, has there been any performance improvement compared to before? Have you conducted any relevant benchmarks? Thanks.

Hi @LiuXiaoxuanPKU Is FlashInfer currently enabled by default? After testing throughput on the ShareGPT dataset, I saw no significant improvement on vLLM, and the gap with LMDeploy is still...

> Hi @zhyncs, thanks for the interest and benchmarking, several things here:
>
> FlashInfer is not turned on by default, it can only be enabled with environment variable `VLLM_ATTENTION_BACKEND=FLASHINFER`....
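
For reference, a minimal sketch of enabling that backend via the environment variable mentioned above; the model name is a placeholder, and this assumes a vLLM installation where the FlashInfer backend is available.

```python
import os

# Select the FlashInfer attention backend before the engine is constructed.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Placeholder model; any model supported by the installed vLLM build works here.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(max_tokens=64)

outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```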

> What's the reason it is not supported in this PR?

The internal inference implementation supports MLA. The implementation on vLLM is **more about making it support quickly and matching...