Yineng Zhang

452 comments by Yineng Zhang

In TCP, the mechanism that uses the sliding-window algorithm is flow control, not congestion control.
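
A minimal sketch of the distinction, using simplified, hypothetical variables (not taken from the original comment): the sender is throttled by min(cwnd, rwnd); the rwnd term is the window advertised by the receiver to protect its own buffer, which is flow control, while cwnd reacts to network congestion.

```python
# Simplified model of a TCP sender's send limit (illustrative only).
# Flow control: rwnd is advertised by the receiver to protect its buffer.
# Congestion control: cwnd is maintained by the sender from network feedback.

def bytes_allowed_in_flight(cwnd: int, rwnd: int) -> int:
    """Effective send window is the minimum of the two mechanisms."""
    return min(cwnd, rwnd)

# The receiver shrinks rwnd as its buffer fills; that is sliding-window flow control.
receiver_buffer_size = 64 * 1024
unread_bytes = 48 * 1024
rwnd = receiver_buffer_size - unread_bytes  # 16 KiB advertised window

cwnd = 128 * 1024  # congestion window (grown/shrunk by slow start, AIMD, etc.)

print(bytes_allowed_in_flight(cwnd, rwnd))  # 16384, limited by flow control here
```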

> Hi TVM genius. I have the same issue. macOS version: macOS Monterey 12.4. TVM commit: main branch 8341e33d0. `mkdir build && cd build && cmake -DCMAKE_BUILD_TYPE=Debug .. &&...

@Hzfengsy @YuchenJin Hi TVM genius, when will Relax be merged into the upstream TVM main branch?

Hi vLLM genius @WoosukKwon @zhuohan123 This is the latest progress from our team on quantization support for vLLM; we have done something similar to https://github.com/vllm-project/vllm/pull/1032 before. At that time, we...

Hi @guocuimi Thanks for your outstanding work. In addition to the performance comparison with vLLM, if possible, please consider adding [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [LMDeploy](https://github.com/InternLM/lmdeploy), [RTP-LLM](https://github.com/alibaba/rtp-llm), and [TGI](https://github.com/huggingface/text-generation-inference). And maybe we could use [vLLM...
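
A rough sketch of how such a cross-framework comparison could be driven, assuming each engine is launched separately behind an OpenAI-compatible `/v1/completions` endpoint; the URLs, ports, model name, and prompt workload below are placeholders, not details from the discussion.

```python
import time
import requests

# Hypothetical endpoints; each serving framework is assumed to be started beforehand.
ENDPOINTS = {
    "vLLM": "http://localhost:8000/v1/completions",
    "TensorRT-LLM": "http://localhost:8001/v1/completions",
    "LMDeploy": "http://localhost:8002/v1/completions",
}

PROMPTS = ["Explain the KV cache in one sentence."] * 32  # placeholder workload

def requests_per_second(url: str) -> float:
    """Naive sequential measurement; a real benchmark would use concurrency and real datasets."""
    start = time.time()
    for prompt in PROMPTS:
        requests.post(
            url,
            json={"model": "placeholder-model", "prompt": prompt, "max_tokens": 128},
            timeout=300,
        )
    return len(PROMPTS) / (time.time() - start)

for name, url in ENDPOINTS.items():
    print(f"{name}: {requests_per_second(url):.2f} req/s")
```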

Hi @guocuimi Could you use GitHub Actions to release the Python package? Please consider supporting CUDA 11.8 and CUDA 12.2, which would make it more convenient for users. At...

Hi @LiuXiaoxuanPKU Great work! After switching to the new backend, has there been any performance improvement compared to before? Have you conducted any relevant benchmarks? Thanks.

Hi @LiuXiaoxuanPKU Is FlashInfer currently enabled by default? After testing throughput on the ShareGPT dataset, I saw no significant improvement on vLLM, and the gap with LMDeploy is still...

> Hi @zhyncs, thanks for the interest and benchmarking, several things here:
>
> FlashInfer is not turned on by default, it can only be enabled with environment variable `VLLM_ATTENTION_BACKEND=FLASHINFER`....
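
For reference, a minimal sketch of enabling that backend via the environment variable mentioned above; the model name is a placeholder, and this assumes a vLLM installation where the FlashInfer backend is available.

```python
import os

# Select the FlashInfer attention backend before the engine is constructed.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

# Placeholder model; any model supported by the installed vLLM build works here.
llm = LLM(model="meta-llama/Llama-2-7b-hf")
params = SamplingParams(max_tokens=64)

outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```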

> What's the reason it is not supported in this PR?

The internal inference implementation supports MLA. The implementation on vLLM is **more about making it support quickly and matching...