Sekri0
> We are developing a complete pipeline from pseudo-quantized models to real packed weights that directly executes WxAy quantized inference in Torch; it is expected to be released within a...
Thanks for the reply. I have one more question: in the end-to-end experiment, which kernel is used in the prefill phase of the w2a8 model?
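For context on what "real packed weights" means above, here is a minimal sketch, assuming a 2-bit (W2) layout where four weight codes are packed into each byte. The function names and packing order are illustrative assumptions, not the actual format the authors plan to release.

```python
import torch

def pack_w2(qweight: torch.Tensor) -> torch.Tensor:
    """Pack 2-bit integer codes (values 0..3) into a uint8 tensor,
    four codes per byte. Hypothetical layout: the last dimension
    of the input must be a multiple of 4."""
    assert qweight.shape[-1] % 4 == 0, "last dim must be divisible by 4"
    q = qweight.to(torch.uint8) & 0x3          # keep only the low 2 bits
    q = q.reshape(*qweight.shape[:-1], -1, 4)  # group codes in fours
    packed = (q[..., 0]
              | (q[..., 1] << 2)
              | (q[..., 2] << 4)
              | (q[..., 3] << 6))
    return packed.to(torch.uint8)

def unpack_w2(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_w2: expand each byte back into four 2-bit codes."""
    codes = torch.stack([(packed >> (2 * i)) & 0x3 for i in range(4)], dim=-1)
    return codes.reshape(*packed.shape[:-1], -1)

# Example: a pseudo-quantized weight whose values are already 2-bit codes.
w_codes = torch.randint(0, 4, (8, 16), dtype=torch.int8)
packed = pack_w2(w_codes)  # shape (8, 4), uint8, 4x smaller along the last dim
assert torch.equal(unpack_w2(packed).to(torch.int8), w_codes)
```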
I have added detailed environment information. The FunASR 1.2.0 I currently have installed already seems to be the latest version, and the error still occurs.
After installing the latest funasr 1.2.1 from source, the error is gone. Thank you very much.
@zachzzc @raywanb Sorry to bother you guys, could you please take a look at this problem?
> Can you provide a minimum script to reproduce your problem? @Sekri0

Sorry for the late reply; this issue occurs midway through the inference service, so I'm not...
> > Can you provide a minimum script to reproduce your problem? @Sekri0
>
> Sorry for the late reply; this issue occurs...
> [@liweiqing1997](https://github.com/liweiqing1997) Totally understand. We will try to quantize and fix this by next week. The bug is most likely that vLLM changes model parameter names, based on your stack...
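As an illustration of the parameter-name issue mentioned above (not vLLM's actual loading code), here is a hypothetical sketch of the kind of remapping a serving engine can apply when it fuses per-projection weights; quantization metadata keyed by the original names has to be remapped the same way or loading breaks. All parameter names below are made up for the example.

```python
import torch

def fuse_qkv(state_dict: dict) -> dict:
    """Hypothetical remap: concatenate q_proj / k_proj / v_proj weights
    into a single fused qkv_proj parameter, as some serving engines do."""
    fused = {}
    for name, tensor in state_dict.items():
        if ".q_proj." in name:
            base = name.replace(".q_proj.", ".qkv_proj.")
            k = state_dict[name.replace(".q_proj.", ".k_proj.")]
            v = state_dict[name.replace(".q_proj.", ".v_proj.")]
            fused[base] = torch.cat([tensor, k, v], dim=0)
        elif ".k_proj." in name or ".v_proj." in name:
            continue  # already folded into qkv_proj above
        else:
            fused[name] = tensor
    return fused

# Example with dummy shapes and made-up parameter names.
sd = {
    "layers.0.self_attn.q_proj.weight": torch.randn(8, 8),
    "layers.0.self_attn.k_proj.weight": torch.randn(8, 8),
    "layers.0.self_attn.v_proj.weight": torch.randn(8, 8),
    "layers.0.mlp.down_proj.weight": torch.randn(8, 8),
}
fused = fuse_qkv(sd)
assert fused["layers.0.self_attn.qkv_proj.weight"].shape == (24, 8)
```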