Results 79 comments of HandH1998

I agree with you. And I think the original code has a bug. `sum(self.num_tokens[:block_idx])` should be right.
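
For context, a minimal sketch of the intended prefix-sum behavior; `BlockIndex` and `block_offset` are illustrative names, only `num_tokens` and `block_idx` come from the snippet above:

```python
# Minimal sketch: self.num_tokens holds the token count of each block, and the
# offset of block `block_idx` is the sum over all *preceding* blocks
# (an exclusive prefix sum), i.e. sum(self.num_tokens[:block_idx]).
class BlockIndex:
    def __init__(self, num_tokens):
        self.num_tokens = num_tokens  # e.g. [4, 7, 3] tokens in blocks 0, 1, 2

    def block_offset(self, block_idx):
        return sum(self.num_tokens[:block_idx])

idx = BlockIndex([4, 7, 3])
assert idx.block_offset(0) == 0   # nothing precedes block 0
assert idx.block_offset(2) == 11  # 4 + 7 tokens precede block 2
```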

> Thanks for your answer. And did you apply partial quantization, which means the down_proj layer remains in FP16 because of its large activation range? As you know...

> Thanks for your answer. And did you apply partial quantization, which means the down_proj layer remains in FP16 because of its large activation range? As...
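
For readers landing here without context, a generic illustration of the "partial quantization" the question refers to: linear layers whose activations have a large dynamic range, such as down_proj, are skipped during quantization and left in FP16. The skip list and the `quantize_linear` callback below are hypothetical, not tied to any particular repository:

```python
import torch.nn as nn

# Hypothetical skip list: substrings of module names to keep in FP16.
SKIP_MODULES = ("down_proj",)

def partially_quantize(model, quantize_linear):
    """Quantize every nn.Linear except those matching SKIP_MODULES."""
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if any(skip in name for skip in SKIP_MODULES):
            continue  # large activation range -> leave this layer in FP16
        quantize_linear(name, module)  # user-supplied per-layer quantizer
```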

> Hi, does this per-token quantization patch only support a single card?
>
> I tested this patch on A10 with llama2-7b, there is no problem if I run with single...

> Use your code, I got this error: module 'lightseq.inference' has no attribute 'Llama'. Could you tell how you bypass this? @HandH1998

It seems that you didn't compile it...
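
As a quick sanity check (a sketch, not lightseq's documented workflow), one can confirm whether the installed package actually exposes the compiled `Llama` class:

```python
# Check whether the installed lightseq build exposes the Llama inference class.
import lightseq.inference as lsi

if hasattr(lsi, "Llama"):
    print("lightseq.inference.Llama is available.")
else:
    print("'Llama' is missing from lightseq.inference; the inference extension "
          "was probably not compiled -- rebuild/reinstall lightseq from source.")
```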

> Hi @HandH1998 Nice work! May you merge the latest main branch and fix the conflicts?

Done.

> @HandH1998 May you resolve the conflicts in these days? After that, @lzhangzz will help rewrite with the TurboMind's style. We should move forward together.

I am working on it...

@zhyncs @lzhangzz I have resolved the conflicts, so you can continue with the optimization work. Two checks failed, but I think they are unrelated to my code.

Currently the QQQ code only supports Qwen's text-only models. If you want to quantize QwenVL's language module, you should modify the code to add support for the vision-encoder part, and use samples that include images for calibration.
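
As a rough sketch of the calibration-data side only (this is not QQQ's actual API; the checkpoint name, the `from_list_format` helper of Qwen-VL-Chat's remote-code tokenizer, and the sample paths/prompts are assumptions):

```python
# Sketch: building image+text calibration samples for Qwen-VL's language module.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

raw_samples = [
    ("calib_images/0001.jpg", "Describe this picture in detail."),  # hypothetical paths/prompts
    ("calib_images/0002.jpg", "What objects are in the image?"),
]

calib_input_ids = []
for image_path, prompt in raw_samples:
    # Interleave the image reference and the text prompt the way Qwen-VL expects.
    query = tokenizer.from_list_format([
        {"image": image_path},
        {"text": prompt},
    ])
    calib_input_ids.append(tokenizer(query, return_tensors="pt").input_ids)

# These ids (together with the image features produced by the vision encoder)
# would then be fed to the modified QQQ calibration loop.
```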