Results 79 comments of HandH1998

I agree with you. And I think the original code has a bug. `sum(self.num_tokens[:block_idx])` should be right.
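
For context, a minimal sketch of the intended prefix-sum behavior; `BlockIndex` and `block_offset` are illustrative names, only `num_tokens` and `block_idx` come from the snippet above:

```python
# Minimal sketch: self.num_tokens holds the token count of each block, and the
# offset of block `block_idx` is the sum over all *preceding* blocks
# (an exclusive prefix sum), i.e. sum(self.num_tokens[:block_idx]).
class BlockIndex:
    def __init__(self, num_tokens):
        self.num_tokens = num_tokens  # e.g. [4, 7, 3] tokens in blocks 0, 1, 2

    def block_offset(self, block_idx):
        return sum(self.num_tokens[:block_idx])

idx = BlockIndex([4, 7, 3])
assert idx.block_offset(0) == 0   # nothing precedes block 0
assert idx.block_offset(2) == 11  # 4 + 7 tokens precede block 2
```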

> Thanks for your answer. And did you apply partial quantization, which means the down_proj layer remains in FP16 because of its large activation range? As you know...

> Thanks for your answer. And did you apply partial quantization, which means the down_proj layer remains in FP16 because of its large activation range? As...
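
For readers landing here without context, a generic illustration of the "partial quantization" the question refers to: linear layers whose activations have a large dynamic range, such as down_proj, are skipped during quantization and left in FP16. The skip list and the `quantize_linear` callback below are hypothetical, not tied to any particular repository:

```python
import torch.nn as nn

# Hypothetical skip list: substrings of module names to keep in FP16.
SKIP_MODULES = ("down_proj",)

def partially_quantize(model, quantize_linear):
    """Quantize every nn.Linear except those matching SKIP_MODULES."""
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if any(skip in name for skip in SKIP_MODULES):
            continue  # large activation range -> leave this layer in FP16
        quantize_linear(name, module)  # user-supplied per-layer quantizer
```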

> Hi, does this per-token quantization patch only support a single card?
>
> I tested this patch on A10 with llama2-7b, there is no problem if I run with single...

> Use your code, I got this error: module 'lightseq.inference' has no attribute 'Llama'. Could you tell how you bypass this? @HandH1998

It seems that you didn't compile it...
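
As a quick sanity check (a sketch, not lightseq's documented workflow), one can confirm whether the installed package actually exposes the compiled `Llama` class:

```python
# Check whether the installed lightseq build exposes the Llama inference class.
import lightseq.inference as lsi

if hasattr(lsi, "Llama"):
    print("lightseq.inference.Llama is available.")
else:
    print("'Llama' is missing from lightseq.inference; the inference extension "
          "was probably not compiled -- rebuild/reinstall lightseq from source.")
```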

> Hi @HandH1998 Nice work! May you merge the latest main branch and fix the conflicts?

Done.

> @HandH1998 May you resolve the conflicts in these days? After that, @lzhangzz will help rewrite with the TurboMind's style. We should move forward together.

I am working on it...

@zhyncs @lzhangzz I have resolved the conflicts, so you can continue with the optimization work. Two checks failed, but I think they are unrelated to my code.

Currently the QQQ code only supports Qwen's text-only models. If you want to quantize QwenVL's language module, you should modify the code to add support for the vision-encoder part, and use samples that include images for calibration.
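
As a rough sketch of the calibration-data side only (this is not QQQ's actual API; the checkpoint name, the `from_list_format` helper of Qwen-VL-Chat's remote-code tokenizer, and the sample paths/prompts are assumptions):

```python
# Sketch: building image+text calibration samples for Qwen-VL's language module.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

raw_samples = [
    ("calib_images/0001.jpg", "Describe this picture in detail."),  # hypothetical paths/prompts
    ("calib_images/0002.jpg", "What objects are in the image?"),
]

calib_input_ids = []
for image_path, prompt in raw_samples:
    # Interleave the image reference and the text prompt the way Qwen-VL expects.
    query = tokenizer.from_list_format([
        {"image": image_path},
        {"text": prompt},
    ])
    calib_input_ids.append(tokenizer(query, return_tensors="pt").input_ids)

# These ids (together with the image features produced by the vision encoder)
# would then be fed to the modified QQQ calibration loop.
```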