Hongbo Xu

33 comments by Hongbo Xu

I got the same error; setting `inject_fused_mlp=False` and `inject_fused_attention=False` works for me.

```python
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_triton=False,
    inject_fused_mlp=False,
    inject_fused_attention=False,
)
```

After quantization, I built the model:

```bash
python build.py --model_dir /target/model/hf_model_v15 \
    --quant_ckpt_path /target/model/quantized_int4-awq/llama_tp1_rank0.npz \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_weight_only \
    ...
```

> Hello, have you solved this issue? I also encountered the same issue.

I have solved this. You should modify the function `load_from_awq_llama` in `weight.py`.

On mobile the screen is small, so the font cannot be displayed in full; a simple workaround is to show only one or two characters at a time.

> > Hi @alexsamardzic, thanks for working on this. Just wanted to clarify, will this kernel support int4 grouped per channel weight quantization + int8 per token dynamic activation quantization?...
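For reference, here is a minimal sketch of what that combination means numerically: grouped per-channel scales for the int4 weights, plus a fresh per-token scale for the int8 activations computed at runtime. This only illustrates the quantization scheme being asked about; it is not code from the PR, and the helper names and group size are my own.

```python
import torch

def quantize_weight_int4_grouped(w: torch.Tensor, group_size: int = 128):
    # w: (out_features, in_features); symmetric int4 range is [-8, 7].
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(wg / scale), -8, 7).to(torch.int8)
    return q.reshape(out_f, in_f), scale.squeeze(-1)   # one scale per (channel, group)

def quantize_activation_int8_per_token(x: torch.Tensor):
    # x: (tokens, in_features); one scale per token (row), recomputed on every call,
    # which is what "dynamic" per-token activation quantization refers to.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale
```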

> > How can I integrate this PR with PyTorch? Are there any example codes available? @alexsamardzic > > The primary motivation for this PR is to have this...

> > I'm a beginner with CUTLASS, and I have no idea how to use my own constructed s4/s8 data to run this GEMM. Could you please provide an example code...

> > I have two s4 values packed in a single byte (uint8). Do I need to manually unpack the uint8 data to get s4 data before the GEMM? > > No,...
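Since packing comes up repeatedly here, below is a small NumPy sketch of how two signed 4-bit values fit into one uint8 and how to recover them for a CPU reference check. The nibble order (low nibble holds the even-index value) is an assumption and may not match the layout the kernel expects.

```python
import numpy as np

def pack_s4_pairs(vals: np.ndarray) -> np.ndarray:
    # vals: signed 4-bit values in [-8, 7] stored as int8, even length.
    v = vals.astype(np.uint8) & 0x0F              # keep the two's-complement nibble
    return (v[0::2] | (v[1::2] << 4)).astype(np.uint8)

def unpack_s4_pairs(packed: np.ndarray) -> np.ndarray:
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    # Sign-extend each nibble from 4 bits back to 8 bits.
    lo = np.where(lo > 7, lo - 16, lo).astype(np.int8)
    hi = np.where(hi > 7, hi - 16, hi).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

vals = np.array([-8, 7, 3, -1], dtype=np.int8)
assert np.array_equal(unpack_s4_pairs(pack_s4_pairs(vals)), vals)
```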

> > Assuming that A is int8 and (M, K), B is int4 and (K, N), after GEMM `C = A·B`, C will be (M, N). Now, I have...
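To double-check shapes and results independently of the kernel, the same product can be formed on the CPU from the unpacked int4 values. This is only a reference computation under assumed sizes, not CUTLASS code, and the int32 accumulator is an assumption about what the kernel accumulates in.

```python
import numpy as np

M, N, K = 16, 32, 64
A = np.random.randint(-128, 128, size=(M, K), dtype=np.int8)   # int8 activations
B_s4 = np.random.randint(-8, 8, size=(K, N), dtype=np.int8)    # unpacked int4 weights

# CPU reference: accumulate in int32 to avoid overflow; the result is (M, N).
C_ref = A.astype(np.int32) @ B_s4.astype(np.int32)
assert C_ref.shape == (M, N)
```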

> > Thanks, I’m trying this, but it’s not going well currently. To make it clearer, what I want to do is exactly the following: > > ``` > >...