
How to run inference in 4-bit?

DanielProkhorov opened this issue 2 years ago · 1 comment

When I load the model in 4bit:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "cckevinn/SeeClick",
    device_map="auto",
    trust_remote_code=True,
    load_in_4bit=True,
    do_sample=True,
    temperature=1e-3,
).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

I get the following error during inference:

RuntimeError: Input type (torch.cuda.ByteTensor) and weight type (torch.cuda.HalfTensor) should be the same
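For context, this error is a plain dtype mismatch: an input tensor (here a `uint8` image tensor) reaches a layer whose weights are half precision. A minimal stand-alone reproduction of the fix (not SeeClick code; a bare Conv2d stands in for the visual encoder, and bf16 is used since the model was fine-tuned in bf16):

```python
import torch

# A uint8 "image" tensor, as produced by raw image loading.
x = torch.zeros(1, 3, 8, 8, dtype=torch.uint8)

# A stand-in layer with reduced-precision weights, like a quantized
# model's visual encoder.
conv = torch.nn.Conv2d(3, 4, kernel_size=3).to(torch.bfloat16)

# conv(x) at this point would raise the same kind of RuntimeError:
# input and weight dtypes differ. Casting the input to the weight
# dtype before the forward pass resolves it.
x = x.to(conv.weight.dtype)
y = conv(x)
```

So one direction to investigate is where the image tensor is (or is not) cast before entering the visual encoder.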

DanielProkhorov avatar Jan 29 '24 14:01 DanielProkhorov

Our fine-tuning was done in bf16, so for quantized inference you may need to check the Qwen-VL repository.

njucckevin avatar Jan 29 '24 15:01 njucckevin