
is it possible to prune gptq models?

Open GrailFinder opened this issue 2 years ago • 3 comments

import torch
from transformers import AutoModelForCausalLM

def get_llm(model, cache_dir="llm_weights"):
    # Load the model in fp16 and shard it across available devices.
    model = AutoModelForCausalLM.from_pretrained(
        model,
        torch_dtype=torch.float16,
        cache_dir=cache_dir,
        low_cpu_mem_usage=True,
        device_map="auto",
    )

    # Context length used when building calibration data.
    model.seqlen = 2048
    return model

I am interested in whether it is possible to prune 4-bit GPTQ models, possibly also with a different sequence length?

GrailFinder avatar Jun 28 '23 13:06 GrailFinder

Do you mean pruning a model that is already quantized to 4-bit/8-bit? Or starting from a dense fp16 model and performing joint quantization and pruning? For the latter, I think it is possible to combine the two compression techniques.
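To make the "joint pruning and quantization" idea concrete, here is a toy sketch (not the Wanda algorithm itself, and not GPTQ) that applies 50% per-row magnitude pruning to a weight matrix and then naive symmetric 4-bit quantization; all function names here are illustrative:

```python
import numpy as np

def prune_50(W):
    """Zero the 50% smallest-magnitude weights in each row."""
    W = W.copy()
    k = W.shape[1] // 2
    # Indices of the k smallest |w| per row.
    idx = np.argsort(np.abs(W), axis=1)[:, :k]
    np.put_along_axis(W, idx, 0.0, axis=1)
    return W

def quantize_4bit(W):
    """Naive symmetric per-row 4-bit quantization (integer levels -8..7)."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0
    q = np.clip(np.round(W / scale), -8, 7)
    return q * scale  # dequantized back to float for inspection

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)
W_sparse = prune_50(W)
W_compressed = quantize_4bit(W_sparse)
# Half of each row is exactly zero, and zeros survive quantization.
assert (W_sparse == 0).sum() == W.size // 2
assert (W_compressed[W_sparse == 0] == 0).all()
```

Note the ordering matters in practice: pruning first and quantizing second keeps the pruned positions exactly zero, since zero always maps to a quantization level under a symmetric scheme.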

In terms of sequence length, you can adjust the calibration data accordingly. Basically, the calibration samples should have the same context length as the model.
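As a minimal sketch of that point, assuming you already have a long tokenized stream, the calibration set is just chunks whose length matches `model.seqlen` (the helper name `make_calibration_set` is hypothetical, not part of the wanda codebase):

```python
import numpy as np

def make_calibration_set(token_ids, seqlen, nsamples):
    """Cut nsamples non-overlapping chunks of length seqlen from a token stream."""
    samples = []
    for i in range(nsamples):
        start = i * seqlen
        samples.append(token_ids[start:start + seqlen])
    return np.stack(samples)

tokens = np.arange(10 * 4096)  # stand-in for a tokenized corpus
calib = make_calibration_set(tokens, seqlen=4096, nsamples=8)
assert calib.shape == (8, 4096)  # each sample matches the model's context length
```

So pruning a model with a 4096-token context would just mean setting `model.seqlen = 4096` and building calibration samples of that length.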

Eric-mingjie avatar Jun 29 '23 10:06 Eric-mingjie

@Eric-mingjie Hi, I had a question on this as well. Would it be possible to prune a 7B model to 50% sparsity and further quantize it to 4-bit? If so, what would be the expected memory usage for the model, assuming the dense fp16 model uses ~14 GB?

epinnock avatar Sep 07 '23 23:09 epinnock

Hi @epinnock, I think memory usage cannot be reduced on GPU with unstructured sparsity alone (the zeroed weights are still stored densely), but with quantization you do get a memory reduction.
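The back-of-envelope arithmetic for the 7B case above, counting weight storage only (activations and KV cache excluded):

```python
# 7B parameters: bytes per weight determine the footprint.
params = 7e9
fp16_gb = params * 2 / 1e9    # 2 bytes per weight -> 14.0 GB
int4_gb = params * 0.5 / 1e9  # 4 bits per weight  -> 3.5 GB
print(fp16_gb, int4_gb)  # prints: 14.0 3.5
# Zeroed weights kept in a dense fp16 tensor still occupy 2 bytes each,
# so 50% unstructured sparsity by itself leaves the footprint at ~14 GB;
# the 4x saving comes from the 4-bit quantization step.
```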

Eric-mingjie avatar Sep 22 '23 23:09 Eric-mingjie