Is it possible to prune GPTQ models?
```python
import torch
from transformers import AutoModelForCausalLM

def get_llm(model_name, cache_dir="llm_weights"):
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        cache_dir=cache_dir,
        low_cpu_mem_usage=True,
        device_map="auto",
    )
    model.seqlen = 2048
    return model
```
I am interested in whether it is possible to prune 4-bit GPTQ models, and perhaps also with a different sequence length?
Do you mean pruning a model that is already quantized to 4-bit/8-bit? Or, given a dense fp16 model, performing joint quantization and pruning? For the latter, I think it is possible to combine the two compression techniques.
In terms of sequence length, you can adjust the calibration data accordingly: the calibration data should contain sequences with the same context length as the model.
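To illustrate the point about calibration data, here is a minimal sketch of how one might chop a long tokenized stream into fixed-length calibration samples matching the model's context length (`model.seqlen` in the snippet above). The helper name `make_calibration_samples` is hypothetical, not part of any library:

```python
def make_calibration_samples(token_ids, seqlen, nsamples):
    """Split a flat list of token ids into up to `nsamples` sequences,
    each exactly `seqlen` tokens long, to use as calibration data.

    `token_ids` would typically come from tokenizing a text corpus;
    this sketch only handles the chunking step.
    """
    samples = []
    for i in range(nsamples):
        chunk = token_ids[i * seqlen : (i + 1) * seqlen]
        if len(chunk) < seqlen:
            break  # not enough tokens left for a full-length sample
        samples.append(chunk)
    return samples
```

If you change the model's context length, you only need to regenerate the samples with the new `seqlen`; the pruning procedure itself is unchanged.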
@Eric-mingjie Hi, I had a question on this as well: would it be possible to prune a 7B model to 50% sparsity and further quantize it to 4-bit? If so, what would be the expected memory usage for the model, assuming dense fp16 uses ~14 GB?
Hi @epinnock, I don't think memory usage on GPU can be reduced with sparsity (the pruned zeros are still stored in the dense weight tensors), but with quantization you can get a memory reduction.
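To make the arithmetic behind those numbers concrete, here is a rough back-of-the-envelope estimate of weight-storage memory. The helper is a hypothetical sketch; it ignores activations, the KV cache, and quantization scale/zero-point overhead:

```python
def approx_weight_memory_gb(n_params_billion, bits_per_weight):
    """Rough weight-storage estimate: parameters * (bits / 8) bytes each.

    Ignores activation memory, KV cache, and quantization metadata
    (group scales / zero points), which add some overhead in practice.
    """
    return n_params_billion * bits_per_weight / 8.0

# Dense fp16 7B: 7 * 16 / 8 = 14.0 GB, matching the ~14 GB figure above.
# 4-bit 7B:      7 *  4 / 8 =  3.5 GB.
# 50% unstructured sparsity on top changes neither number unless the
# zeros are actually stored in a compressed sparse format.
```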