Truenorth8 comments

Results: 2 comments of Truenorth8

Improve inference speed of Santacoder and Starcoder (and others)

@jlamypoirier These are great suggestions. Have any of these found their way upstream? If not, is your version available anywhere? Edit: especially curious about "Compute the model head only..."

Support W8A8 inference in vllm

@AniZpZ Existing methods (AWQ, GPTQ) go down to 4-bit quantization, saving lots of memory. The speed improvements of 8-bit inference come during inference, which theoretically could be combined with AWQ. Would...