Tai Duc Nguyen
I am seeing an extreme slowdown with MatthewsCorrCoef too. What used to take less than a second for me now takes 10 minutes! Reverting to 0.9.0 or 0.8.2 works just...
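For reference, the metric itself is cheap to compute once you have the binary confusion matrix, so a 10-minute runtime points at overhead elsewhere. A plain-Python sketch of the formula (not the torchmetrics implementation):

```python
import math

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Defined as (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN));
    returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Perfect predictions give 1.0, perfectly inverted predictions give -1.0, which is a quick sanity check against whatever a given torchmetrics version returns.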
Hey guys, here's a PR I made to do this: https://github.com/ggerganov/llama.cpp/pull/403. Please check it out. If you have any questions, don't hesitate to ask here.
> I converted a 30b 4bit ggml model https://huggingface.co/Pi3141/alpaca-30B-ggml/tree/main back to pytorch (hf), but the resulting file was 65gb instead of about 20gb
>
> Is it possible for 4bit...
Well, I suppose they quantize the weights to 4bit and then save them as 4bit, which you can do easily with a bit of modification to my code. However, at inference,...
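To illustrate the idea, a minimal symmetric 4-bit round-trip might look like the sketch below. This is only an illustration, not llama.cpp's actual quantization scheme (which works on blocks of weights with per-block scales):

```python
# Hypothetical sketch: symmetric 4-bit quantization of a weight list.
# Signed 4-bit integers cover -8..7, so we scale the max magnitude to 7.

def quantize_4bit(weights):
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid div-by-zero on all-zero input
    quants = [max(-8, min(7, round(w / scale))) for w in weights]
    return quants, scale

def dequantize_4bit(quants, scale):
    # Reconstruction: each weight is recovered only up to +/- scale/2.
    return [q * scale for q in quants]
```

The round-trip error is bounded by half the scale, which is exactly the information lost when saving in 4bit: you can dequantize back to float16/float32 for a pytorch checkpoint, but the file then balloons because every 4-bit value becomes a full float again.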
@anzz1 Thank you for your comment. However, what if you want to study the effect of finetuning on quantized models? Or simply want to look at the distribution of weights...
@anzz1 @ggerganov Any idea how I can get this PR reviewed/accepted? I am willing to put in more work to make it run correctly and smoothly.
> @ggerganov any reason why this was removed from main?

I think it's because some time ago there were lots and lots of breaking changes to the implementation that the...
I was able to modify the mcp.py file and it's working in my tests. I also added schema validation and error handling for invalid input parameters. Let me know if...
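The validation I added might look something like the following minimal sketch (the function name and schema shape are illustrative, not the actual mcp.py code):

```python
# Hypothetical sketch: reject invalid tool parameters before dispatch.
# `schema` maps a parameter name to {"type": <python type>, "required": <bool>}.

def validate_params(params: dict, schema: dict) -> list[str]:
    """Return a list of human-readable errors; empty list means valid."""
    errors = []
    for name, spec in schema.items():
        if name not in params:
            if spec.get("required", False):
                errors.append(f"missing required parameter: {name}")
            continue
        expected = spec["type"]
        if not isinstance(params[name], expected):
            errors.append(f"parameter {name!r} must be {expected.__name__}")
    return errors
```

Returning an error list (rather than raising on the first problem) lets the handler report every invalid field to the client in one response.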
> Also, btw since in LitServe the decode_request argument is bound to be called `request` - the MCP properties must be request.

I think you are a bit confused here...