BrickBee
> But when attempting to run an imatrix calculation

Same for me with some DeepSeek-based models, which Gorilla is based on. Inference for FP16 and Q8 works, but imatrix calculation...
> error loading model: llama.cpp: tensor 'layers.0.feed_forward.w1.weight' has wrong shape; expected 3200 x 8704, got 3200 x 8640

Same for me. It is also broken in the original commit (ffb06a345e3a9e30d39aaa5b46a23201a74be6de),...
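For context, the mismatch follows from the feed-forward rounding that the pre-GGUF loader applies. Below is a minimal sketch of that calculation, assuming the rounding formula from llama.cpp's loader of that era; the `n_mult` values are illustrative:

```python
# Sketch of the old llama.cpp n_ff rounding (pre-GGUF); n_mult values are illustrative.
def n_ff(n_embd: int, n_mult: int) -> int:
    return ((2 * (4 * n_embd) // 3 + n_mult - 1) // n_mult) * n_mult

print(n_ff(3200, 256))  # 8704 -> what the loader expects with the usual multiple of 256
print(n_ff(3200, 216))  # 8640 -> the feed-forward width the 3b checkpoint actually uses
```

Presumably the convert.py diff linked a couple of comments below resolves the same mismatch by writing header values for which this rounding lands on 8640 instead of 8704.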
I can confirm that the quantized files you've linked work fine with the release version you've linked. My quantized versions that I created at the time of...
Conversion and fp16 inference work after applying this [diff](https://huggingface.co/SlyEcho/open_llama_3b_ggml/blob/main/convert.py.diff). This was, by the way, the original point of this issue: the 3b model can't be used with the current code...
Potentially related to issue [8760](https://github.com/ggerganov/llama.cpp/issues/8760#issuecomment-2315639527), which also mentions the difference between (IQ1, IQ2, IQ3) and (IQ4 / K).
It might also improve performance if a setting were added to pin the router/weighting network to either VRAM or RAM (selectable). This could reduce the paging-in-from-disk time for...
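As a rough illustration of that idea (not an existing llama.cpp option), a mmap-based loader could hint the kernel to keep just the router tensors resident while leaving the expert tensors pageable. A minimal sketch, assuming Linux and a placeholder `router_tensor_ranges()` helper that would read the tensor offsets from the GGUF metadata:

```python
import mmap

def keep_router_resident(mm: mmap.mmap, offset: int, length: int) -> None:
    """Prefetch a byte range of the model file and hint the kernel to keep it paged in."""
    page = mmap.PAGESIZE
    start = offset - (offset % page)             # madvise() needs a page-aligned start
    mm.madvise(mmap.MADV_WILLNEED, start, offset + length - start)

with open("model.gguf", "rb") as f:              # placeholder file name
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for off, size in router_tensor_ranges(mm):   # placeholder helper, not a real API
        keep_router_resident(mm, off, size)
```

A hard pin would need mlock() (or keeping those tensors in VRAM on the GPU side), but the effect is the same: the small gating weights stay hot while the large expert weights remain pageable.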
I did some token-generation testing on the 7950X3D with IQ and regular quants. It appears that the IQ quants are simply way more computationally expensive than the K quants,...
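For anyone who wants to reproduce that comparison, here is a minimal timing sketch using the llama-cpp-python bindings; the model file names are placeholders and the thread count would be matched to the CPU under test:

```python
import time
from llama_cpp import Llama  # llama-cpp-python bindings

def tokens_per_second(model_path: str, n_tokens: int = 128, n_threads: int = 16) -> float:
    llm = Llama(model_path=model_path, n_ctx=512, n_threads=n_threads, verbose=False)
    start = time.perf_counter()
    out = llm("Once upon a time", max_tokens=n_tokens, temperature=0.0)
    elapsed = time.perf_counter() - start
    return out["usage"]["completion_tokens"] / elapsed

for path in ["model-IQ4_XS.gguf", "model-Q4_K_M.gguf"]:  # placeholder file names
    print(path, round(tokens_per_second(path), 1), "t/s")
```

The llama-bench tool in the llama.cpp tree is the more standard way to measure this; the sketch only shows the shape of the measurement.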