Output of quantized Vicuna is so poor that I can't use it
Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [√] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [√] I carefully followed the README.md.
- [√] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- [√] I reviewed the Discussions, and have a new bug or useful enhancement to share.
Current Behavior
In my llama.cpp environment, following the README, I produced the following files:
vicuna-7b-hf => ggml-model-f16.bin => ggml-model-q4_0.bin
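For reference, the conversion followed the README's convert-and-quantize steps, roughly like this (a sketch; paths are illustrative and the exact script name and flags may differ between versions):

python3 convert.py /media/vicuna-7b-hf/ --outtype f16   # writes ggml-model-f16.bin (script/flag names may vary by version)
./quantize /media/vicuna-7b-hf/ggml-model-f16.bin /media/ggml-model-q4_0.bin q4_0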
I executed the command:
./main -m /media/ggml-model-q4_0.bin -p "You are a linguistics professor, translate this sentences from Englisth to Chinese: Across the Great Wall, we can reach every corner in the world. Do not repeate the instruction." -n 512
and got this output:
You are a linguistics professor, translate this sentences from Englisth to Chinese: Across the Great Wall, we can reach every corner in the world. Do not repeate the instruction. Translate these sentences from English to Chinese:
- Crossing the Great Wall, you can reach all the places on earth. Don't repeat the instructions.
- We can learn about different cultures at home and abroad by exchanging ideas with other people who are interested in international affairs. Don't repeat the instruction.
- Our school has a unique advantage: we have both Chinese students and overseas students, which enables us to interact with each other. Don't repeat the instructions. [end of text]
This is just one of my experiments; the output of quantized Vicuna is so poor that I can't do anything useful with it, and I'm not sure what's wrong with my quantized model. Has anyone else run into this?
Environment and Context
- Docker Toolbox 1.13.1; Docker client 1.13.1 (os/arch: Windows 7/amd64); Docker server 19.03.12 (os/arch: Ubuntu 22.04/amd64)
- CPU type: Intel Core i7 6700; supported instruction sets: MMX, SSE, SSE2, ......, AVX, AVX2, FMA3, TSX
I don't have experience with that particular model, but I notice you are attempting a fairly complex instruction-following translation with a 7B model. Even if it is very well instruction-tuned, I have yet to see a 7B model that can do that type of translation well and follow such a relatively complex instruction. For your report (which I don't think is well suited as a bug report for the project in general), you should have shown a full-precision example of the expected behavior, not just your expectations, given that you are complaining about quantization. Secondly, you used q4_0, which is the worst available variant in terms of precision. After confirming that 16-bit precision works for your purpose, you might want to try q4_1, q5_x, and q8_0 to see how those perform.
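For example (a sketch, assuming the quantize and main binaries built from this repo; type names may vary slightly between versions):

./quantize /media/ggml-model-f16.bin /media/ggml-model-q8_0.bin q8_0   # highest-precision integer quant
./quantize /media/ggml-model-f16.bin /media/ggml-model-q4_1.bin q4_1
./main -m /media/ggml-model-f16.bin -p "<your prompt>" -n 512          # baseline: full f16 precision
./main -m /media/ggml-model-q8_0.bin -p "<your prompt>" -n 512         # then compare each quant against it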
@JerryYao80 You didn't use the correct prompt format for Vicuna models. You also asked it to translate from "Englisth" to Chinese.
I don't mean to criticize your English, and I hope my words haven't made you uncomfortable. Your English is clearly much better than my Chinese!
Because LLMs just complete text, the input makes a huge difference. Typos and grammar mistakes in the prompt will unfortunately tend to produce low-quality output, as will not using the prompt format the model expects.
I'd also note that while Vicuna can speak a little Mandarin, that only made up a small part of its training. Even with the best possible prompting, I wouldn't expect the results for translations or generating text to be very good (especially if you're using a 7B model).
Also, this really isn't a llama.cpp issue unless it's a tokenizer problem. You can confirm whether the input tokens match the vocab.
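For example, a quick way to inspect the tokenization (a sketch; the --verbose-prompt flag should print the prompt tokens before generation if your build has it):

./main -m /media/ggml-model-q4_0.bin --verbose-prompt -n 1 -p "Across the Great Wall, we can reach every corner in the world."

If the printed tokens look sensible for the vocab, the problem is almost certainly prompting/model quality rather than llama.cpp itself.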
Does the following work better?
./main -m /media/ggml-model-q4_0.bin -p "### Human: You are a linguistics professor, translate this sentence from English to Chinese: Across the Great Wall, we can reach every corner in the world.
### Assistant:" -n 512
This issue was closed because it has been inactive for 14 days since being marked as stale.