Hallucination for Llama3-8B-1.58-100B-tokens model with both i2_s and tl2 quantization
Type of issue
- Thanks guys for this awesome work. I was curious to run Llama3-8B on my personal CPU, and the performance is quite impressive (nearly 2x llama.cpp for the same model size on the same HW).
- However, I was quite surprised by how much hallucination the model was generating. Basically, for any prompt I tried, the model generates a few tokens to begin with and then keeps repeating the same sentence over and over again.
- For example, this is the output using the i2_s quantization type:
(bitnet-cpp) C:\Users\ahouz\Desktop\aahouzi\BitNet>python run_inference.py -m models\Llama3-8B-1.58-100B-tokens\ggml-model-i2_s.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0 -t 18
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3947 (406a5036) with Clang 17.0.3 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
................................................
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 3669.02 MiB
................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 16.16 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 18
system_info: n_threads = 18 (n_threads_batch = 18) / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 4294967295
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1
Once upon a time, there was a girl who was very beautiful. She was so beautiful that she was called the most beautiful girl in the world. She was called the most beautiful girl in the world because she was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She
llama_perf_sampler_print: sampling time = 13.60 ms / 139 runs ( 0.10 ms per token, 10222.84 tokens per second)
llama_perf_context_print: load time = 1256.47 ms
llama_perf_context_print: prompt eval time = 514.62 ms / 11 tokens ( 46.78 ms per token, 21.38 tokens per second)
llama_perf_context_print: eval time = 6035.57 ms / 127 runs ( 47.52 ms per token, 21.04 tokens per second)
llama_perf_context_print: total time = 6591.15 ms / 138 tokens
- The same issue happens again when trying tl2 quantization:
(bitnet-cpp) C:\Users\ahouz\Desktop\aahouzi\BitNet>python run_inference.py -m models\Llama3-8B-1.58-100B-tokens\ggml-model-tl2.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0 -t 18
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3947 (406a5036) with Clang 17.0.3 for x64
main: llama backend init
............................................
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 3.33 GiB (3.56 BPW)
llm_load_print_meta: general.name = Llama3-8B-1.58-100B-tokens
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 3405.69 MiB
............................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 16.16 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 18
system_info: n_threads = 18 (n_threads_batch = 18) / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 4294967295
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1
Once upon a time, there was a girl who was very beautiful. She was so beautiful that she was called the most beautiful girl in the world. She was called the most beautiful girl in the world because she was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She
llama_perf_sampler_print: sampling time = 12.62 ms / 139 runs ( 0.09 ms per token, 11016.01 tokens per second)
llama_perf_context_print: load time = 1245.30 ms
llama_perf_context_print: prompt eval time = 646.28 ms / 11 tokens ( 58.75 ms per token, 17.02 tokens per second)
llama_perf_context_print: eval time = 7543.01 ms / 127 runs ( 59.39 ms per token, 16.84 tokens per second)
llama_perf_context_print: total time = 8228.56 ms / 138 tokens
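In both runs the sampler log shows temp = 0.000 and repeat_penalty = 1.000, so the chain effectively reduces to pure greedy decoding with no repetition penalty, and a base (non-instruct) model is particularly prone to locking into a loop under those settings. As a sanity check, the same run could be repeated with a non-zero temperature, using only the flags already shown above (0.8 is just an illustrative value, not a recommendation):
python run_inference.py -m models\Llama3-8B-1.58-100B-tokens\ggml-model-i2_s.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0.8 -t 18
If the output still collapses into the same loop with sampling enabled, that points at the model weights rather than the decoding settings.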
CPU
Intel Core Ultra 7 155H
OS
Windows 11
I ask "who are you?` and look at what it reply🤣.
These models (available on Hugging Face) are neither trained nor released by Microsoft. The tested models are used in a research context to demonstrate the inference performance of bitnet.cpp.
Thanks @dawnmsg for your answer. In the paper "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits", specifically in Table 2, you demonstrated that your BitNet b1.58 model matches or even surpasses LLaMA's accuracy on various tasks based on the evaluation metrics.
Could you possibly share the BitNet b1.58 LLM used in your research, as it should be more accurate than the public 1bit-LLM available on Hugging Face?
I think they have used this model: https://huggingface.co/1bitLLM/bitnet_b1_58-3B
I'll also try this model, as I got the same results as you.
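For anyone who wants to reproduce that, a minimal sketch of pulling the checkpoint and preparing it with the same scripts used earlier in this thread (the local directory name is arbitrary, huggingface-cli is assumed to be installed, and I'm assuming setup_env.py can convert this checkpoint and writes the GGUF under the same ggml-model-i2_s.gguf naming pattern seen above):
huggingface-cli download 1bitLLM/bitnet_b1_58-3B --local-dir models/bitnet_b1_58-3B
python setup_env.py -md models/bitnet_b1_58-3B -q i2_s
python run_inference.py -m models/bitnet_b1_58-3B/ggml-model-i2_s.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0 -t 18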
When running "python setup_env.py -md models/Llama3-8B-1.58-100B-tokens -q i2_s" on my Windows VM, I get this error:
Traceback (most recent call last):
File "C:\BitNet\setup_env.py", line 202, in
What is missing? Any clue?
Please try the latest model on HF: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf
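A sketch of what that could look like with the same -md / -q workflow used earlier in the thread (repo id taken from the link above, the local directory name is arbitrary, and I'm assuming the packaged file keeps the ggml-model-i2_s.gguf naming pattern):
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-gguf --local-dir models/bitnet-b1.58-2B-4T
python setup_env.py -md models/bitnet-b1.58-2B-4T -q i2_s
python run_inference.py -m models/bitnet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Who are you?" -n 128 -temp 0 -t 18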