Hallucination for Llama3-8B-1.58-100B-tokens model with both i2_s and tl2 quantization
Type of issue
- Thanks guys for this awesome work. I was curious to run Llama3-8B on my personal CPU, and the performance is quite impressive (nearly 2x llama.cpp for the same model size on the same HW).
- However, I was quite surprised by how much hallucination the model was generating. Basically, for any prompt I tried, the model generates a few tokens to begin with and then keeps repeating the same sentence over and over again.
- For example, this is the output using the i2_s quantization type:
(bitnet-cpp) C:\Users\ahouz\Desktop\aahouzi\BitNet>python run_inference.py -m models\Llama3-8B-1.58-100B-tokens\ggml-model-i2_s.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0 -t 18
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3947 (406a5036) with Clang 17.0.3 for x64
main: llama backend init
main: load the model and apply lora adapter, if any
................................................
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 3669.02 MiB
................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 16.16 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 18
system_info: n_threads = 18 (n_threads_batch = 18) / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 4294967295
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1
Once upon a time, there was a girl who was very beautiful. She was so beautiful that she was called the most beautiful girl in the world. She was called the most beautiful girl in the world because she was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She
llama_perf_sampler_print: sampling time = 13.60 ms / 139 runs ( 0.10 ms per token, 10222.84 tokens per second)
llama_perf_context_print: load time = 1256.47 ms
llama_perf_context_print: prompt eval time = 514.62 ms / 11 tokens ( 46.78 ms per token, 21.38 tokens per second)
llama_perf_context_print: eval time = 6035.57 ms / 127 runs ( 47.52 ms per token, 21.04 tokens per second)
llama_perf_context_print: total time = 6591.15 ms / 138 tokens
- The same issue happens again when trying tl2 quantization:
(bitnet-cpp) C:\Users\ahouz\Desktop\aahouzi\BitNet>python run_inference.py -m models\Llama3-8B-1.58-100B-tokens\ggml-model-tl2.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0 -t 18
warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3947 (406a5036) with Clang 17.0.3 for x64
main: llama backend init
............................................
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 3.33 GiB (3.56 BPW)
llm_load_print_meta: general.name = Llama3-8B-1.58-100B-tokens
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.14 MiB
llm_load_tensors: CPU buffer size = 3405.69 MiB
............................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 32
llama_new_context_with_model: n_ubatch = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CPU KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 16.16 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 18
system_info: n_threads = 18 (n_threads_batch = 18) / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 4294967295
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1
Once upon a time, there was a girl who was very beautiful. She was so beautiful that she was called the most beautiful girl in the world. She was called the most beautiful girl in the world because she was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She was so beautiful that she was called the most beautiful girl in the world. She
llama_perf_sampler_print: sampling time = 12.62 ms / 139 runs ( 0.09 ms per token, 11016.01 tokens per second)
llama_perf_context_print: load time = 1245.30 ms
llama_perf_context_print: prompt eval time = 646.28 ms / 11 tokens ( 58.75 ms per token, 17.02 tokens per second)
llama_perf_context_print: eval time = 7543.01 ms / 127 runs ( 59.39 ms per token, 16.84 tokens per second)
llama_perf_context_print: total time = 8228.56 ms / 138 tokens
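In both runs the sampler log shows temp = 0.000 and repeat_penalty = 1.000, so the chain effectively reduces to pure greedy decoding with no repetition penalty, and a base (non-instruct) model is particularly prone to locking into a loop under those settings. As a sanity check, the same run could be repeated with a non-zero temperature, using only the flags already shown above (0.8 is just an illustrative value, not a recommendation):
python run_inference.py -m models\Llama3-8B-1.58-100B-tokens\ggml-model-i2_s.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0.8 -t 18
If the output still collapses into the same loop with sampling enabled, that points at the model weights rather than the decoding settings.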
CPU
Intel Core Ultra 7 155H
OS
Windows 11
I ask "who are you?` and look at what it reply🤣.
These models (available on Hugging Face) are neither trained nor released by Microsoft. The tested models are used in a research context to demonstrate the inference performance of bitnet.cpp.
Thanks @dawnmsg for your answer. In the paper "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits", specifically in Table 2, you demonstrated that your BitNet b1.58 model matches or even surpasses LLaMA's accuracy on various tasks based on the evaluation metrics.
Could you possibly share the BitNet b1.58 LLM used in your research, as it should be more accurate than the public 1bit-LLM available on Hugging Face?
I think they have used this model: https://huggingface.co/1bitLLM/bitnet_b1_58-3B
I'll also try this model, as I got the same results as you.
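For anyone who wants to reproduce that, a minimal sketch of pulling the checkpoint and preparing it with the same scripts used earlier in this thread (the local directory name is arbitrary, huggingface-cli is assumed to be installed, and I'm assuming setup_env.py can convert this checkpoint and writes the GGUF under the same ggml-model-i2_s.gguf naming pattern seen above):
huggingface-cli download 1bitLLM/bitnet_b1_58-3B --local-dir models/bitnet_b1_58-3B
python setup_env.py -md models/bitnet_b1_58-3B -q i2_s
python run_inference.py -m models/bitnet_b1_58-3B/ggml-model-i2_s.gguf -p "Once upon a time, there was a girl who" -n 128 -temp 0 -t 18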
When running "python setup_env.py -md models/Llama3-8B-1.58-100B-tokens -q i2_s" on my Windows VM, I get this error:
Traceback (most recent call last):
File "C:\BitNet\setup_env.py", line 202, in
What is missing? Any clue?
Please try the latest model on HF: https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf
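A sketch of what that could look like with the same -md / -q workflow used earlier in the thread (repo id taken from the link above, the local directory name is arbitrary, and I'm assuming the packaged file keeps the ggml-model-i2_s.gguf naming pattern):
huggingface-cli download microsoft/bitnet-b1.58-2B-4T-gguf --local-dir models/bitnet-b1.58-2B-4T
python setup_env.py -md models/bitnet-b1.58-2B-4T -q i2_s
python run_inference.py -m models/bitnet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Who are you?" -n 128 -temp 0 -t 18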