FP16 and 4-bit quantized models both produce garbage output on M1 8GB
Both ggml-model-q4_0 and ggml-model-f16 produce garbage output on my M1 Air 8GB, using the 7B LLaMA model. I've seen reports of the quantized model having problems, but I doubt quantization is the issue, since the non-quantized model produces the same output.
➜ llama.cpp git:(master) ./main -m ./models/7B/ggml-model-f16.bin -p "Building a website can be done in 10 simple steps:" -t 8 -n 512
main: seed = 1678812348
llama_model_load: loading model from './models/7B/ggml-model-f16.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 1
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: ggml ctx size = 13365.09 MB
llama_model_load: memory_size = 512.00 MB, n_mem = 16384
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-f16.bin'
llama_model_load: ........... done
llama_model_load: model size = 4274.30 MB / num tensors = 90
system_info: n_threads = 8 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
main: prompt: 'Building a website can be done in 10 simple steps:'
main: number of tokens in prompt = 15
1 -> ''
8893 -> 'Build'
292 -> 'ing'
263 -> ' a'
4700 -> ' website'
508 -> ' can'
367 -> ' be'
2309 -> ' done'
297 -> ' in'
29871 -> ' '
29896 -> '1'
29900 -> '0'
2560 -> ' simple'
6576 -> ' steps'
29901 -> ':'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
Building a website can be done in 10 simple steps:Administrationistrunkoveryabasepair tou cross deprecatedinition holes prvindor^C
That's nothing to do with this project.
@v3ss0n could you please elaborate?
I started messing with this project two hours ago and had exactly the same issue: completely mangled output. It turns out that, for me, the problem was that I compiled it with Cygwin. After a clean re-compile with MinGW64 via w64devkit the problem disappeared. The tip-off was that the token list (the input prompt deconstructed into individual tokens) sometimes didn't match the prompt (occasionally it was truncated). In your case it looks fine, but try altering the prompt by adding and removing words. If you see the tokenized prompt not matching your prompt, then perhaps you have the same problem with your compiler... or something. Honestly, I don't know what could go so wrong that it compiles without errors into a broken binary. The wild west of C and pluses.
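If you want to check tokenization independently of main, one option is to run the original SentencePiece model over the same prompt and compare the ids against what main prints. A minimal sketch, assuming the stock tokenizer.model that ships with the weights sits under models/ (that path is an assumption; note from the log above that main prepends the BOS token, id 1, which SentencePiece omits by default):

```python
import sentencepiece as spm

# Assumed path -- point this at the tokenizer.model that came with the weights.
sp = spm.SentencePieceProcessor(model_file="models/tokenizer.model")

prompt = "Building a website can be done in 10 simple steps:"
ids = sp.encode(prompt)  # list of token ids; no BOS token by default

print("     1 -> ''  (BOS, prepended by main)")
for i in ids:
    # SentencePiece marks word boundaries with U+2581; swap it for a space
    piece = sp.id_to_piece(i).replace("\u2581", " ")
    print(f"{i:6d} -> '{piece}'")
```

If the ids printed here diverge from the list main shows for the same prompt, the binary (not the model) is the likely culprit.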
I found the solution to my issue! Make sure that when you run the convert-pth-to-ggml.py script, it completes and tells you Done. Output file: I was getting the error OSError: 45088768 requested and 31184896 written but didn't pay attention because everything else looked to be working. The f16 model converts to a roughly 13 GB file, so the fact that I only had 10 GB of free storage was causing the problem.
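For anyone hitting the same thing, a quick sanity check before blaming the model is to confirm the converted file actually reached its full size and that the drive still has room. A minimal sketch, where the path and the ~13 GB expected size are assumptions based on the 7B f16 conversion above:

```python
import os
import shutil

# Assumed path -- adjust to your models/ layout.
MODEL = "models/7B/ggml-model-f16.bin"
EXPECTED_GB = 13.0  # the 7B f16 conversion lands around 13 GB

size_gb = os.path.getsize(MODEL) / 1024**3
free_gb = shutil.disk_usage(os.path.dirname(MODEL) or ".").free / 1024**3

print(f"model file: {size_gb:.1f} GB, free space: {free_gb:.1f} GB")
if size_gb < EXPECTED_GB * 0.95:
    print("File looks truncated -- free up disk space and re-run convert-pth-to-ggml.py")
```

A truncated model file loads without an obvious error but produces exactly this kind of gibberish, which is why the OSError during conversion is easy to miss.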