Llama2 quantized q5_1
I am getting this error:
llama.cpp: loading model from /Documents/Proj/delta/llama-2-7b-chat/ggml-model-q5_1.bin
error loading model: unrecognized tensor type 14
llama_init_from_file: failed to load model
node:internal/process/promises:289
          triggerUncaughtException(err, true /* fromPromise */);
          ^

[Error: Failed to initialize LLama context from file: /Documents/Proj/delta/llama-2-7b-chat/ggml-model-q5_1.bin] {
  code: 'GenericFailure'
}
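To see what the quantize step actually produced, it can help to dump the file header before loading it. Below is a rough sketch, not part of the llama-node API: it assumes the GGJT on-disk layout llama.cpp used at the time (4-byte magic, a uint32 version, then seven uint32 hyperparameters ending in the overall ftype), and the ftype table is my reading of llama.h, so treat both as assumptions:

// inspect-ggml.js — dump the header of a llama.cpp .bin model.
// Usage: node inspect-ggml.js ./llama-2-7b-chat/ggml-model-q5_1.bin
import { openSync, readSync, closeSync } from "fs";

// ftype values as listed in llama.h at the time (assumed, possibly incomplete).
const FTYPES = {
  0: "F32", 1: "F16", 2: "Q4_0", 3: "Q4_1",
  7: "Q8_0", 8: "Q5_0", 9: "Q5_1",
};

const fd = openSync(process.argv[2], "r");
const buf = Buffer.alloc(36); // magic + version + 7 uint32 hyperparameters
readSync(fd, buf, 0, buf.length, 0);
closeSync(fd);

const magic = buf.readUInt32LE(0);  // 0x67676a74 spells "ggjt"
const version = buf.readUInt32LE(4);
const ftype = buf.readUInt32LE(32); // hparams: n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype

console.log(`magic:   0x${magic.toString(16)} ${magic === 0x67676a74 ? "(ggjt)" : "(not ggjt?)"}`);
console.log(`version: ${version}`);
console.log(`ftype:   ${ftype} (${FTYPES[ftype] ?? "unknown to this table"})`);

If the version or ftype printed here is newer than what the llama.cpp bundled in llama-node understands, loading will fail exactly like the error above.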
My index.js:

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

const model = path.resolve(process.cwd(), "./llama-2-7b-chat/ggml-model-q5_1.bin");
const llama = new LLM(LLamaCpp);

const config = {
  modelPath: model,
  enableLogging: false,
  nCtx: 1024,
  seed: 0,
  f16Kv: false,
  logitsAll: false,
  vocabOnly: false,
  useMlock: false,
  embedding: false,
  useMmap: true,
  nGpuLayers: 0,
};

const run = async () => {
  await llama.load(config);
  await llama.createCompletion(
    {
      prompt: "My favorite movie is",
      nThreads: 4,
      nTokPredict: 1024,
      topK: 40,
      topP: 0.1,
      temp: 0.3,
      repeatPenalty: 1,
    },
    (response) => {
      process.stdout.write(response.token);
    }
  );
};

run();
It worked before I quantized the model. I was hoping quantization would make it faster, because inference is very slow right now.
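On the speed point: besides quantization, thread count has a big effect. The nThreads: 4 above is hard-coded; a hedged starting point is to derive it from the machine's core count, as in this sketch (the halving assumes two logical cores per physical core, which may not match your CPU):

import os from "os";

// os.cpus() counts logical cores; llama.cpp tends to scale with
// physical cores, so halve on hyperthreaded machines (an assumption).
const nThreads = Math.max(1, Math.floor(os.cpus().length / 2));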
Got it running by using the .bin file from here: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main
I had no luck generating the q5_1 file myself following these instructions: https://github.com/ggerganov/llama.cpp#prepare-data--run (my guess is that the current quantize tool writes some tensors in the newer k-quant formats, where type 14 would be Q6_K, which the llama.cpp version bundled in llama-node does not recognize yet).
If this is a common problem, maybe the docs could point people toward downloading a prequantized file from TheBloke directly.
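For anyone else who lands here, something like this downloads the prebuilt q5_1 file on Node 18+ (which has a global fetch). The exact filename is my assumption from the repo's file listing, so verify it on the tree page above:

// download-model.js — fetch a prequantized model from TheBloke's repo.
import { createWriteStream, mkdirSync } from "fs";
import { Readable } from "stream";
import { pipeline } from "stream/promises";

// Filename assumed from the repo listing — double-check it on the tree page.
const url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q5_1.bin";

mkdirSync("./llama-2-7b-chat", { recursive: true });
const res = await fetch(url);
if (!res.ok) throw new Error(`download failed: ${res.status} ${res.statusText}`);
await pipeline(Readable.fromWeb(res.body), createWriteStream("./llama-2-7b-chat/ggml-model-q5_1.bin"));
console.log("done");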