Llama2 quantized q5_1
I am getting this error:
llama.cpp: loading model from /Documents/Proj/delta/llama-2-7b-chat/ggml-model-q5_1.bin
error loading model: unrecognized tensor type 14
llama_init_from_file: failed to load model
node:internal/process/promises:289
          triggerUncaughtException(err, true /* fromPromise */);
          ^

[Error: Failed to initialize LLama context from file: /Documents/Proj/delta/llama-2-7b-chat/ggml-model-q5_1.bin] {
  code: 'GenericFailure'
}
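To see what the quantize step actually produced, it can help to dump the file header before loading it. Below is a rough sketch, not part of the llama-node API: it assumes the GGJT on-disk layout llama.cpp used at the time (4-byte magic, a uint32 version, then seven uint32 hyperparameters ending in the overall ftype), and the ftype table is my reading of llama.h, so treat both as assumptions:

// inspect-ggml.js — dump the header of a llama.cpp .bin model.
// Usage: node inspect-ggml.js ./llama-2-7b-chat/ggml-model-q5_1.bin
import { openSync, readSync, closeSync } from "fs";

// ftype values as listed in llama.h at the time (assumed, possibly incomplete).
const FTYPES = {
  0: "F32", 1: "F16", 2: "Q4_0", 3: "Q4_1",
  7: "Q8_0", 8: "Q5_0", 9: "Q5_1",
};

const fd = openSync(process.argv[2], "r");
const buf = Buffer.alloc(36); // magic + version + 7 uint32 hyperparameters
readSync(fd, buf, 0, buf.length, 0);
closeSync(fd);

const magic = buf.readUInt32LE(0);  // 0x67676a74 spells "ggjt"
const version = buf.readUInt32LE(4);
const ftype = buf.readUInt32LE(32); // hparams: n_vocab, n_embd, n_mult, n_head, n_layer, n_rot, ftype

console.log(`magic:   0x${magic.toString(16)} ${magic === 0x67676a74 ? "(ggjt)" : "(not ggjt?)"}`);
console.log(`version: ${version}`);
console.log(`ftype:   ${ftype} (${FTYPES[ftype] ?? "unknown to this table"})`);

If the version or ftype printed here is newer than what the llama.cpp bundled in llama-node understands, loading will fail exactly like the error above.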
My index.js:

import { LLM } from "llama-node";
import { LLamaCpp } from "llama-node/dist/llm/llama-cpp.js";
import path from "path";

const model = path.resolve(process.cwd(), "./llama-2-7b-chat/ggml-model-q5_1.bin");
const llama = new LLM(LLamaCpp);

const config = {
  modelPath: model,
  enableLogging: false,
  nCtx: 1024,
  seed: 0,
  f16Kv: false,
  logitsAll: false,
  vocabOnly: false,
  useMlock: false,
  embedding: false,
  useMmap: true,
  nGpuLayers: 0,
};

const run = async () => {
  await llama.load(config);
  await llama.createCompletion(
    {
      prompt: "My favorite movie is",
      nThreads: 4,
      nTokPredict: 1024,
      topK: 40,
      topP: 0.1,
      temp: 0.3,
      repeatPenalty: 1,
    },
    (response) => {
      process.stdout.write(response.token);
    }
  );
};

run();
It worked before I quantized the model. I was hoping quantization would make it faster, because inference is very slow right now.
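On the speed point: besides quantization, thread count has a big effect. The nThreads: 4 above is hard-coded; a hedged starting point is to derive it from the machine's core count, as in this sketch (the halving assumes two logical cores per physical core, which may not match your CPU):

import os from "os";

// os.cpus() counts logical cores; llama.cpp tends to scale with
// physical cores, so halve on hyperthreaded machines (an assumption).
const nThreads = Math.max(1, Math.floor(os.cpus().length / 2));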
Got it running by using the .bin file from here: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main
I had no luck generating the q5_1 file myself following these instructions: https://github.com/ggerganov/llama.cpp#prepare-data--run (my guess is that the current quantize tool writes some tensors in the newer k-quant formats, where type 14 would be Q6_K, which the llama.cpp version bundled in llama-node does not recognize yet).
If this is a common problem, maybe the docs could point people toward downloading a prequantized file from TheBloke directly.
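For anyone else who lands here, something like this downloads the prebuilt q5_1 file on Node 18+ (which has a global fetch). The exact filename is my assumption from the repo's file listing, so verify it on the tree page above:

// download-model.js — fetch a prequantized model from TheBloke's repo.
import { createWriteStream, mkdirSync } from "fs";
import { Readable } from "stream";
import { pipeline } from "stream/promises";

// Filename assumed from the repo listing — double-check it on the tree page.
const url = "https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q5_1.bin";

mkdirSync("./llama-2-7b-chat", { recursive: true });
const res = await fetch(url);
if (!res.ok) throw new Error(`download failed: ${res.status} ${res.statusText}`);
await pipeline(Readable.fromWeb(res.body), createWriteStream("./llama-2-7b-chat/ggml-model-q5_1.bin"));
console.log("done");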