Henri Vasserman
I think most of the bigger buffers are RO; RW is used for reading the results back from the GPU, and those are usually smaller (`n_batch * n_embd`).
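For illustration, here is a minimal sketch of that split using the OpenCL C API. The helper name, struct, and sizes are assumptions for the example, not llama.cpp's actual allocation code:

```cpp
#include <CL/cl.h>
#include <cstddef>

struct gpu_buffers {
    cl_mem weights;  // large, only read by the kernels
    cl_mem results;  // small, written on the GPU and read back on the host
};

// Hypothetical helper, not taken from llama.cpp.
gpu_buffers make_buffers(cl_context ctx, size_t weight_bytes,
                         size_t n_batch, size_t n_embd) {
    cl_int err = 0;
    gpu_buffers b{};

    // The big weight tensors are never written by the device,
    // so they can live in read-only buffers.
    b.weights = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                               weight_bytes, nullptr, &err);

    // The result buffer is written by the kernel and read back with
    // clEnqueueReadBuffer; it only needs n_batch * n_embd floats.
    b.results = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                               n_batch * n_embd * sizeof(float),
                               nullptr, &err);
    return b;
}
```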
How did you compile llama.cpp? What compiler? Did you get any errors?
What was the command that you used to quantize? It should be `./build/bin/quantize ./models/ggml-model-f16.bin ./models/ggml-model-q4_0.bin 2` or similar, assuming you are in the llama.cpp root and that your CMake build...
Alpaca uses special formatting to separate instructions and data. You can see the [templates used for tloen/alpaca-lora](https://github.com/tloen/alpaca-lora/blob/main/templates/alpaca.json). There are two variants, one with just an instruction, and one with an instruction and...
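For reference, the instruction-only variant can be built roughly like this. This is a sketch with a hypothetical helper; see the linked `alpaca.json` for the exact wording:

```cpp
#include <string>

// Hypothetical helper that formats a prompt in the instruction-only
// Alpaca style from the tloen/alpaca-lora template linked above.
std::string alpaca_prompt(const std::string & instruction) {
    return
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n" + instruction + "\n\n"
        "### Response:\n";
}
```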
The provided Windows build with CLBlast using OpenCL should work, but I wouldn't expect any significant performance gains from integrated graphics.
> copied OpenBLAS required file in the folders
> then I followed the "Intel MKL" section

Which one did you actually use? Did it actually find the Intel MKL library?...
> Maybe the GPU only accelerates 16bit operations, so the CPU is faster because it can run the 4bit stuff...?

The OpenCL code in llama.cpp can run 4-bit generation on...
> the good thing is that you don't need to copy data vram->ram to access the data on cpu, it's just always shared by both

llama.cpp is not optimized for...
If you had a dedicated GPU, bringing prompt evaluation down below 60 s for 1000 tokens would be very much doable.
There is [train-text-from-scratch](https://github.com/ggerganov/llama.cpp/tree/master/examples/train-text-from-scratch), but it's still early days.