Henri Vasserman

249 comments by Henri Vasserman

I think most of the bigger buffers are RO; RW is only used for reading results back from the GPU, and those buffers are usually smaller (`n_batch * n_embd`).
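
Roughly how that split looks with the plain OpenCL API (a minimal sketch, not llama.cpp's actual code; all names here are illustrative):

```c
/*
 * Sketch: the large weight buffer is read-only for the device, while the
 * small result buffer (n_batch * n_embd floats) is read/write so the host
 * can read it back after the kernel runs.
 */
#include <CL/cl.h>
#include <stddef.h>

void run_layer_example(cl_context ctx, cl_command_queue queue,
                       const float *weights, size_t weight_elems,
                       float *out, size_t n_batch, size_t n_embd) {
    cl_int err;

    /* Big buffer: the device only ever reads the weights. */
    cl_mem d_weights = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                      weight_elems * sizeof(float),
                                      (void *) weights, &err);

    /* Small buffer: written by the kernel, read back by the host. */
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                  n_batch * n_embd * sizeof(float), NULL, &err);

    /* ... set kernel args and enqueue the kernel here ... */

    /* Blocking read of the (small) results back into host memory. */
    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0,
                        n_batch * n_embd * sizeof(float), out, 0, NULL, NULL);

    clReleaseMemObject(d_out);
    clReleaseMemObject(d_weights);
}
```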

How did you compile llama.cpp? What compiler? Did you get any errors?

What was the command that you used to quantize? It should be `./build/bin/quantize ./models/ggml-model-f16.bin ./models/ggml-model-q4_0.bin 2` or similar, assuming you are in the llama.cpp root and that your CMake build...
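
For reference, the full build-and-quantize sequence would look roughly like this (a sketch, assuming default paths and that the model filenames match your setup):

```sh
# configure and build from the llama.cpp root
cmake -B build
cmake --build build --config Release

# convert the f16 model to 4-bit (ftype 2 = q4_0)
./build/bin/quantize ./models/ggml-model-f16.bin ./models/ggml-model-q4_0.bin 2
```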

Alpaca uses special formatting to separate instructions and data. You can see the [templates used for tloen/alpaca-lora](https://github.com/tloen/alpaca-lora/blob/main/templates/alpaca.json). There are two variants, one with just an instruction, and one with an instruction and...
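
The instruction-only variant looks roughly like this (from memory, so check the linked template file for the exact wording):

```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
```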

The provided Windows build with CLBlast using OpenCL should work, but I wouldn't expect any significant performance gains from integrated graphics.

> copied OpenBLAS required file in the folders
> then I followed the "Intel MKL" section

Which one did you actually use? Did it actually find the Intel MKL library?...
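
If the goal was Intel MKL through the generic BLAS path, the configure step would look roughly like this (an assumption on my part; the exact options are in the README's BLAS section, and MKL has to be installed and on the path for CMake to find it):

```sh
# sketch: build with BLAS enabled and MKL selected as the vendor
cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=Intel10_64lp
cmake --build build --config Release
```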

> Maybe the GPU only accelerates 16bit operations, so the CPU is faster because it can run the 4bit stuff...?

The OpenCL code in llama.cpp can run 4-bit generation on...

> the good thing is that you don't need to copy data vram->ram to access the data on cpu, it's just always shared by both

llama.cpp is not optimized for...

If you had a dedicated GPU, bringing prompt evaluation below 60 s for 1000 tokens would be very much doable.
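
For example, with a cuBLAS or CLBlast build, offloading layers and using a larger batch for prompt processing would look roughly like this (a sketch; the right values depend on your GPU and model):

```sh
# offload 32 layers to the GPU and evaluate the prompt in batches of 512 tokens
./build/bin/main -m ./models/ggml-model-q4_0.bin -f prompt.txt -ngl 32 -b 512
```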

There is [train-text-from-scratch](https://github.com/ggerganov/llama.cpp/tree/master/examples/train-text-from-scratch), but it's still early days.