Alan Gray
Alan Gray
Great work everyone on llama.cpp! I am Alan Gray, a developer technology engineer from NVIDIA, and have developed an optimization to allow the CUDA kernels associated with the generation of...
Introduces caching of GGML graph to avoid unnecessary full rebuild between each token. KV cache parameters, which change with each token, are updated directly in cached GGML graph. Can be...
When using a GPU backend, for each token evaluation there exists not only computation on the GPU but also significant CPU computation which can potentially be optimized. Here are some...