Alan Gray

Results: 4 issues by Alan Gray

Great work everyone on llama.cpp! I am Alan Gray, a developer technology engineer at NVIDIA, and I have developed an optimization to allow the CUDA kernels associated with the generation of... (a sketch of the general idea follows this entry)

See Issue #6763

performance
need feedback
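
The issue excerpt above is truncated, but the optimization it describes concerns the CUDA kernels launched while generating tokens. Assuming it is along the lines of reducing per-kernel launch overhead with CUDA Graphs (capture the per-token launch sequence once, then replay it), a minimal, self-contained sketch of that general technique is below. The kernel, layer count, and sizes are placeholders, not llama.cpp code.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder kernel standing in for the per-token decode kernels.
__global__ void dummy_kernel(float * x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 1.0001f;
}

int main() {
    const int n = 1 << 20;
    float * d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the sequence of kernel launches for one token into a graph.
    cudaGraph_t     graph;
    cudaGraphExec_t graph_exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int layer = 0; layer < 32; ++layer) {
        dummy_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    }
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graph_exec, graph, 0); // CUDA 12 signature

    // Replay the whole captured sequence with a single launch per token,
    // instead of paying the CPU launch cost for every individual kernel.
    for (int token = 0; token < 128; ++token) {
        cudaGraphLaunch(graph_exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graph_exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```

The payoff of this pattern is that the host issues one graph launch per token rather than dozens of kernel launches, which matters when individual kernels are short relative to launch latency.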

Introduces caching of the GGML graph to avoid an unnecessary full rebuild for each token. The KV cache parameters, which change with each token, are updated directly in the cached GGML graph. Can be... (a sketch of the pattern follows this entry)

Review Complexity: Medium
ggml
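
A minimal, hypothetical sketch of the caching idea described above: build the compute graph once, keep it, and on later tokens only patch the parameters that change (such as the KV-cache position) instead of rebuilding every node. The structures and names here are illustrative stand-ins, not the real ggml API.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-ins for ggml's graph structures.
struct node {
    std::string          op;      // e.g. "attn", "mul_mat"
    std::vector<int64_t> params;  // per-token parameters such as KV positions
};

struct cached_graph {
    std::vector<node> nodes;      // topology built once, reused every token
    bool              built = false;
};

// Expensive path: construct every node of the graph from scratch.
static void build_full_graph(cached_graph & g, int n_layers) {
    g.nodes.clear();
    for (int i = 0; i < n_layers; ++i) {
        g.nodes.push_back({"attn",    {/*kv_pos=*/0}});
        g.nodes.push_back({"mul_mat", {}});
    }
    g.built = true;
}

// Cheap path: overwrite only the parameters that change between tokens,
// leaving the graph topology untouched.
static void update_kv_params(cached_graph & g, int64_t kv_pos) {
    for (auto & n : g.nodes) {
        if (n.op == "attn") {
            n.params[0] = kv_pos;
        }
    }
}

int main() {
    cached_graph g;
    const int n_layers = 32;

    for (int64_t token = 0; token < 8; ++token) {
        if (!g.built) {
            build_full_graph(g, n_layers);  // first token: full build
        } else {
            update_kv_params(g, token);     // later tokens: patch in place
        }
        // ... submit g to the backend for computation ...
        std::printf("token %lld: %zu nodes ready\n", (long long) token, g.nodes.size());
    }
    return 0;
}
```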

When using a GPU backend, each token evaluation involves not only computation on the GPU but also significant CPU computation, which can potentially be optimized. Here are some... (a rough timing sketch follows this entry)

performance
research 🔬
stale
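
One way to see the CPU-side cost this issue points at is to time the host work done for each token separately from the GPU work. The sketch below is generic instrumentation, not llama.cpp code: the kernel and the host "preparation" loop are placeholders for the real per-token GPU and CPU work.

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the per-token GPU work.
__global__ void decode_kernel(float * x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

// Placeholder for the per-token CPU work (graph building, scheduling, ...).
static void host_side_preparation() {
    volatile float acc = 0.0f;
    for (int i = 0; i < 1000000; ++i) acc += 0.5f;
}

int main() {
    const int n = 1 << 20;
    float * d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemset(d_x, 0, n * sizeof(float));

    using clock = std::chrono::steady_clock;
    double cpu_ms = 0.0, total_ms = 0.0;

    for (int token = 0; token < 64; ++token) {
        auto t0 = clock::now();
        host_side_preparation();           // CPU work before the launch
        auto t1 = clock::now();
        decode_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
        cudaDeviceSynchronize();           // wait for the GPU work
        auto t2 = clock::now();

        cpu_ms   += std::chrono::duration<double, std::milli>(t1 - t0).count();
        total_ms += std::chrono::duration<double, std::milli>(t2 - t0).count();
    }

    std::printf("CPU share of each token: %.1f%%\n", 100.0 * cpu_ms / total_ms);
    cudaFree(d_x);
    return 0;
}
```

If the CPU share reported here is large, reducing host-side graph building and launch overhead (as in the two issues above) directly improves tokens per second.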