Alan Gray issues

Results: 4 issues of Alan Gray

New optimization from NVIDIA to use CUDA Graphs in llama.cpp

Great work everyone on llama.cpp! I am Alan Gray, a developer technology engineer from NVIDIA, and have developed an optimization to allow the CUDA kernels associated with the generation of...

DRAFT: Introduction of CUDA Graphs to LLama.cpp

See Issue #6763

performance

need feedback

ggml: avoid rebuild of GGML graph for each token (#7456)

Introduces caching of GGML graph to avoid unnecessary full rebuild between each token. KV cache parameters, which change with each token, are updated directly in cached GGML graph. Can be...

Review Complexity : Medium

ggml

Optimisation of per-token CPU activities for GPU inference

When using a GPU backend, for each token evaluation there exists not only computation on the GPU but also significant CPU computation which can potentially be optimized. Here are some...

performance

research 🔬

stale