Ankan Banerjee

Results: 2 issues by Ankan Banerjee

Slightly more than 2x speedup (for large batch sizes) on supported hardware, without much loss of precision.

lc0

Add simple cuda implementation for llama2 inference

* < 750 lines of code. Idea is to keep it as simple as possible.
* Decided to use FP16 to make llama-7b...