Ankan Banerjee

Results 13 comments of Ankan Banerjee

@hsntgm I didn't get what you mean. We are already using beta[0] = 0.0 for fully-connected layers, and for convolutions that don't need a skip connection. For convolutions that need...

> _Cherry-picked a pending pull request to add support for chat (much easier to use and test)._

Just realized that we already have this functionality in the latest code (was...

> I tested the compiled binary and I could not see any outputs from the transformer, even though the GPU showed that the file was loaded.
>
> I ran...

> @ankan-ban Wouldn't it be better to make the computations in FP16 as well? Currently it has lots of conversions.
>
> BTW, I am learning a lot with your...

Sorry about the issues. I was testing with code that was a bit old (an old tokenizer and potentially incorrect code for handling prompts). I am going to sync to the latest...

I just synced llama2.cu with the latest run.c. The issues you were facing should now be fixed. Tested with 4 models:

```
>llama2.cu.exe stories15m.bin 0 256 "once upon a time "
```
...

@richinseattle if you are measuring performance, you may want to try this branch: https://github.com/ankan-ban/llama2.cu/tree/opt

I am working on optimizations in that branch and will decide which ones are worth merging to...

This branch is no longer actively maintained. If you are interested, you can use this repo, which uses INT4 weight quantization for ~3.3x more speed and a 3x reduction in memory...

Great job! I think you can get some more performance by optimizing mat_vec_q8_kernel() a bit, i.e. by loading multiple int8 elements at a time (I think loading just 4 int8...

I tried quantize.c at my end (on a Windows system) and it crashes for the llama7b model (when quantizing the q-matrix for the 9th layer). I still need to figure out...