Ankan Banerjee

Results 13 comments of Ankan Banerjee

@hsntgm I didn't get what you mean. We are already using beta[0] = 0.0 for fully-connected layers, and for convolutions that don't need a skip connection. For convolutions that need...

> _Cherry-picked a pending pull request to add support for chat (much easier to use and test)._

Just realized that we already have this functionality in the latest code (was...

> I tested the compiled binary and I could not see any outputs from the transformer, even though the GPU showed that the file was loaded.
>
> I ran...

> @ankan-ban Wouldn't it be better to make the computations in FP16 as well? Currently it has lots of conversions.
>
> BTW, I am learning a lot with your...

Sorry about the issues. I was testing with code that was a bit old (an old tokenizer and potentially incorrect code for handling prompts). I am going to sync to the latest...

I just synced llama2.cu with the latest run.c. The issues you were facing should now be fixed. Tested with 4 models:

```
>llama2.cu.exe stories15m.bin 0 256 "once upon a time "
```
...

@richinseattle if you are measuring performance, you may want to try this branch: https://github.com/ankan-ban/llama2.cu/tree/opt

I am working on optimizations in that branch and will decide which ones are worth merging to...

This branch is no longer actively maintained. If you are interested, you can use this repo, which uses INT4 weight quantization for ~3.3x more speed and a 3x reduction in memory...

Great job! I think you can get some more performance by optimizing mat_vec_q8_kernel() a bit, i.e. by loading multiple int8 elements at a time (I think loading just 4 int8...

I tried quantize.c at my end (on a Windows system) and it crashes for the llama7b model (when quantizing the q-matrix for the 9th layer). I still need to figure out...