Oxi84

62 comments of Oxi84

Good to know. You are doing a great job. So is it now faster or slower than fp16 for the GPT-J case? I will try it in a few days myself. So far I...

For me it takes around 250 seconds to generate 1000 words on an RTX 3090 when using 8-bit without `int8_threshold=0`. When using `int8_threshold=0`, the generation time is 88 seconds. For 500...
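
For reference, a minimal sketch of the 8-bit setup being timed here, assuming the current transformers/bitsandbytes integration, where the `int8_threshold` argument mentioned above appears to correspond to `llm_int8_threshold`:

```python
# Minimal sketch: GPT-J in 8-bit with the outlier threshold set to 0
# (assumes transformers with bitsandbytes installed; `llm_int8_threshold`
# is assumed to match the `int8_threshold` argument named in the comment).
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "EleutherAI/gpt-j-6B"
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=0.0,  # 0 disables mixed-precision outlier handling
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```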

It is awesome you made this. Chinese GLM even works at 4 bits. https://github.com/THUDM/GLM-130B It seems to be the best language model so far.

Yes, this one is pretty fast, around 2x faster in 4 bits than fp16. But a faster QLoRA would be better, as it supports most available models. With GPTQ you can pretty...
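
A minimal sketch of the QLoRA-style 4-bit loading referred to here, assuming the transformers/bitsandbytes integration; the model name is a stand-in:

```python
# Minimal sketch: 4-bit (NF4) loading of the kind QLoRA uses, via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4, the QLoRA data type
    bnb_4bit_compute_dtype=torch.float16,  # run compute in fp16
)
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",                 # stand-in model
    quantization_config=quant_config,
    device_map="auto",
)
```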

For me it is the same thing: it is around 10 percent slower. I run a batch size of around 10-15, beam size is 4, and the sequence length is on average 15-20. Probably the...
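
A minimal sketch of the generation settings described here (batched beam search, beam size 4, short outputs); "gpt2" is a small stand-in since no model is named:

```python
# Minimal sketch of the settings above: batch of ~12 prompts, num_beams=4,
# outputs in the 15-20 token range.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token
tokenizer.padding_side = "left"            # left-pad for decoder-only generation
model = AutoModelForCausalLM.from_pretrained(model_id)

prompts = ["An example prompt"] * 12       # batch size around 10-15
inputs = tokenizer(prompts, return_tensors="pt", padding=True)
outputs = model.generate(
    **inputs,
    num_beams=4,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)
```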

I tried on another CPU and now it is 2x slower (without quantisation) than PyTorch, with the same settings as above: I run a batch size of around 10-15, beam size is...

It does work faster when using smaller batches and fewer cores. It is probably optimal to divide all CPU cores using PyTorch's thread-number setting and then use...
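
A minimal sketch of the core-splitting idea, using `torch.set_num_threads`; the 4-way split is an illustrative assumption:

```python
# Minimal sketch: cap PyTorch's intra-op threads at a fraction of the CPU
# cores so several worker processes can run side by side.
import os
import torch

num_workers = 4                 # hypothetical number of parallel workers
cores = os.cpu_count() or 1
torch.set_num_threads(max(1, cores // num_workers))
```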

> custom logits processors

@iiglesias-asapp I see your point - controlling at a token level may be advantageous. Nevertheless, i) without a specific common use case in mind and...
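
For context, a minimal sketch of a custom logits processor of the kind under discussion, using the `LogitsProcessor` interface from transformers; the banned-token behaviour is an illustrative example:

```python
# Minimal sketch: a custom logits processor that masks a fixed set of token
# ids at every decoding step (the token-level control discussed above).
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class BlockTokensProcessor(LogitsProcessor):
    """Illustrative example: forbid a fixed set of token ids."""

    def __init__(self, banned_token_ids):
        self.banned_token_ids = list(banned_token_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        scores[:, self.banned_token_ids] = float("-inf")
        return scores

# Passed to generation as:
#   model.generate(**inputs,
#                  logits_processor=LogitsProcessorList([BlockTokensProcessor([13])]))
```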

It worked when I used the notebook (https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/pytorch/sequence_classification.ipynb) that goes along with the text - seems like it was updated, or I simply made some mistakes when copy-pasting...