TechxGenus

11 comments by TechxGenus

Thanks for your reply. Closing this issue.

This issue still seems to be unresolved. Inference with the AWQ model is back to normal now, but errors still occur when trying to quantize the Llama or Gemma models.
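For context, a minimal sketch of the standard AutoAWQ quantization flow I mean; the model path and quant config below are example values, not the exact ones from my run:

```
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Example model and config; substitute the Llama/Gemma checkpoint that fails for you.
model_path = "meta-llama/Meta-Llama-3-8B"
quant_path = "llama-3-8b-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantization is the step that errors out in my environment.
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized weights and tokenizer.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```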

I found the same problem. It occurs when n (the number of returned sequences) is set greater than 1, and it happens more often when GPU memory is limited. A simple solution...
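A sketch of the kind of setup that triggers it, assuming the n in question is vLLM's SamplingParams.n (the model name and GPU memory fraction below are placeholders):

```
from vllm import LLM, SamplingParams

# Placeholder model; the key points are n > 1 and a tight GPU memory budget.
llm = LLM(model="meta-llama/Llama-2-7b-hf", gpu_memory_utilization=0.5)

# n > 1 (multiple returned sequences per prompt) is the condition under which I see it.
params = SamplingParams(n=4, temperature=0.8, max_tokens=256)

outputs = llm.generate(["def max(arr):"], params)
for out in outputs:
    for seq in out.outputs:
        print(seq.text)
```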

I checked its architecture, and implementing basic quantization shouldn't be very hard. However, its position encoding is special (LongRoPE), so implementing the fused layers may need more work.
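For reference, the position-encoding setup is easy to inspect from the config via transformers; a quick sketch (the model id below is just an example of a model using LongRoPE-style scaling, not necessarily the one discussed here):

```
from transformers import AutoConfig

# Example: Phi-3 is one model family that uses LongRoPE-style rope scaling.
config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-128k-instruct", trust_remote_code=True)

# rope_scaling carries the long/short factors used by the special position encoding.
print(config.rope_scaling)
print(config.max_position_embeddings)
```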

I think this is the best way to scale for Cohere:

```
from .base import BaseAWQForCausalLM
from transformers.models.cohere.modeling_cohere import (
    CohereDecoderLayer as OldCohereDecoderLayer,
    CohereForCausalLM as OldCohereForCausalLM,
)

class CohereAWQForCausalLM(BaseAWQForCausalLM):
    layer_type...
```
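To make the shape of the full definition clearer, here is a rough sketch of how the class could continue (reusing the imports above), following the pattern of the other AutoAWQ model files; the method names and scaling groups are my assumptions and should be checked against the actual code:

```
class CohereAWQForCausalLM(BaseAWQForCausalLM):
    layer_type = "CohereDecoderLayer"
    max_seq_len_key = "max_position_embeddings"

    @staticmethod
    def get_model_layers(model: OldCohereForCausalLM):
        return model.model.layers

    @staticmethod
    def get_act_for_scaling(module: OldCohereDecoderLayer):
        return dict(is_scalable=False)

    @staticmethod
    def move_embed(model: OldCohereForCausalLM, device: str):
        model.model.embed_tokens = model.model.embed_tokens.to(device)

    @staticmethod
    def get_layers_for_scaling(module: OldCohereDecoderLayer, input_feat, module_kwargs):
        layers = []

        # Cohere runs attention and MLP in parallel off a single input_layernorm
        # (there is no post_attention_layernorm), so the attention input projections
        # and the MLP input projections share one scaling group.
        layers.append(
            dict(
                prev_op=module.input_layernorm,
                layers=[
                    module.self_attn.q_proj,
                    module.self_attn.k_proj,
                    module.self_attn.v_proj,
                    module.mlp.gate_proj,
                    module.mlp.up_proj,
                ],
                inp=input_feat["self_attn.q_proj"],
                module_kwargs=module_kwargs,
            )
        )

        # Attention output projection.
        layers.append(
            dict(
                prev_op=module.self_attn.v_proj,
                layers=[module.self_attn.o_proj],
                inp=input_feat["self_attn.o_proj"],
            )
        )

        # MLP output projection.
        layers.append(
            dict(
                prev_op=module.mlp.up_proj,
                layers=[module.mlp.down_proj],
                inp=input_feat["mlp.down_proj"],
            )
        )

        return layers
```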

Hi @casper-hansen, I tested it with transformers and it works well. It is marked as a draft because the fused-layer implementation is still missing. I don't have enough hardware to write and test it at the moment. Maybe...
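For anyone who wants to repeat the transformers-side check, a minimal sketch (the checkpoint path is a placeholder for a locally saved AWQ-quantized model):

```
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path to an AWQ-quantized checkpoint.
# Requires autoawq to be installed so transformers can load the AWQ weights.
quant_path = "cohere-awq"

tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")

inputs = tokenizer("def max(arr):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```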

Hi @casper-hansen, this PR has been tested in both modes and is ready to merge.

Amazing work! I initially tested Jamba-v0.1 on a machine with 500G RAM and it worked great!

```
./main -m ./Jamba-v0.1-hf-00001-of-00024.gguf -n 120 --prompt "def max(arr):" --temp 0
Log start
main:...
```

Although this issue has little impact on the training results, it significantly affects the reproducibility of experiments across different hardware configurations. I hope it can be resolved together with gradient accumulation. I...