ZW

11 comments of ZW

I'm also looking forward to Redis-based session support in Brubeck.

@j2labs I haven't tried brubeck.caching yet. As for a Redis-based cache, [retools](https://github.com/bbangert/retools) seems like a good choice.
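For anyone curious, here is a minimal sketch of what retools caching looks like, assuming a Redis instance on the default localhost port; the region name and function are arbitrary examples:

```python
# Minimal retools caching sketch; assumes Redis running on localhost:6379.
from retools.cache import CacheRegion, cache_region

# Declare a cache region; entries expire after 60 seconds.
CacheRegion.add_region("short_term", expires=60)

@cache_region("short_term")
def expensive_lookup(user_id):
    # Stand-in for a slow computation or database query.
    return {"user_id": user_id}

print(expensive_lookup(42))  # first call computes and stores in Redis
print(expensive_lookup(42))  # second call within 60s is served from Redis
```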

Some interesting results in https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/227. It seems that GPTQ with group size and act-order has a negative impact on inference performance.
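To quantify that, a rough timing probe along these lines would do. This is only a sketch, assuming AutoGPTQ is installed and a CUDA GPU is available; the checkpoint id is hypothetical:

```python
# Rough tokens/s probe for a GPTQ checkpoint; model id is hypothetical.
import time

import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/some-13B-GPTQ"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda:0")
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```

Running this against two checkpoints, one quantized with group size and act-order and one without, should make the difference visible.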

@edenhill It's application-specific. Basically `custom_hash(static_cast<...>(msg_opaque)) % partition_cnt` in C++.
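For illustration only, here is a minimal Python sketch of the same idea using confluent-kafka, computing the partition up front and passing it explicitly; the topic name, partition count, and hash are made up, and the real C++ version implements this inside `RdKafka::PartitionerCb`:

```python
# Sketch of application-specific partitioning: custom_hash(...) % partition_cnt.
from confluent_kafka import Producer

NUM_PARTITIONS = 8  # assumed partition count of the target topic

def custom_hash(opaque: bytes) -> int:
    # Placeholder for the application-specific hash function.
    return (sum(opaque) * 2654435761) & 0xFFFFFFFF

producer = Producer({"bootstrap.servers": "localhost:9092"})
routing_key = b"order-42"  # hypothetical per-message routing data (msg_opaque in C++)

producer.produce(
    "my-topic",
    value=b"payload",
    partition=custom_hash(routing_key) % NUM_PARTITIONS,
)
producer.flush()
```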

Thanks. The latest Docker image works.

@Narsil It's slower but has better accuracy. https://github.com/lm-sys/FastChat/blob/main/docs/gptq.md

@Narsil For a higher speedup of LLaMA models, you can check out the https://github.com/turboderp/exllama project. I tested it with two 13B models, both quantized with group size 128 and activation order...
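For reference, here is a loading sketch along the lines of exllama's bundled `example_basic.py`; the model directory is hypothetical and must contain `config.json`, `tokenizer.model`, and a GPTQ `.safetensors` shard:

```python
# Sketch based on exllama's example_basic.py; paths are hypothetical.
import glob
import os

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/models/llama-13b-4bit-128g"  # hypothetical checkpoint directory
config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
config.model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

model = ExLlama(config)                      # load the quantized weights
tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
cache = ExLlamaCache(model)                  # key/value cache for generation
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Hello, my name is", max_new_tokens=64))
```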

I got the following error when loading a [model](https://huggingface.co/TheBloke/robin-65B-v2-GPTQ) quantized with [auto-gptq](https://github.com/PanQiWei/AutoGPTQ):

> {"timestamp":"2023-06-20T02:07:57.613035Z","level":"ERROR","fields":{"message":"Shard 0 failed to start:\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in \n sys.exit(app())\n\n File...

@Narsil The model was quantized with groupsize=-1. `gptq_bits` and `gptq_groupsize` are easy to patch into the model files, but I don't know if `gptq_groupsize=[-1]` will be handled correctly. Another difference...
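For what it's worth, the patch is just two extra keys in the checkpoint's `config.json`. A minimal sketch, assuming the server reads them from there; the local path is hypothetical:

```python
# Patch gptq_bits / gptq_groupsize into a checkpoint's config.json.
import json
from pathlib import Path

config_path = Path("robin-65B-v2-GPTQ/config.json")  # hypothetical local path

config = json.loads(config_path.read_text())
config["gptq_bits"] = 4
config["gptq_groupsize"] = -1  # -1 = no grouping; unclear if handled correctly
config_path.write_text(json.dumps(config, indent=2))
```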