ZW

11 comments of ZW

I'm also looking forward to Redis-based session support in Brubeck.

@j2labs I haven't tried brubeck.caching yet. As for a Redis-based cache, [retools](https://github.com/bbangert/retools) seems like a good choice.
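For anyone curious, here is a minimal sketch of what retools caching looks like, assuming a Redis instance on the default localhost port; the region name and function are arbitrary examples:

```python
# Minimal retools caching sketch; assumes Redis running on localhost:6379.
from retools.cache import CacheRegion, cache_region

# Declare a cache region; entries expire after 60 seconds.
CacheRegion.add_region("short_term", expires=60)

@cache_region("short_term")
def expensive_lookup(user_id):
    # Stand-in for a slow computation or database query.
    return {"user_id": user_id}

print(expensive_lookup(42))  # first call computes and stores in Redis
print(expensive_lookup(42))  # second call within 60s is served from Redis
```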

Some interesting results in https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/227. It seems that GPTQ with group size and act-order has a negative impact on inference performance.
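To quantify that, a rough timing probe along these lines would do. This is only a sketch, assuming AutoGPTQ is installed and a CUDA GPU is available; the checkpoint id is hypothetical:

```python
# Rough tokens/s probe for a GPTQ checkpoint; model id is hypothetical.
import time

import torch
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_id = "TheBloke/some-13B-GPTQ"  # hypothetical checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda:0")
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```

Running this against two checkpoints, one quantized with group size and act-order and one without, should make the difference visible.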

@edenhill It's application-specific. Basically `custom_hash(static_cast<...>(msg_opaque)) % partition_cnt` in C++.
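For illustration only, here is a minimal Python sketch of the same idea using confluent-kafka, computing the partition up front and passing it explicitly; the topic name, partition count, and hash are made up, and the real C++ version implements this inside `RdKafka::PartitionerCb`:

```python
# Sketch of application-specific partitioning: custom_hash(...) % partition_cnt.
from confluent_kafka import Producer

NUM_PARTITIONS = 8  # assumed partition count of the target topic

def custom_hash(opaque: bytes) -> int:
    # Placeholder for the application-specific hash function.
    return (sum(opaque) * 2654435761) & 0xFFFFFFFF

producer = Producer({"bootstrap.servers": "localhost:9092"})
routing_key = b"order-42"  # hypothetical per-message routing data (msg_opaque in C++)

producer.produce(
    "my-topic",
    value=b"payload",
    partition=custom_hash(routing_key) % NUM_PARTITIONS,
)
producer.flush()
```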

Thanks. The latest Docker image works.

@Narsil It's slower but has better accuracy. https://github.com/lm-sys/FastChat/blob/main/docs/gptq.md

@Narsil For a higher speedup of LLaMA models, you can check out the https://github.com/turboderp/exllama project. I tested it with two 13B models, both quantized with group size 128 and activation order...
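For reference, here is a loading sketch along the lines of exllama's bundled `example_basic.py`; the model directory is hypothetical and must contain `config.json`, `tokenizer.model`, and a GPTQ `.safetensors` shard:

```python
# Sketch based on exllama's example_basic.py; paths are hypothetical.
import glob
import os

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

model_dir = "/models/llama-13b-4bit-128g"  # hypothetical checkpoint directory
config = ExLlamaConfig(os.path.join(model_dir, "config.json"))
config.model_path = glob.glob(os.path.join(model_dir, "*.safetensors"))[0]

model = ExLlama(config)                      # load the quantized weights
tokenizer = ExLlamaTokenizer(os.path.join(model_dir, "tokenizer.model"))
cache = ExLlamaCache(model)                  # key/value cache for generation
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Hello, my name is", max_new_tokens=64))
```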

I got the following error when loading a [model](https://huggingface.co/TheBloke/robin-65B-v2-GPTQ) quantized with [auto-gptq](https://github.com/PanQiWei/AutoGPTQ):

> {"timestamp":"2023-06-20T02:07:57.613035Z","level":"ERROR","fields":{"message":"Shard 0 failed to start:\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in \n sys.exit(app())\n\n File...

@Narsil The model was quantized with groupsize=-1. `gptq_bits` and `gptq_groupsize` are easy to patch into the model files, but I don't know if `gptq_groupsize=[-1]` will be handled correctly. Another difference...
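For what it's worth, the patch is just two extra keys in the checkpoint's `config.json`. A minimal sketch, assuming the server reads them from there; the local path is hypothetical:

```python
# Patch gptq_bits / gptq_groupsize into a checkpoint's config.json.
import json
from pathlib import Path

config_path = Path("robin-65B-v2-GPTQ/config.json")  # hypothetical local path

config = json.loads(config_path.read_text())
config["gptq_bits"] = 4
config["gptq_groupsize"] = -1  # -1 = no grouping; unclear if handled correctly
config_path.write_text(json.dumps(config, indent=2))
```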