twaka

4 comments by twaka

~Is this feature now usable with a non-arctic model?~ Installed from source and it works as expected, amazing!

1. Download a model, e.g. https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct
2. Put https://huggingface.co/Snowflake/snowflake-arctic-instruct/blob/main/quant_config.json into the model dir
3. ...
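A minimal sketch of those steps in Python, assuming vLLM's deepspeedfp quantization backend and `huggingface_hub` for the downloads; the local directory name and the final launch line are illustrative, not the exact setup from the comment:

```python
from huggingface_hub import hf_hub_download, snapshot_download
from vllm import LLM

# 1. Download the model into a local directory (path is illustrative).
model_dir = snapshot_download(
    "NousResearch/Meta-Llama-3-8B-Instruct",
    local_dir="./llama3-8b-instruct",
)

# 2. Drop the arctic quant_config.json into the model dir.
hf_hub_download(
    repo_id="Snowflake/snowflake-arctic-instruct",
    filename="quant_config.json",
    local_dir=model_dir,
)

# 3. Load with deepspeedfp quantization (assumed to be the relevant backend).
llm = LLM(model=model_dir, quantization="deepspeedfp")
```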

I'm also experiencing this. At least, if the number of prompt tokens exceeds the number of GPU blocks * block_size, the entire server stops working. Setting a smaller max_model_len seems to alleviate...
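A back-of-the-envelope check of that condition; the two inputs below are illustrative, with the real block count printed in vLLM's startup log:

```python
num_gpu_blocks = 2048  # hypothetical value from the "# GPU blocks" startup log
block_size = 16        # vLLM's default block size

# Total tokens the KV cache can hold at once.
kv_cache_capacity = num_gpu_blocks * block_size  # 32768 tokens

# A single prompt longer than this can never be scheduled, which is when
# the whole server appears to stop; capping max_model_len at or below this
# value (e.g. --max-model-len 32768) avoids hitting that state.
print(kv_cache_capacity)
```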

I think the increase in latency is expected until fp6_llm's kernel is integrated, since dequantize and matmul are not fused in the current deepspeedfp implementation.
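A toy sketch (not vLLM's actual code) of why the unfused path costs latency: the weight is first dequantized into a full-size temporary tensor, then a separate matmul runs over it, paying for two kernel launches and the intermediate memory traffic:

```python
import torch

def dequantize(qweight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Toy int8 -> float dequantization standing in for the FP6 format.
    return qweight.float() * scale

def unfused_linear(x, qweight, scale):
    w = dequantize(qweight, scale)  # kernel 1: materializes full-size fp32 weight
    return x @ w.t()                # kernel 2: plain matmul over the temporary

# A fused kernel (as in fp6_llm) instead dequantizes weight tiles in
# registers inside the matmul loop, skipping the temporary tensor and
# the extra kernel launch that the path above pays for.
x = torch.randn(4, 8)
qw = torch.randint(-128, 127, (16, 8), dtype=torch.int8)
y = unfused_linear(x, qw, torch.tensor(0.05))
```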

Could it be caused by the existence of a double BOS token? The Completion API automatically adds it while the Chat API doesn't (because it's included in the chat template).
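One way to check for that doubling, assuming a Llama-3 style tokenizer whose chat template already emits the BOS text; the model choice is illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct")

# The chat template's rendered string already starts with the BOS token.
rendered = tok.apply_chat_template(
    [{"role": "user", "content": "hi"}], tokenize=False
)

# Encoding that string with add_special_tokens=True (what a completion
# endpoint typically does) prepends a second BOS.
ids = tok(rendered, add_special_tokens=True).input_ids
print(ids[:2], tok.bos_token_id)  # two identical leading ids => doubled BOS
```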