twaka

4 comments by twaka

~Is this feature now usable with a non-arctic model?~ Installed from source and it works as expected, amazing!

1. Download a model, e.g. https://huggingface.co/NousResearch/Meta-Llama-3-8B-Instruct
2. Put https://huggingface.co/Snowflake/snowflake-arctic-instruct/blob/main/quant_config.json into the model dir
3. ...
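A minimal sketch of those steps in Python, assuming vLLM's deepspeedfp quantization backend and `huggingface_hub` for the downloads; the local directory name and the final launch line are illustrative, not the exact setup from the comment:

```python
from huggingface_hub import hf_hub_download, snapshot_download
from vllm import LLM

# 1. Download the model into a local directory (path is illustrative).
model_dir = snapshot_download(
    "NousResearch/Meta-Llama-3-8B-Instruct",
    local_dir="./llama3-8b-instruct",
)

# 2. Drop the arctic quant_config.json into the model dir.
hf_hub_download(
    repo_id="Snowflake/snowflake-arctic-instruct",
    filename="quant_config.json",
    local_dir=model_dir,
)

# 3. Load with deepspeedfp quantization (assumed to be the relevant backend).
llm = LLM(model=model_dir, quantization="deepspeedfp")
```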

I'm also experiencing this. At least, if the number of prompt tokens exceeds the number of GPU blocks * block_size, the entire server stops working. Setting a smaller max_model_len seems to alleviate...
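A back-of-the-envelope check of that condition; the two inputs below are illustrative, with the real block count printed in vLLM's startup log:

```python
num_gpu_blocks = 2048  # hypothetical value from the "# GPU blocks" startup log
block_size = 16        # vLLM's default block size

# Total tokens the KV cache can hold at once.
kv_cache_capacity = num_gpu_blocks * block_size  # 32768 tokens

# A single prompt longer than this can never be scheduled, which is when
# the whole server appears to stop; capping max_model_len at or below this
# value (e.g. --max-model-len 32768) avoids hitting that state.
print(kv_cache_capacity)
```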

I think the increase in latency is expected until fp6_llm's kernel is integrated, since dequantize and matmul are not fused in the current deepspeedfp implementation.
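A toy sketch (not vLLM's actual code) of why the unfused path costs latency: the weight is first dequantized into a full-size temporary tensor, then a separate matmul runs over it, paying for two kernel launches and the intermediate memory traffic:

```python
import torch

def dequantize(qweight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Toy int8 -> float dequantization standing in for the FP6 format.
    return qweight.float() * scale

def unfused_linear(x, qweight, scale):
    w = dequantize(qweight, scale)  # kernel 1: materializes full-size fp32 weight
    return x @ w.t()                # kernel 2: plain matmul over the temporary

# A fused kernel (as in fp6_llm) instead dequantizes weight tiles in
# registers inside the matmul loop, skipping the temporary tensor and
# the extra kernel launch that the path above pays for.
x = torch.randn(4, 8)
qw = torch.randint(-128, 127, (16, 8), dtype=torch.int8)
y = unfused_linear(x, qw, torch.tensor(0.05))
```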

Could it be caused by the existence of a double BOS token? The Completion API automatically adds it while the Chat API doesn't (because it's included in the chat template).
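One way to check for that doubling, assuming a Llama-3 style tokenizer whose chat template already emits the BOS text; the model choice is illustrative:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B-Instruct")

# The chat template's rendered string already starts with the BOS token.
rendered = tok.apply_chat_template(
    [{"role": "user", "content": "hi"}], tokenize=False
)

# Encoding that string with add_special_tokens=True (what a completion
# endpoint typically does) prepends a second BOS.
ids = tok(rendered, add_special_tokens=True).input_ids
print(ids[:2], tok.bos_token_id)  # two identical leading ids => doubled BOS
```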