bulaikexiansheng

Results: 10 comments by bulaikexiansheng

Thanks for the explanation! Could you share the code from the GitHub repo that you used to evaluate Falcon 40B and LLaMA 70B on the 4090? My command is as follows: `./build/bin/main -m /data/models/falcon-40b-relu-powerinfer/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "In the depths of twilight, where shadows dance with whispers, ancient secrets stir beneath the surface, beckoning the curious...

To add some detail on my experimental procedure: on each machine I ran `./build/bin/main -m ./PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF/llama2-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"` 10 times, with a 5-second interval between runs, and took the average. One remaining question: in principle, regardless of the generated sequence length, `load_time` should not fluctuate this much on the A100.
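For reproducibility, here is a minimal sketch of that measurement loop. It assumes llama.cpp/PowerInfer-style timing lines such as `load time = ... ms` on stderr; the exact log format can differ between builds, so the regex may need adjusting:

```python
import re
import statistics
import subprocess
import time

# Run the PowerInfer `main` binary 10 times with a 5-second pause between
# runs and average the reported load time (same procedure as described above).
CMD = [
    "./build/bin/main",
    "-m", "./PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF/llama2-70b-relu.q4.powerinfer.gguf",
    "-n", "128", "-t", "8", "-p", "Once upon a time",
]
LOAD_RE = re.compile(r"load time\s*=\s*([\d.]+)\s*ms")  # assumed log format

load_times = []
for _ in range(10):
    proc = subprocess.run(CMD, capture_output=True, text=True)
    match = LOAD_RE.search(proc.stderr)
    if match:
        load_times.append(float(match.group(1)))
    time.sleep(5)  # 5-second interval between runs

if len(load_times) >= 2:
    print(f"runs: {len(load_times)}")
    print(f"mean load_time:  {statistics.mean(load_times):.2f} ms")
    print(f"stdev load_time: {statistics.stdev(load_times):.2f} ms")
```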

Sorry, the log I provided looks messy; the following may be clearer: (base) turbo@sma100-02:/home/turbo/projects/PowerInfer$ ./build/bin/main -m /home/turbo/models/ReluLLaMA-70B/llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --ignore-eos Log start main: build = 1578 (906830b) main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0...

@Boreuseful Thanks for your benchmark! Could you share your code? I want to compare the performance of MLP and KAN, but I run into a memory error when I do...
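In the meantime, here is a rough sketch of the timing harness I have in mind. Only the MLP is built concretely; the KAN layer is assumed to come from whichever KAN implementation is being compared (e.g. an `efficient-kan`-style module), so that line is left commented out. If an out-of-memory error appears, lowering `batch_size` or the hidden width is the first thing to try:

```python
import time
import torch
import torch.nn as nn

def benchmark(model: nn.Module, x: torch.Tensor, iters: int = 100) -> float:
    """Average forward-pass time in seconds for one batch."""
    model.eval()
    with torch.no_grad():
        for _ in range(10):               # warm-up iterations
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size, in_dim, hidden, out_dim = 1024, 128, 256, 10
x = torch.randn(batch_size, in_dim, device=device)

mlp = nn.Sequential(
    nn.Linear(in_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, out_dim),
).to(device)

print(f"MLP forward: {benchmark(mlp, x) * 1e3:.3f} ms/iter")
# kan = KAN([in_dim, hidden, out_dim]).to(device)  # plug in your KAN library here
# print(f"KAN forward: {benchmark(kan, x) * 1e3:.3f} ms/iter")
```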

I have run the Medusa code and trained the draft model. I found that I also need to run the script that generates the corresponding `choices` for the trained model. It seems...
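For anyone else landing here: as far as I understand, the `choices` object is just a candidate-tree specification, a list of paths through the Medusa heads' top-k predictions (in the style of the built-in trees such as `mc_sim_7b_63`). A hand-written toy example, not one produced by the search script:

```python
# Illustrative, hand-written Medusa candidate tree -- NOT generated by the
# repo's search script. Each tuple is one path through the draft heads:
# the i-th element is the top-k index taken at the i-th Medusa head.
toy_medusa_choices = [
    (0,),        # top-1 token from head 0
    (1,),        # top-2 token from head 0
    (0, 0),      # head 0 top-1 followed by head 1 top-1
    (0, 1),      # head 0 top-1 followed by head 1 top-2
    (0, 0, 0),   # a deeper path through heads 0 -> 1 -> 2
]

# A tree tuned for one draft model generally will not transfer to another,
# which is presumably why the search has to be rerun after training.
print(f"{len(toy_medusa_choices)} candidate paths")
```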

Is the problem solved? I encountered a similar problem. I used the `generation_speed.py` script to compare the speed of the original model and the quantized model on an RTX 4090, but I got the result below:...
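As a sanity check alongside `generation_speed.py`, this is roughly how I would time the two checkpoints directly with `transformers` (the model paths are placeholders; the numbers will differ somewhat from the script's own accounting):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def tokens_per_second(model_id: str, prompt: str, max_new_tokens: int = 128) -> float:
    """Time a single greedy generate() call and return new tokens per second."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    model.generate(**inputs, max_new_tokens=8)   # warm-up so kernels are cached
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed

# Placeholder paths: swap in the original and the GPTQ-quantized checkpoints.
for path in ["/path/to/original-model", "/path/to/gptq-quantized-model"]:
    print(path, f"{tokens_per_second(path, 'Once upon a time'):.2f} tok/s")
```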

If I set `use_triton=True`:
```
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:410: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:418: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)`...
```
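These FutureWarnings come from auto_gptq's own Triton kernels rather than from my code, so they are noise rather than an error. A hedged sketch of the two practical options, assuming the filter is installed before auto_gptq is imported (upgrading auto-gptq / torch is the cleaner long-term fix):

```python
import warnings

# The deprecation warnings fire when auto_gptq's kernels module is imported,
# so the filter has to be registered before that import happens.
warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    message=r".*torch\.cuda\.amp\.custom_(fwd|bwd).*",
)
# import auto_gptq  # <- the filter above must run before this import

import torch

# For code you control, the replacement the warning suggests (torch >= 2.4)
# looks like this on a custom autograd Function:
class Double(torch.autograd.Function):
    @staticmethod
    @torch.amp.custom_fwd(device_type="cuda")   # instead of torch.cuda.amp.custom_fwd
    def forward(ctx, x):
        return x * 2

    @staticmethod
    @torch.amp.custom_bwd(device_type="cuda")   # instead of torch.cuda.amp.custom_bwd
    def backward(ctx, grad_out):
        return grad_out * 2
```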

> Hello, thanks for your interest in our work! Nothing is wrong (this is expected). Because of offloading and the limited compute resources, the prefill stage is quite long; you just need to wait. One possible remedy is a prefill acceleration technique such as [MInference](https://github.com/microsoft/MInference).

Thanks for your reply! How long did it take you to test on the RTX 4090? Would it be faster if I ran it on an A100-80G?

I noticed the `--prefill` command-line parameter and thought I could set it to a smaller value just to get the code running. I set it to 104, but it seems to report...

> 104 maybe too small, try 32768 (32k)

Thanks, it works!
```
[Overall Latency]: 0.08335429636659987
[Overall Avg Accepted Tokens]: 11.336667760098464
```