bulaikexiansheng

Results: 10 comments by bulaikexiansheng

Thanks for the explanation! Could you share the code from the GitHub repo that you used to evaluate Falcon 40B and LLaMA 70B on the 4090? My command is as follows: `./build/bin/main -m /data/models/falcon-40b-relu-powerinfer/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "In the depths of twilight, where shadows dance with whispers, ancient secrets stir beneath the surface, beckoning the curious...

To add some detail on my experimental procedure: on each machine I ran `./build/bin/main -m ./PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF/llama2-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"` 10 times, with a 5-second interval between runs, and took the average. One remaining question: in principle, regardless of the generated sequence length, `load_time` should not fluctuate this much on the A100.
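For reproducibility, here is a minimal sketch of that measurement loop. It assumes llama.cpp/PowerInfer-style timing lines such as `load time = ... ms` on stderr; the exact log format can differ between builds, so the regex may need adjusting:

```python
import re
import statistics
import subprocess
import time

# Run the PowerInfer `main` binary 10 times with a 5-second pause between
# runs and average the reported load time (same procedure as described above).
CMD = [
    "./build/bin/main",
    "-m", "./PowerInfer/ReluLLaMA-70B-PowerInfer-GGUF/llama2-70b-relu.q4.powerinfer.gguf",
    "-n", "128", "-t", "8", "-p", "Once upon a time",
]
LOAD_RE = re.compile(r"load time\s*=\s*([\d.]+)\s*ms")  # assumed log format

load_times = []
for _ in range(10):
    proc = subprocess.run(CMD, capture_output=True, text=True)
    match = LOAD_RE.search(proc.stderr)
    if match:
        load_times.append(float(match.group(1)))
    time.sleep(5)  # 5-second interval between runs

if len(load_times) >= 2:
    print(f"runs: {len(load_times)}")
    print(f"mean load_time:  {statistics.mean(load_times):.2f} ms")
    print(f"stdev load_time: {statistics.stdev(load_times):.2f} ms")
```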

Sorry, the log I provided looks messy; the following may be clearer: (base) turbo@sma100-02:/home/turbo/projects/PowerInfer$ ./build/bin/main -m /home/turbo/models/ReluLLaMA-70B/llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --ignore-eos Log start main: build = 1578 (906830b) main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0...

@Boreuseful Thanks for your benchmark! Could you share your code? I want to compare the performance of MLP and KAN, but I run into a memory error when I do...
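In the meantime, here is a rough sketch of the timing harness I have in mind. Only the MLP is built concretely; the KAN layer is assumed to come from whichever KAN implementation is being compared (e.g. an `efficient-kan`-style module), so that line is left commented out. If an out-of-memory error appears, lowering `batch_size` or the hidden width is the first thing to try:

```python
import time
import torch
import torch.nn as nn

def benchmark(model: nn.Module, x: torch.Tensor, iters: int = 100) -> float:
    """Average forward-pass time in seconds for one batch."""
    model.eval()
    with torch.no_grad():
        for _ in range(10):               # warm-up iterations
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size, in_dim, hidden, out_dim = 1024, 128, 256, 10
x = torch.randn(batch_size, in_dim, device=device)

mlp = nn.Sequential(
    nn.Linear(in_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, out_dim),
).to(device)

print(f"MLP forward: {benchmark(mlp, x) * 1e3:.3f} ms/iter")
# kan = KAN([in_dim, hidden, out_dim]).to(device)  # plug in your KAN library here
# print(f"KAN forward: {benchmark(kan, x) * 1e3:.3f} ms/iter")
```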

I have run the Medusa code and trained the draft model. I found that I also need to run the script that generates the corresponding `choices` for the trained model. It seems...
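For anyone else landing here: as far as I understand, the `choices` object is just a candidate-tree specification, a list of paths through the Medusa heads' top-k predictions (in the style of the built-in trees such as `mc_sim_7b_63`). A hand-written toy example, not one produced by the search script:

```python
# Illustrative, hand-written Medusa candidate tree -- NOT generated by the
# repo's search script. Each tuple is one path through the draft heads:
# the i-th element is the top-k index taken at the i-th Medusa head.
toy_medusa_choices = [
    (0,),        # top-1 token from head 0
    (1,),        # top-2 token from head 0
    (0, 0),      # head 0 top-1 followed by head 1 top-1
    (0, 1),      # head 0 top-1 followed by head 1 top-2
    (0, 0, 0),   # a deeper path through heads 0 -> 1 -> 2
]

# A tree tuned for one draft model generally will not transfer to another,
# which is presumably why the search has to be rerun after training.
print(f"{len(toy_medusa_choices)} candidate paths")
```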

Is the problem solved? I encountered a similar problem. I used the `generation_speed.py` script to compare the speed of the original model and the quantized model on an RTX 4090, but I got the result below:...
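As a sanity check alongside `generation_speed.py`, this is roughly how I would time the two checkpoints directly with `transformers` (the model paths are placeholders; the numbers will differ somewhat from the script's own accounting):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def tokens_per_second(model_id: str, prompt: str, max_new_tokens: int = 128) -> float:
    """Time a single greedy generate() call and return new tokens per second."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", torch_dtype="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    model.generate(**inputs, max_new_tokens=8)   # warm-up so kernels are cached
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed

# Placeholder paths: swap in the original and the GPTQ-quantized checkpoints.
for path in ["/path/to/original-model", "/path/to/gptq-quantized-model"]:
    print(path, f"{tokens_per_second(path, 'Once upon a time'):.2f} tok/s")
```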

If I set `use_triton=True`:
```
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:410: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:418: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)`...
```
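These FutureWarnings come from auto_gptq's own Triton kernels rather than from my code, so they are noise rather than an error. A hedged sketch of the two practical options, assuming the filter is installed before auto_gptq is imported (upgrading auto-gptq / torch is the cleaner long-term fix):

```python
import warnings

# The deprecation warnings fire when auto_gptq's kernels module is imported,
# so the filter has to be registered before that import happens.
warnings.filterwarnings(
    "ignore",
    category=FutureWarning,
    message=r".*torch\.cuda\.amp\.custom_(fwd|bwd).*",
)
# import auto_gptq  # <- the filter above must run before this import

import torch

# For code you control, the replacement the warning suggests (torch >= 2.4)
# looks like this on a custom autograd Function:
class Double(torch.autograd.Function):
    @staticmethod
    @torch.amp.custom_fwd(device_type="cuda")   # instead of torch.cuda.amp.custom_fwd
    def forward(ctx, x):
        return x * 2

    @staticmethod
    @torch.amp.custom_bwd(device_type="cuda")   # instead of torch.cuda.amp.custom_bwd
    def backward(ctx, grad_out):
        return grad_out * 2
```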

> Hello, thanks for your interest in our work! Nothing is wrong (this is expected). Because of offloading and the limited compute resources, the prefill stage is quite long; you just need to wait. One possible remedy is a prefill acceleration technique such as [MInference](https://github.com/microsoft/MInference).

Thanks for your reply! How long did it take you to test on the RTX 4090? Would it be faster if I ran it on an A100-80G?

I noticed the `--prefill` command-line parameter and thought I could set it to a smaller value just to get the code running. I set it to 104, but it seems to report...

> 104 maybe too small, try 32768 (32k)

Thanks, it works!
```
[Overall Latency]: 0.08335429636659987
[Overall Avg Accepted Tokens]: 11.336667760098464
```