Sekri0
### Your current environment The output of `python collect_env.py` ```text Your output of `python collect_env.py` here ``` ### 🐛 Describe the bug `--enable-prefix-caching` causes a CUDA error: illegal memory access. According...
I noticed that the seqlen dimension (variable M in engine/test.sh) in the kernel benchmark is very small. Does this mean that the test only considers the decode stage and ignores...
I use this command to quantize the llama2-7b-chat model, but the model size doesn't change. CUDA_VISIBLE_DEVICES=0 python3 main.py \ --model /mnt/home/model/llama2-7b-chat-hf \ --epochs 20 --output_dir ./log/llama2-7b-w2a8 \ --eval_ppl --wbits 2 --abits...
As mentioned in the README, [Note that due to the limitations of AutoGPTQ kernels, the real quantization of weight-only quantization can only lead memory reduction, but with slower inference speed.] I'm...
Notice: In order to resolve issues more efficiently, please raise issues following the template and fill in the details. ## 🐛 Bug Running demo_onnx.py raises an error. demo_onnx.py: Error message: Traceback (most recent call last): File "/workspace/data/wcy/code/SenseVoice/demo_onnx.py", line...