liangzelang comments

Results: 14 comments by liangzelang

Can't find 'libpymo.py' file in aimet_common in any version

Is there any way to solve this issue? I hit the same problem using Aimet-common-1.28.0.

[Feature Request] Do you have any plan to support CPU backend on Android devices?

> We once used the Hexagon DSP backend to test TinyLlama-1.1B (q4f16_0). It is too slow: it needs 20 seconds to decode 1 token. There is HTP on Qualcomm DSP, but it...

[Feature Request] Do you have any plan to support CPU backend on Android devices?

> > > We once used the Hexagon DSP backend to test TinyLlama-1.1B (q4f16_0). It is too slow: it needs 20 seconds to decode 1 token. There is HTP on Qualcomm DSP,...

Prefill speed is approximately 4~6 tokens/s for Qwen1.5-1.8B

> > > In the npuExe.run function, the QNN graph is built and then executed, and the graph-building stage takes most of the time. The loading and constructing...

[Feature]: deepseek-v2 awq support

Why clblast::Gemv API is slower than clblast_tune_xgemv when m = 2048 n=16384 precision=fp16.

> What happens if you leave out the first call to `clblast::Gemv` for the measurement? Or better perhaps, just print out the individual times of each run. The first call...
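The suggestion quoted above, printing the time of each run instead of a single total, can be sketched as follows. This is a minimal illustration in plain Python, not CLBlast code: `gemv` is a hypothetical CPU stand-in for the real `clblast::Gemv` call, `time_each_run` is an illustrative helper name, and the sizes are shrunk so the sketch runs quickly.

```python
import statistics
import time


def gemv(a, x):
    """Hypothetical stand-in workload: y = A * x for a list of rows."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def time_each_run(fn, runs=5):
    """Call fn() `runs` times and return the wall-clock time of every call in ms."""
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return times_ms


# Small sizes for illustration; the thread uses m = 2048, n = 16384.
m, n = 64, 256
a = [[1.0] * n for _ in range(m)]
x = [1.0] * n

times_ms = time_each_run(lambda: gemv(a, x), runs=5)
for i, t in enumerate(times_ms):
    print(f"run {i}: {t:.3f} ms")

# Report the first call separately: any one-time setup cost shows up only there.
first_ms = times_ms[0]
steady_ms = statistics.median(times_ms[1:])
print(f"first call: {first_ms:.3f} ms, median of later calls: {steady_ms:.3f} ms")
```

With an asynchronous API such as OpenCL, the timer should stop only after the operation has actually completed (for example, after waiting on the returned event); otherwise only the enqueue cost is measured.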

Why clblast::Gemv API is slower than clblast_tune_xgemv when m = 2048 n=16384 precision=fp16.

Yeah, it's a good step. I added the compile option '-DVERBOSE=ON' to the CMake command line, re-ran clblast_tuner_xgemv, and got the details below. ``` ./clblast_tuner_xgemv -m 2048 -n 16384 -precision 16 -warmup *...

Why clblast::Gemv API is slower than clblast_tune_xgemv when m = 2048 n=16384 precision=fp16.

Thanks, I did some experiments based on your suggestions: 1. I modified the transpose option in the test code as follows, but the execution log shows that it is not gemv...

Why clblast::Gemv API is slower than clblast_tune_xgemv when m = 2048 n=16384 precision=fp16.

In fact, my goal is very simple: to do a matrix-vector multiplication of [2048, 16384] * [16384] with performance consistent with what the tuner reports.
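One way to compare an API call against a tuner result on equal terms is to convert each measured time into effective memory bandwidth, since a matrix-vector product reads every matrix element once. The sketch below works through that arithmetic for the sizes in this thread, assuming fp16 (2 bytes per element). The helper names are illustrative, and the 1.0 ms timing is a placeholder, not a measurement.

```python
def gemv_bytes(m, n, bytes_per_element=2):
    """Bytes touched by y = A * x: the m*n matrix, the n-vector x, the m-vector y."""
    return (m * n + n + m) * bytes_per_element


def effective_bandwidth_gb_s(m, n, elapsed_ms, bytes_per_element=2):
    """Effective bandwidth in GB/s (10^9 bytes per second) for one GEMV call."""
    return gemv_bytes(m, n, bytes_per_element) / (elapsed_ms * 1e-3) / 1e9


m, n = 2048, 16384
print(m * n)             # 33554432 matrix elements
print(gemv_bytes(m, n))  # 67145728 bytes, about 64 MiB, dominated by the matrix

# Placeholder timing for illustration only; substitute a measured value.
placeholder_ms = 1.0
print(f"{effective_bandwidth_gb_s(m, n, placeholder_ms):.2f} GB/s at {placeholder_ms} ms")
```

Comparing the two timings as bandwidth figures makes it easier to see whether the gap comes from the kernel itself or from overhead around it.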

Why clblast::Gemv API is slower than clblast_tune_xgemv when m = 2048 n=16384 precision=fp16.

Thank you for your answer. 1. Following your suggestion, I transposed the A matrix in advance and configured clblast::Transpose::kYes so that the API can select the XgemvFast kernel and...
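Storing A transposed in advance and then asking for a transposed product gives the same result vector as the original product; only the memory layout being read changes. The sketch below checks that equivalence in plain Python on a tiny matrix. It does not call CLBlast or model which kernel the library selects, and `gemv`, `transpose`, and `gemv_transposed` are illustrative names.

```python
def gemv(a, x):
    """y = A * x, with A stored as m rows of length n."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def transpose(a):
    """Return A^T as n rows of length m."""
    return [list(col) for col in zip(*a)]


def gemv_transposed(a_t, x):
    """y = (A^T)^T * x, reading the pre-transposed storage directly."""
    m = len(a_t[0])
    y = [0.0] * m
    for j, row in enumerate(a_t):  # row j of A^T is column j of A
        x_j = x[j]
        for i, value in enumerate(row):
            y[i] += value * x_j
    return y


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]  # m = 2, n = 3
x = [1.0, 0.5, 2.0]

a_t = transpose(a)  # done once, ahead of time
y_direct = gemv(a, x)
y_from_transposed = gemv_transposed(a_t, x)
print(y_direct, y_from_transposed)
```

Because the transpose is done once up front, its cost sits outside the timed multiplication, which is what makes this layout change useful for a matrix that is reused many times.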