liangzelang

Results 14 comments of liangzelang

Is there any way to solve this issue? I hit the same problem using Aimet-common-1.28.0.

> We once used the Hexagon DSP backend to test TinyLlama-1.1B (q4f16_0). It is too slow, needing 20 seconds to decode 1 token. There is HTP on Qualcomm DSP, but it...

> > > We once used the Hexagon DSP backend to test TinyLlama-1.1B (q4f16_0). It is too slow, needing 20 seconds to decode 1 token. There is HTP on Qualcomm DSP,...

> > > In the npuExe.run function, the QNN graph is built and then executed, and the graph-building stage takes most of the time. The loading and constructing...

> What happens if you leave out the first call to `clblast::Gemv` for the measurement? Or better perhaps, just print out the individual times of each run. The first call...

Yeah, it's a good step. I added the compile option '-DVERBOSE=ON' to the CMake command line, re-ran clblast_tuner_xgemv, and got the details below. ``` ./clblast_tuner_xgemv -m 2048 -n 16384 -precision 16 -warmup *...

Thanks, I did some experiments based on your suggestions. 1. I modified the transpose option in the test code as follows, but the execution log shows that it is not gemv...

In fact, my goal is very simple: to do a matrix-vector multiplication of [2048, 16384] * [16384] and get performance consistent with what the tuner reports.
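For reference, the operation and the per-run measurement discussed in this thread can be sketched on the CPU as below. This is only an illustrative sketch, not CLBlast code: `gemv_ref` and `time_runs` are hypothetical helper names, the matrix is assumed to be row-major, and the timings it produces say nothing about the GPU kernels.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Reference row-major GEMV: y = A * x, with A of shape [m, n] and x of length n.
// (Hypothetical helper for illustration; not part of CLBlast.)
std::vector<float> gemv_ref(const std::vector<float>& a,
                            const std::vector<float>& x,
                            std::size_t m, std::size_t n) {
  assert(a.size() == m * n && x.size() == n);
  std::vector<float> y(m, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    float acc = 0.0f;
    for (std::size_t j = 0; j < n; ++j) acc += a[i * n + j] * x[j];
    y[i] = acc;
  }
  return y;
}

// Calls fn() `runs` times and returns one wall-clock duration (in ms) per call,
// so the first (possibly slower) call can be inspected or dropped separately.
template <typename F>
std::vector<double> time_runs(F&& fn, int runs) {
  std::vector<double> ms;
  ms.reserve(static_cast<std::size_t>(runs));
  for (int r = 0; r < runs; ++r) {
    const auto t0 = std::chrono::steady_clock::now();
    fn();
    const auto t1 = std::chrono::steady_clock::now();
    ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
  }
  return ms;
}
```

A caller would wrap the operation under test in a lambda, for example `time_runs([&] { y = gemv_ref(a, x, 2048, 16384); }, 10)`, and print every entry rather than only the average. When timing an OpenCL routine this way, the lambda also has to wait for the command queue to finish, otherwise only the enqueue cost is measured.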

Thank you for your answer. 1. Following your suggestion, I transposed the A matrix in advance and set clblast::Transpose::kYes, so that the API can select the XgemvFast kernel and...