shifeiwen

16 comments of shifeiwen

@quic-mangal In CNNs we usually quantize convolution kernels at per-channel or per-layer granularity, but the main operation in an LLM is matrix multiplication. When performing matrix multiplication, we can...
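For reference, a minimal numpy sketch of the two granularities being contrasted; the shapes and names are illustrative, not tied to any particular framework:

```python
import numpy as np

# Symmetric int8 quantization of a conv kernel at two granularities.
# Shape (out_ch, in_ch, kH, kW) is illustrative.
w = np.random.randn(16, 3, 3, 3).astype(np.float32)

# Per-layer (per-tensor): a single scale for the whole kernel.
s_layer = np.abs(w).max() / 127.0
q_layer = np.round(w / s_layer).astype(np.int8)

# Per-channel: one scale per output channel, so each filter is
# quantized against its own dynamic range.
s_chan = np.abs(w).reshape(16, -1).max(axis=1) / 127.0  # shape (16,)
q_chan = np.round(w / s_chan[:, None, None, None]).astype(np.int8)

# Per-channel usually reconstructs more accurately.
print("per-layer err  :", np.abs(q_layer * s_layer - w).mean())
print("per-channel err:", np.abs(q_chan * s_chan[:, None, None, None] - w).mean())
```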

Are there any new updates on this discussion?

Update: ![image](https://github.com/mlc-ai/mlc-llm/assets/147359299/e4fc42ee-3fa3-4c83-ab9c-2084a8817b2e) I found that there are some small operations in the middle of each operator, and these operations take a lot of time. I don't know if these...
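As a host-side analogy (plain numpy, not MLC code) for why many tiny operations around the main kernels can dominate wall-clock time: fixed per-call overhead swamps the actual work when each op is small.

```python
import time
import numpy as np

# The same data processed as 10,000 small dispatches vs. one batched dispatch.
small = [np.random.randn(32).astype(np.float32) for _ in range(10_000)]
big = np.stack(small)

t0 = time.perf_counter()
out_small = [np.exp(a) for a in small]   # many tiny ops
t_small = time.perf_counter() - t0

t0 = time.perf_counter()
out_big = np.exp(big)                    # one op over the same data
t_big = time.perf_counter() - t0

print(f"10k tiny ops: {t_small*1e3:.1f} ms, one batched op: {t_big*1e3:.1f} ms")
```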

@FdyCN The problem seems to be that the HTP backend has many limitations, including how much memory can be requested and how fast that memory is. However, Qualcomm has promoted in some videos...

I have tried implementing a 1.1B LLaMA on the Hexagon backend before, and it was very slow because I did not use CPU scheduling and only added HVX compilation instructions when...
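For context, a hypothetical TVM sketch of what CPU-style scheduling plus HVX vectorization could look like; the target string and build flow here are assumptions, and a real run additionally needs the Hexagon SDK toolchain and TVM's Hexagon RPC setup:

```python
import tvm
from tvm import te

# Toy elementwise kernel scheduled for Hexagon.
n = 1024
A = te.placeholder((n,), dtype="float32", name="A")
B = te.compute((n,), lambda i: A[i] * A[i], name="B")

sch = tvm.tir.Schedule(te.create_prim_func([A, B]))
(i,) = sch.get_loops(sch.get_block("B"))
io, ii = sch.split(i, factors=[None, 32])
sch.vectorize(ii)  # inner lanes can map to HVX vector instructions
sch.parallel(io)   # the CPU-side thread scheduling the comment says it lacked

target = tvm.target.hexagon("v68")  # assumed Hexagon arch version
mod = tvm.build(sch.mod, target=tvm.target.Target(target, host=target))
```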

@FdyCN Yes, there are currently some ways to get MLC running on the Hexagon backend, but in my tests it was very slow: each token of a 1.1B LLaMA takes more than 60 s (there...

+1. Some NPUs cannot bind threads the way CPUs can, so PagedKVCache cannot be used. Is there any recent progress?
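To make "binding threads" concrete, a minimal Linux-only Python sketch of pinning the current process to a core; CPU runtimes rely on this kind of affinity control, and the point above is that many NPU runtimes expose no equivalent:

```python
import os

# Pin the current process to core 0, then read back the affinity mask.
# CPU thread pools (e.g. TVM's) use this kind of control to keep one
# worker per core.
os.sched_setaffinity(0, {0})
print(os.sched_getaffinity(0))  # -> {0}
```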

Same error. Is there any progress on this issue so far? @0x1997

@ChenMnZ Do you have any progress or tips on this, so that I can successfully load and run the quantized weights in MLC?

Hi @MasterJH5574, I followed the instructions in gemv and set the loop-unrolling value to 8. Running the OpenCL kernel on the 8 Gen 2 does not cause errors. There is no significant...
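Not the actual dlight gemv schedule, but a toy TVM sketch of where a loop-unroll factor of 8 enters a TIR schedule before OpenCL codegen (assumes an OpenCL-enabled TVM build):

```python
import tvm
from tvm.script import tir as T

# A naive 128x128 gemv: Y = A @ X.
@T.prim_func
def gemv(a: T.handle, x: T.handle, y: T.handle):
    A = T.match_buffer(a, (128, 128), "float32")
    X = T.match_buffer(x, (128,), "float32")
    Y = T.match_buffer(y, (128,), "float32")
    for i, k in T.grid(128, 128):
        with T.block("gemv"):
            vi, vk = T.axis.remap("SR", [i, k])
            with T.init():
                Y[vi] = T.float32(0)
            Y[vi] = Y[vi] + A[vi, vk] * X[vk]

sch = tvm.tir.Schedule(gemv)
i, k = sch.get_loops(sch.get_block("gemv"))
ko, ki = sch.split(k, factors=[None, 8])
sch.unroll(ki)                    # the unroll value set to 8
sch.bind(i, "threadIdx.x")        # thread binding required for OpenCL codegen
mod = tvm.build(sch.mod, target="opencl")
```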