Dynamic shape support for gemmlowp
Excited to see the dynamic GEMM update. Is there any plan to update gemmlowp as well?
Hi @mason20240920
Would you please share more details of your use-case? Are you trying to run a model that requires dynamic shapes support in gemmlowp? Which one?
We need to build an LLM application based on ACL, and in GPT-style models the sequence length is dynamic. That's why we need dynamic operators.
Hi @mason20240920 , a few use cases around Gemmlowp have been mentioned to us, so while it isn't firmly planned for delivery, I can say it is in the discovery phase. I have two questions:
- What data type combinations are you interested in, i.e. Int8, UInt8, etc.?
- Are there any other operators that need to work with dynamic shapes?
Thank you for your prompt response
- We are now using per-channel int8 quantization for the weights and int8 for the activations, similar to the A8W8 approach used in the Qwen and Llama models (see the sketch after this list).
- Yes, we need Softmax, Split, and other operations to be dynamic. Those kernels are relatively easy to modify for window-size adjustments, so we can change them ourselves, but GEMM and MatMul require the assembly code to cooperate.
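To make the scheme concrete, here is a minimal numpy sketch of what we mean by A8W8 (symmetric per-channel int8 weights, per-tensor int8 activations); the symmetric range and round-to-nearest choices here are illustrative simplifications, not necessarily the exact scheme in our models:

```python
import numpy as np

def quantize_weights_per_channel_int8(w):
    """Symmetric per-channel int8 quantization of a weight matrix.

    w: float32 array of shape (out_channels, in_channels).
    Returns int8 weights plus one float32 scale per output channel.
    """
    max_abs = np.max(np.abs(w), axis=1, keepdims=True)            # (out_channels, 1)
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0)          # map max |w| to 127
    w_q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return w_q, scales.squeeze(1).astype(np.float32)

def quantize_activations_per_tensor_int8(x):
    """Symmetric per-tensor int8 quantization of an activation tensor."""
    scale = max(float(np.max(np.abs(x))) / 127.0, 1e-8)
    x_q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return x_q, np.float32(scale)
```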
Thanks for the details @mason20240920,
- Which hardware (or hardware features) are you targeting?
- Is output Int8 as well?
- Is there bias and is it Int32?
- We are targeting CPUs: due to memory bandwidth limitations, they currently remain the best option for on-device inference of large models.
- Yes, the output is int8 as well.
- Yes, there is a bias; int16 would be enough for us, but int32 is okay (see the sketch below).
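For context, this is roughly the requantization pipeline we have in mind for int8 inputs, int32 accumulation and bias, and an int8 output (a minimal sketch; the single float output scale and round-to-nearest are simplifying assumptions):

```python
import numpy as np

def int8_gemm_requantized(a_q, w_q, bias_i32, a_scale, w_scales, out_scale, out_zero_point=0):
    """int8 x int8 GEMM with int32 accumulation, int32 bias, and int8 output.

    a_q:      (M, K) int8 activations, per-tensor scale a_scale
    w_q:      (N, K) int8 weights, per-channel float32 scales w_scales of shape (N,)
    bias_i32: (N,)   int32 bias, assumed quantized with scale a_scale * w_scales
    out_scale, out_zero_point: quantization parameters of the int8 output
    """
    acc = a_q.astype(np.int32) @ w_q.astype(np.int32).T + bias_i32   # (M, N) int32
    # Requantize: real value = acc * (a_scale * w_scales), then map to the output scale.
    real = acc.astype(np.float64) * (a_scale * w_scales)
    q = np.round(real / out_scale) + out_zero_point
    return np.clip(q, -128, 127).astype(np.int8)
```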