Dynamic support for gemmlowp

Open mason20240920 opened this issue 10 months ago • 6 comments

Excited to see the dynamic GEMM update. Is there any plan to add the same support to gemmlowp?

mason20240920 · Mar 10 '25 07:03

Hi @mason20240920

Would you please share more details of your use-case? Are you trying to run a model that requires dynamic shapes support in gemmlowp? Which one?

morgolock · Mar 12 '25 10:03

We need to build an LLM application on top of ACL, and in GPT-style models the sequence length is dynamic. That's why we need dynamic operators.
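To illustrate with a minimal sketch (plain C++, not ACL API; all names here are made up): the shapes a GPT-style decoder feeds into its GEMM/MatMul calls depend on runtime values, so they cannot be fixed at configure time. Prefill sees M = prompt length, and the attention score MatMul grows with the KV cache on every generated token:

```cpp
#include <cstddef>

// Illustrative only (not ACL API): the dimensions seen by the GEMM/MatMul
// calls of a decoder-only (GPT-style) model. m/n/k are the usual GEMM dims.
struct GemmShape
{
    std::size_t m;
    std::size_t n;
    std::size_t k;
};

// Prefill projection GEMM: activations are [prompt_len x hidden], weights are
// [hidden x out_features], so M depends on the prompt length, which is only
// known at runtime.
GemmShape prefill_projection_shape(std::size_t prompt_len, std::size_t hidden, std::size_t out_features)
{
    return {prompt_len, out_features, hidden};
}

// Attention score MatMul Q * K^T for one head at decode step t:
// Q is [1 x head_dim], K^T is [head_dim x (prompt_len + t)], so one dimension
// grows with every generated token.
GemmShape attention_scores_shape(std::size_t prompt_len, std::size_t t, std::size_t head_dim)
{
    return {1, prompt_len + t, head_dim};
}
```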

mason20240920 · Mar 12 '25 10:03

Hi @mason20240920, there have been a few use cases mentioned to us around Gemmlowp, so it's not firmly planned for delivery yet, but I can say it's in the discovery phase. I have two questions:

  • Which data type combinations are you interested in, i.e. Int8, UInt8, etc.?
  • Are there any other operators that need to work with dynamic shapes?

gunes-arm · Mar 12 '25 13:03

Thank you for your prompt response

  • We are currently using per-channel int8 quantization for the weights and int8 for the activations, similar to the A8W8 approach used in the Qwen and Llama models (a minimal sketch follows after this list).
  • Yes, we need Softmax, Split, and other operations to be dynamic. Those kernels are relatively easy to adapt ourselves, since only the window sizes need adjusting, but GEMM and MatMul also require the assembly kernels to cooperate.
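As a minimal sketch of the scheme above (scalar reference code, not ACL or gemmlowp kernels; names and layouts are illustrative assumptions): int8 activations with a per-tensor zero point, symmetric per-channel int8 weights, and int32 accumulation, with requantisation handled by a separate output stage:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for the A8W8 core: `a` is an M x K int8 activation matrix
// with an asymmetric per-tensor zero point, `w` is a K x N int8 weight matrix
// quantised symmetrically with one scale per output channel (column n).
// Accumulation is done in int32; requantisation back to int8 is a separate
// output stage.
std::vector<int32_t> a8w8_accumulate(const std::vector<int8_t> &a, int32_t a_zero_point,
                                     const std::vector<int8_t> &w,
                                     int M, int N, int K)
{
    std::vector<int32_t> acc(static_cast<std::size_t>(M) * N, 0);
    for (int m = 0; m < M; ++m)
    {
        for (int n = 0; n < N; ++n)
        {
            int32_t sum = 0;
            for (int k = 0; k < K; ++k)
            {
                // Subtract the activation zero point; symmetric per-channel
                // weights have a zero offset by definition.
                sum += (static_cast<int32_t>(a[m * K + k]) - a_zero_point) *
                       static_cast<int32_t>(w[k * N + n]);
            }
            acc[m * N + n] = sum;
        }
    }
    return acc;
}
```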

mason20240920 · Mar 13 '25 01:03

Thanks for the details @mason20240920,

  • Which hardware (or hardware features) are you targeting?
  • Is the output Int8 as well?
  • Is there a bias, and is it Int32?

gunes-arm · Mar 14 '25 11:03

  1. Due to memory bandwidth limitations, CPUs currently remain the optimal choice for on-device inference of large models, so we are targeting the CPU.
  2. Yes, the output is Int8.
  3. The bias can be Int16, but Int32 is okay (see the sketch below).
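To make the last two answers concrete, here is a hedged scalar sketch of the output stage that would follow the int32 accumulation (not ACL code; the float multiplier is a simplification, real kernels typically use a fixed-point multiplier and shift). The bias is added in the int32 accumulator domain (an Int16 bias would simply be widened first), each column is rescaled by its per-channel multiplier, and the result is saturated to the int8 output range:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar sketch of a gemmlowp-style output stage: int32 accumulators (M x N),
// one int32 bias and one multiplier per output channel n, where
// multiplier[n] = activation_scale * weight_scale[n] / output_scale,
// producing int8 output with a zero point.
std::vector<int8_t> quantize_down_int32_to_int8(const std::vector<int32_t> &acc,
                                                const std::vector<int32_t> &bias,
                                                const std::vector<float>   &multiplier,
                                                int32_t out_zero_point,
                                                int M, int N)
{
    std::vector<int8_t> out(static_cast<std::size_t>(M) * N);
    for (int m = 0; m < M; ++m)
    {
        for (int n = 0; n < N; ++n)
        {
            // Bias is added in the accumulator domain, then the sum is rescaled
            // to the output quantisation and saturated to [-128, 127].
            const int64_t biased = static_cast<int64_t>(acc[m * N + n]) + bias[n];
            const int32_t q      = static_cast<int32_t>(std::lround(static_cast<double>(biased) * multiplier[n])) + out_zero_point;
            out[m * N + n]       = static_cast<int8_t>(std::clamp(q, -128, 127));
        }
    }
    return out;
}
```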

mason20240920 · Mar 16 '25 11:03