Dynamic support for gemmlowp

Open mason20240920 opened this issue 10 months ago • 6 comments

Excited to see the dynamic GEMM update. Is there any plan to add the same support to gemmlowp?

mason20240920 · Mar 10 '25 07:03

Hi @mason20240920

Would you please share more details of your use-case? Are you trying to run a model that requires dynamic shapes support in gemmlowp? Which one?

morgolock · Mar 12 '25 10:03

We need to build an LLM application on top of ACL, and in GPT-style models the sequence length is dynamic. That's why we need dynamic operators.
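To illustrate with a minimal sketch (plain C++, not ACL API; all names here are made up): the shapes a GPT-style decoder feeds into its GEMM/MatMul calls depend on runtime values, so they cannot be fixed at configure time. Prefill sees M = prompt length, and the attention score MatMul grows with the KV cache on every generated token:

```cpp
#include <cstddef>

// Illustrative only (not ACL API): the dimensions seen by the GEMM/MatMul
// calls of a decoder-only (GPT-style) model. m/n/k are the usual GEMM dims.
struct GemmShape
{
    std::size_t m;
    std::size_t n;
    std::size_t k;
};

// Prefill projection GEMM: activations are [prompt_len x hidden], weights are
// [hidden x out_features], so M depends on the prompt length, which is only
// known at runtime.
GemmShape prefill_projection_shape(std::size_t prompt_len, std::size_t hidden, std::size_t out_features)
{
    return {prompt_len, out_features, hidden};
}

// Attention score MatMul Q * K^T for one head at decode step t:
// Q is [1 x head_dim], K^T is [head_dim x (prompt_len + t)], so one dimension
// grows with every generated token.
GemmShape attention_scores_shape(std::size_t prompt_len, std::size_t t, std::size_t head_dim)
{
    return {1, prompt_len + t, head_dim};
}
```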

mason20240920 · Mar 12 '25 10:03

Hi @mason20240920, there have been a few use cases mentioned to us around Gemmlowp, so it's not firmly planned for delivery yet, but I can say it's in the discovery phase. I have two questions:

  • Which data type combinations are you interested in, i.e. Int8, UInt8, etc.?
  • Are there any other operators that need to work with dynamic shapes?

gunes-arm · Mar 12 '25 13:03

Thank you for your prompt response

  • We are currently using per-channel int8 quantization for the weights and int8 for the activations, similar to the A8W8 approach used in the Qwen and Llama models (a minimal sketch follows after this list).
  • Yes, we need Softmax, Split, and other operations to be dynamic. Those kernels are relatively easy to adapt ourselves, since only the window sizes need adjusting, but GEMM and MatMul also require the assembly kernels to cooperate.
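As a minimal sketch of the scheme above (scalar reference code, not ACL or gemmlowp kernels; names and layouts are illustrative assumptions): int8 activations with a per-tensor zero point, symmetric per-channel int8 weights, and int32 accumulation, with requantisation handled by a separate output stage:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar reference for the A8W8 core: `a` is an M x K int8 activation matrix
// with an asymmetric per-tensor zero point, `w` is a K x N int8 weight matrix
// quantised symmetrically with one scale per output channel (column n).
// Accumulation is done in int32; requantisation back to int8 is a separate
// output stage.
std::vector<int32_t> a8w8_accumulate(const std::vector<int8_t> &a, int32_t a_zero_point,
                                     const std::vector<int8_t> &w,
                                     int M, int N, int K)
{
    std::vector<int32_t> acc(static_cast<std::size_t>(M) * N, 0);
    for (int m = 0; m < M; ++m)
    {
        for (int n = 0; n < N; ++n)
        {
            int32_t sum = 0;
            for (int k = 0; k < K; ++k)
            {
                // Subtract the activation zero point; symmetric per-channel
                // weights have a zero offset by definition.
                sum += (static_cast<int32_t>(a[m * K + k]) - a_zero_point) *
                       static_cast<int32_t>(w[k * N + n]);
            }
            acc[m * N + n] = sum;
        }
    }
    return acc;
}
```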

mason20240920 · Mar 13 '25 01:03

Thanks for the details @mason20240920,

  • Which hardware (or hardware features) are you targeting?
  • Is the output Int8 as well?
  • Is there a bias, and is it Int32?

gunes-arm · Mar 14 '25 11:03

  1. Due to memory bandwidth limitations, CPUs currently remain the optimal choice for on-device inference of large models, so we are targeting the CPU.
  2. Yes, the output is Int8.
  3. The bias can be Int16, but Int32 is okay (see the sketch below).
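To make the last two answers concrete, here is a hedged scalar sketch of the output stage that would follow the int32 accumulation (not ACL code; the float multiplier is a simplification, real kernels typically use a fixed-point multiplier and shift). The bias is added in the int32 accumulator domain (an Int16 bias would simply be widened first), each column is rescaled by its per-channel multiplier, and the result is saturated to the int8 output range:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar sketch of a gemmlowp-style output stage: int32 accumulators (M x N),
// one int32 bias and one multiplier per output channel n, where
// multiplier[n] = activation_scale * weight_scale[n] / output_scale,
// producing int8 output with a zero point.
std::vector<int8_t> quantize_down_int32_to_int8(const std::vector<int32_t> &acc,
                                                const std::vector<int32_t> &bias,
                                                const std::vector<float>   &multiplier,
                                                int32_t out_zero_point,
                                                int M, int N)
{
    std::vector<int8_t> out(static_cast<std::size_t>(M) * N);
    for (int m = 0; m < M; ++m)
    {
        for (int n = 0; n < N; ++n)
        {
            // Bias is added in the accumulator domain, then the sum is rescaled
            // to the output quantisation and saturated to [-128, 127].
            const int64_t biased = static_cast<int64_t>(acc[m * N + n]) + bias[n];
            const int32_t q      = static_cast<int32_t>(std::lround(static_cast<double>(biased) * multiplier[n])) + out_zero_point;
            out[m * N + n]       = static_cast<int8_t>(std::clamp(q, -128, 127));
        }
    }
    return out;
}
```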

mason20240920 · Mar 16 '25 11:03