onnxruntime

[js/webgpu] optimize matmulnbits operator

Open xhcao opened this issue 1 year ago • 5 comments

Description

Motivation and Context

xhcao avatar Apr 18 '24 06:04 xhcao

@satyajandhyala PTAL, thanks.

xhcao avatar Apr 19 '24 08:04 xhcao

Also add @guschmue @fs-eire

xhcao avatar Apr 19 '24 08:04 xhcao

Profiled with the change: the kernel times for MatMulNBits.float16 are significantly (about 8x) higher than on main.

guschmue avatar Apr 22 '24 19:04 guschmue

> profiled with the change - the kernel times for MatMulNBits.float16 are significantly (about 8x) higher than on main

Hi guschmue, which GPU did you test on? An NVIDIA RTX 4090?

xhcao avatar Apr 23 '24 05:04 xhcao

For Phi2, every token after the first is generated with specially shaped MatMulNBits inputs: the operation is a vector * matrix product. The inputs are [1, 2560] * [2560, 2560] or [1, 2560] * [10240, 2560], and the results are [1, 2560] or [1, 10240]. The general algorithm launches many invalid logical threads for these shapes, so this change adds a special algorithm to handle that case.
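
To illustrate the thread-waste argument, here is a rough sketch in TypeScript. It is not the shader from this change: the tile sizes `TILE_M` and `TILE_N` and both dispatch functions are hypothetical, and it assumes a general kernel that maps one logical thread to one output element in 2D tiles. Under that assumption, a single-row output leaves most rows of every tile idle, while a 1D layout along N does not.

```typescript
// Hypothetical tile sizes for illustration; not taken from the onnxruntime source.
const TILE_M = 8;
const TILE_N = 8;

interface DispatchStats {
  launched: number; // logical threads launched
  useful: number;   // threads that map to a real output element
}

// Assumed general algorithm: workgroups cover TILE_M x TILE_N output tiles,
// one thread per output element of an [m, n] result.
function tiledDispatch(m: number, n: number): DispatchStats {
  const groupsM = Math.ceil(m / TILE_M);
  const groupsN = Math.ceil(n / TILE_N);
  return { launched: groupsM * groupsN * TILE_M * TILE_N, useful: m * n };
}

// Assumed vector * matrix special case: 1D workgroups laid out along N only.
function vectorDispatch(n: number): DispatchStats {
  const groupSize = TILE_M * TILE_N;
  const groups = Math.ceil(n / groupSize);
  return { launched: groups * groupSize, useful: n };
}

// Fraction of launched threads that do real work.
function utilisation(s: DispatchStats): number {
  return s.useful / s.launched;
}

// Output shapes quoted in the comment above: [1, 2560] and [1, 10240].
for (const n of [2560, 10240]) {
  const general = tiledDispatch(1, n);
  const special = vectorDispatch(n);
  console.log(`N=${n}: general ${utilisation(general)}, special ${utilisation(special)}`);
}
```

With these assumed 8x8 tiles, only one row in eight is valid when M is 1, so the general layout launches eight times as many threads as there are outputs. The real ratio depends on the actual workgroup and tile configuration.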

xhcao avatar Apr 29 '24 08:04 xhcao

Closing since we have switched to the new EP.

guschmue avatar May 01 '25 16:05 guschmue