[js/webgpu] optimize matmulnbits operator
Description
Motivation and Context
@satyajandhyala PTAL, thanks.
Also adding @guschmue @fs-eire.
Profiled with the change: the kernel times for MatMulNBits.float16 are significantly higher (about 8x) than on main.
Hi @guschmue, which GPU was used for testing? An NVIDIA RTX 4090?
For Phi-2, after the first token, the input shapes of MatMulNBits during token generation are special: they are vector * matrix products, with inputs of [1, 2560] * [2560, 2560] or [1, 2560] * [10240, 2560] and results of [1, 2560] or [1, 10240]. The general algorithm would launch many invalid logical threads for these shapes, so this change adds a special algorithm to handle that situation.
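For context, here is a minimal TypeScript sketch (not the actual onnxruntime-web code; the tile sizes, workgroup size, and `dispatchSize` helper are all hypothetical) of why a dedicated vector-times-matrix path avoids the wasted threads when M == 1:

```typescript
// Hypothetical tile sizes for a general tiled matmul kernel and a
// hypothetical workgroup size for a dedicated M == 1 (GEMV) path.
const TILE_M = 16;
const TILE_N = 16;
const GEMV_WORKGROUP = 64;

// Returns the number of workgroups to dispatch along each axis.
function dispatchSize(m: number, n: number): { x: number; y: number } {
  if (m === 1) {
    // Vector * matrix: tile only the N output columns, so every
    // logical thread computes a useful output element.
    return { x: Math.ceil(n / GEMV_WORKGROUP), y: 1 };
  }
  // General matmul: tile both output dimensions. For m === 1 this
  // would still launch TILE_M rows of threads per workgroup, of
  // which TILE_M - 1 rows do no useful work.
  return { x: Math.ceil(n / TILE_N), y: Math.ceil(m / TILE_M) };
}

// Phi-2 decode step: [1, 2560] * [2560, 2560] -> [1, 2560]
console.log(dispatchSize(1, 2560)); // { x: 40, y: 1 }
```

Under these assumed sizes, the general path would dispatch 160 workgroups of 16x16 threads (40960 logical threads for 2560 outputs), while the GEMV path dispatches 40 workgroups of 64 threads with no idle rows.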
Closing since we switched to the new EP.