[js/webgpu] optimize matmulnbits operator
Description
Motivation and Context
@satyajandhyala PTAL, thanks.
Also adding @guschmue @fs-eire.
Profiled with the change: the kernel times for MatMulNBits.float16 are significantly higher (about 8x) than on main.
Hi @guschmue, which GPU was used for testing? An NVIDIA RTX 4090?
For Phi-2, after the first token, the input shapes of MatMulNBits during token generation are special: they are vector * matrix products, with inputs of [1, 2560] * [2560, 2560] or [1, 2560] * [10240, 2560] and results of [1, 2560] or [1, 10240]. The general algorithm would launch many invalid logical threads for these shapes, so this change adds a special algorithm to handle that situation.
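For context, here is a minimal TypeScript sketch (not the actual onnxruntime-web code; the tile sizes, workgroup size, and `dispatchSize` helper are all hypothetical) of why a dedicated vector-times-matrix path avoids the wasted threads when M == 1:

```typescript
// Hypothetical tile sizes for a general tiled matmul kernel and a
// hypothetical workgroup size for a dedicated M == 1 (GEMV) path.
const TILE_M = 16;
const TILE_N = 16;
const GEMV_WORKGROUP = 64;

// Returns the number of workgroups to dispatch along each axis.
function dispatchSize(m: number, n: number): { x: number; y: number } {
  if (m === 1) {
    // Vector * matrix: tile only the N output columns, so every
    // logical thread computes a useful output element.
    return { x: Math.ceil(n / GEMV_WORKGROUP), y: 1 };
  }
  // General matmul: tile both output dimensions. For m === 1 this
  // would still launch TILE_M rows of threads per workgroup, of
  // which TILE_M - 1 rows do no useful work.
  return { x: Math.ceil(n / TILE_N), y: Math.ceil(m / TILE_M) };
}

// Phi-2 decode step: [1, 2560] * [2560, 2560] -> [1, 2560]
console.log(dispatchSize(1, 2560)); // { x: 40, y: 1 }
```

Under these assumed sizes, the general path would dispatch 160 workgroups of 16x16 threads (40960 logical threads for 2560 outputs), while the GEMV path dispatches 40 workgroups of 64 threads with no idle rows.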
Closing since we switched to the new EP.