Yufeng Li
Could you explain a little more about the variable-length support? Does it mean the runtime can support inputs with different sequence lengths in a single session, like [batch, 8],...
Fix bug: #12556.
### Description
The crash caused by neural_speed turns out to be a rare corner case. Turn neural_speed on by default.

### Motivation and Context
N * blks can be odd. There is no need to iterate by 2, and accessing scales[i + 1] / 16 on the last step causes a heap-buffer-overflow: https://github.com/intel/neural-speed/blob/bc5ee16f73d941afe80914bdf9c9c9523c39c576/bestla/bestla/bestla_prologue_b.h#L460
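To illustrate the failure mode, here is a minimal sketch in Python rather than the actual bestla C++ code. The function names `scale_pairwise` and `scale_each` are hypothetical, and the division by 16 is only assumed to mirror the `scales[i + 1] / 16` expression mentioned above. Python raises `IndexError` where C++ would silently read or write past the end of the heap buffer.

```python
def scale_pairwise(scales):
    """Stride-2 loop (hypothetical): touches scales[i] and scales[i + 1] per step.

    When len(scales) is odd, the last step indexes one element past the end.
    """
    out = list(scales)
    for i in range(0, len(out), 2):
        out[i] = out[i] / 16
        out[i + 1] = out[i + 1] / 16  # out of range on the last step if length is odd
    return out


def scale_each(scales):
    """Stride-1 loop (hypothetical): safe for any length, odd or even."""
    return [s / 16 for s in scales]
```

With an even number of elements both functions agree; with an odd number only the stride-1 version is safe, which is why iterating one element at a time removes the overflow.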
The original AWQ (https://github.com/mit-han-lab/llm-awq) supports a fake backend. Could we add support for it? It is very useful to save the model as fp16/fp32 and convert it to other...