Incomplete implementation of SparseGEMV
https://github.com/FasterDecoding/TEAL/blob/fb7373c93ac3594817c9ee64d4e08b47430a1822/kernels/sparse_gemv.py#L271
Hi, I noticed that the SparseGEMV kernel only handles the case where batch_size=1 and seqlen=1. Beyond that, the kernel produces wrong outputs (see the repro sketch below).
Is it expected that this kernel only works for the decoding stage? If so, where is the implementation for Appendix A.4?
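For reference, here is a minimal repro sketch of what I mean. The entry point `sparse_gemv(x, W, threshold)` is a placeholder name/signature on my part, not necessarily what kernels/sparse_gemv.py actually exports; adjust the import to match:

```python
import torch
# from kernels.sparse_gemv import sparse_gemv  # assumed import path / name

def dense_reference(x, W, threshold):
    # Zero low-magnitude activations, then run an ordinary dense matmul.
    x_sparse = torch.where(x.abs() >= threshold, x, torch.zeros_like(x))
    return x_sparse @ W.t()

torch.manual_seed(0)
W = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# Only the (batch=1, seqlen=1) case matches the dense reference;
# the other shapes come back wrong.
for batch, seqlen in [(1, 1), (1, 8), (4, 1)]:
    x = torch.randn(batch, seqlen, 4096, device="cuda", dtype=torch.float16)
    ref = dense_reference(x, W, threshold=0.5)
    # out = sparse_gemv(x, W, threshold=0.5)  # kernel under test
    # print((batch, seqlen), torch.allclose(out, ref, atol=1e-2))
```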
They mention in A.4 that a single-batch setting is used. That said, I don't think it's appropriate to compare against 2:4 sparsity here, since 2:4 sparsity is not a good fit for small-batch matmuls.
I think this method (and therefore the kernel) only targets the single-batch setting in the decoding stage, as other activation-sparsity methods do as well. The advantage of activation sparsity is latency, not throughput: at batch size 1 the GEMV is memory-bound, so skipping the weight columns that correspond to zeroed activations directly cuts memory traffic. Appendix A.4 is probably focused on latency (i.e., speedup).
Yes, it's primarily a decoding kernel. A.4 is based on loss evals, where the sparsity (both weight and activation) is simulated.
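For anyone else reading: "simulated" sparsity like this can be done entirely in PyTorch by masking tensors and running ordinary dense matmuls, so the evals measure model quality rather than kernel speed. A minimal sketch of what that could look like (function names are mine, and I'm assuming a magnitude-based 2:4 mask, which may differ from what the paper did):

```python
import torch

def simulate_activation_sparsity(x: torch.Tensor, threshold: float) -> torch.Tensor:
    # TEAL-style activation sparsity: zero entries below the magnitude threshold.
    return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))

def simulate_24_weight_sparsity(W: torch.Tensor) -> torch.Tensor:
    # 2:4 semi-structured sparsity: in each group of 4 consecutive weights,
    # keep the 2 with the largest magnitude and zero the rest.
    out_f, in_f = W.shape
    groups = W.reshape(out_f, in_f // 4, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(-1, keep, True)
    return (groups * mask).reshape(out_f, in_f)
```

The masked tensors then feed normal dense matmuls, so no sparse kernel is involved in producing the A.4 loss numbers.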