Why don't we use the pruned tokens to compute attention in the prefill stage?

Open xinhaoH opened this issue 6 months ago • 0 comments

I noticed that in the prefill stage, although the number of cached tokens is pruned to the max capacity prompt (e.g., 2k), attention is still computed over the full prompt. For example, with a 6k-token prompt, the prefill stage caches only the 2k most important tokens, yet the attention computation itself still runs over all 6k tokens rather than the 2k that are kept.
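To make the setup concrete, here is a minimal pure-Python sketch of the behavior as I understand it. It is not the repository's code: the function names, the `capacity` and `window` parameters, and the importance rule (attention mass received from the last `window` queries) are all assumptions for illustration, and the sizes are shrunk from 6k/2k to 12/4.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(q, k):
    """Full causal attention: row i holds the weights of query i over keys 0..i."""
    d = len(q[0])
    weights = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Pad masked (future) positions with zero weight.
        weights.append(softmax(scores) + [0.0] * (len(k) - i - 1))
    return weights


def prefill_then_prune(q, k, v, capacity, window):
    """Hypothetical prefill: attend over ALL n tokens, then keep `capacity` KV entries."""
    n = len(k)
    weights = causal_attention_weights(q, k)  # n x n, full attention
    out = [
        [sum(w[j] * v[j][c] for j in range(n)) for c in range(len(v[0]))]
        for w in weights
    ]
    if n <= capacity:
        return out, k, v, list(range(n))
    # Assumed importance rule: attention mass each earlier token receives
    # from the last `window` queries. These scores come from the full weights.
    prefix = n - window
    scores = [sum(weights[i][j] for i in range(prefix, n)) for j in range(prefix)]
    top = sorted(range(prefix), key=lambda j: scores[j], reverse=True)
    keep = sorted(top[: capacity - window]) + list(range(prefix, n))
    return out, [k[j] for j in keep], [v[j] for j in keep], keep


random.seed(0)
n, d = 12, 4  # stand-ins for the 6k prompt and the head dimension
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

out, k_cache, v_cache, keep = prefill_then_prune(q, k, v, capacity=4, window=2)
print(len(out), len(k_cache))  # 12 4
```

In this sketch the prefill output covers all 12 tokens while only 4 key/value pairs survive in the cache, which mirrors the 6k versus 2k gap described above. Note that the importance scores in the sketch are themselves read off the full attention weights.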

Why don't we use the pruned 2k tokens to compute attention in the prefill stage?

Jul 29 '25 11:07 xinhaoH