Question about the Tree Attention Mechanism

Open chansonzhang opened this issue 1 year ago • 0 comments

Suppose the first MEDUSA head generates the top-2 predictions "It is" and "It's", while the second MEDUSA head generates the top-3 predictions "difficult", "a", and "not". This results in a total of 2 × 3 = 6 candidates.

The tree-structured attention mechanism ensures that each token can only attend to its predecessors within the same continuation. For instance, the token "difficult" can only attend to "It is" or "It's", but not to "not" or "a", as they belong to different continuations.
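To make the masking rule concrete, here is a minimal sketch of how such a mask could be built for the example above. It is not taken from the MEDUSA code base: the `nodes` layout, the `build_tree_mask` helper, and the choice to duplicate each head-2 token once per head-1 parent are all assumptions made for illustration, and the prompt prefix (which every node would also attend to) is left out.

```python
# Illustrative sketch only; names and layout are hypothetical, not MEDUSA's API.
head1 = ["It is", "It's"]            # top-2 predictions of head 1
head2 = ["difficult", "a", "not"]    # top-3 predictions of head 2

# Each node is (token, parent_index). Head-1 nodes have no parent in the tree.
nodes = [(tok, None) for tok in head1]
# Assumption: every head-2 token is copied under every head-1 node,
# giving 2 x 3 = 6 leaves, one per candidate continuation.
for parent_idx in range(len(head1)):
    for tok in head2:
        nodes.append((tok, parent_idx))


def build_tree_mask(nodes):
    """Return mask[i][j] == True if node i may attend to node j.

    A node may attend to itself and to its ancestors in the tree,
    i.e. only to tokens in the same continuation.
    """
    n = len(nodes)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j is not None:          # walk up to the root of this branch
            mask[i][j] = True
            j = nodes[j][1]
    return mask


mask = build_tree_mask(nodes)

# Print the mask row by row: 1 = may attend, 0 = masked out.
for (tok, _), row in zip(nodes, mask):
    print(f"{tok:>10}  " + " ".join("1" if allowed else "0" for allowed in row))
```

In this layout, the copy of "difficult" under "It is" (index 2) sees itself and index 0, but neither "It's" nor its siblings "a" and "not".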

So,

  • "difficult" can attend to "It is".
  • "difficult" is generated by MEDUSA head 2, and "It is" is generated by MEDUSA head 1.
  • Head 1 and head 2 run in parallel.

This means that when head 2 is generating "difficult", head 1 has not necessarily generated "It is" yet. If "It is" does not exist at the moment "difficult" is being generated, how can "difficult" attend to it?

chansonzhang, Nov 18 '24 08:11