Embed box before multihead attention
Thank you for your idea and the repo. Since the box embedding and w_g stay the same across the stacked multi-head attention layers and do not depend on k, q, v, would it be proper to move the box-embedding step to the beginning, before the multi-head attention, to avoid embedding the boxes in every EncoderLayer again and again? I tried this and found it reduces XE training time from 22h to 18h (on a GTX 1080Ti) without obvious performance degradation (CIDEr 1.1495 vs. 1.1485).
@luo3300612 Thanks for your observations.
Equations (6) and (7) in the paper show that the box embedding Emb(\lambda) is indeed just a function of the bounding-box displacements, and is therefore constant across all the self-attention layers of the transformer encoder.
Therefore, as you say, the computation of Emb(\lambda) can be moved out of the self-attention layer.
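For illustration, the hoisting could look roughly like the sketch below. The names (`RelationEncoder`, `geometric_embedding`, the layer signature) are made up for this example and are not the actual classes in this repo; the point is only that Emb(\lambda) is computed once per forward pass and then handed to every layer.

```python
import copy
import torch.nn as nn

# Minimal sketch, assuming a geometric_embedding module that implements Emb(lambda)
# and an encoder layer that accepts the precomputed embedding as an argument.
class RelationEncoder(nn.Module):
    def __init__(self, layer, num_layers, geometric_embedding):
        super().__init__()
        self.geometric_embedding = geometric_embedding  # computes Emb(lambda) from boxes
        # Deep-copy so each layer keeps its own parameters (e.g. its own W_G).
        self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(num_layers)])

    def forward(self, x, boxes, mask=None):
        # Emb(lambda) depends only on the bounding boxes, so compute it once here
        # instead of once inside every EncoderLayer.
        box_emb = self.geometric_embedding(boxes)  # shape (B, N, N, d_g)
        for layer in self.layers:
            x = layer(x, box_emb, mask=mask)
        return x
```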
However, as you can see in equation (7), the geometric weights w_g are a function of a learnable weight matrix W_G.
These learnable matrices are allowed to be different in different self-attention layers.
Therefore, the computation of w_g cannot be moved out of the self-attention layer.
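To make the distinction concrete, here is a sketch of equation (7) with illustrative names (not the repo's actual module): the shared Emb(\lambda) goes in, but each layer owns its own per-head linear projection W_G, so ReLU(Emb(\lambda) W_G) has to be evaluated inside every layer.

```python
import torch
import torch.nn as nn

# Sketch of Eq. (7). Every self-attention layer would hold its own instance of
# this module, i.e. its own learnable W_G per head, so the resulting w_g cannot
# be hoisted out of the layer even though box_emb is shared.
class GeometricAttentionWeights(nn.Module):
    def __init__(self, d_g=64, num_heads=8):
        super().__init__()
        self.w_G = nn.ModuleList([nn.Linear(d_g, 1) for _ in range(num_heads)])

    def forward(self, box_emb):
        # box_emb: the shared Emb(lambda), shape (B, N, N, d_g).
        # Returns w_g of shape (B, num_heads, N, N): one geometric weight map
        # per head, computed as ReLU(Emb(lambda) W_G).
        per_head = [torch.relu(linear(box_emb)) for linear in self.w_G]  # each (B, N, N, 1)
        return torch.cat(per_head, dim=-1).permute(0, 3, 1, 2)
```

With this split, only Emb(\lambda) is shared across layers; each layer's w_g still reflects its own trained W_G.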
Here is the computation of w_g in our code (notice the linear layer l()):
https://github.com/yahoo/object_relation_transformer/blob/f21674d5c1095fc104ff3b69bfa41cfeea7568db/models/RelationTransformerModel.py#L293
Does this answer your question?