
Question about why the Add & Norm structure of the transformer network differs from the typical transformer one

Open Liaoqing-up opened this issue 2 years ago • 3 comments

https://github.com/TuSimple/centerformer/blob/96aa37503dc900d1aebeb7c1086c33bbd0c01d26/det3d/models/utils/transformer.py#L267-L279 In the code, the residual connection only adds the original input, and the sum does not pass through a norm layer. Add and Norm are not applied together as a single unit, which differs from the typical transformer structure (where the result of Add followed by Norm becomes the input to the next sub-layer). Is there any special consideration behind this design?
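
For clarity, here is a minimal PyTorch sketch of the "typical" post-norm (Add & Norm) structure I mean; the names are illustrative and are not taken from the CenterFormer code:

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Typical transformer sub-layer: residual add, then LayerNorm (post-norm)."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        # Add & Norm as one unit: the normalized sum feeds the next sub-layer.
        return self.norm(x + out)
```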

Liaoqing-up avatar Feb 19 '23 14:02 Liaoqing-up

I used prenorm inside each layer. https://github.com/TuSimple/centerformer/blob/96aa37503dc900d1aebeb7c1086c33bbd0c01d26/det3d/models/utils/transformer.py#L218-L238
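
For reference, the pre-norm pattern looks roughly like this (an illustrative sketch, not the exact code at the link above):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm sub-layer: LayerNorm is applied before attention, and the raw
    (un-normalized) input is used for the residual add."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        # The residual adds the original x, which never passed through this norm.
        return x + out
```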

edwardzhou130 avatar Feb 19 '23 23:02 edwardzhou130

> I used prenorm inside each layer.
>
> https://github.com/TuSimple/centerformer/blob/96aa37503dc900d1aebeb7c1086c33bbd0c01d26/det3d/models/utils/transformer.py#L218-L238

I see, but I wonder if you have tried Add & Norm after each layer, meaning the input to the residual skip connection is the feature that has already passed through the Norm. Is it possible that the results of these two structures do not differ much?

Liaoqing-up avatar Feb 20 '23 01:02 Liaoqing-up

Sorry, I haven't tried Add & Norm after each layer. Have you tried this before, and were the results better with that implementation?

edwardzhou130 avatar Feb 22 '23 17:02 edwardzhou130