2 comments by guanxiao_li
I have the same question about (1); did you figure it out? As for (2), the MLP in a single transformer block has two linear layers, as in `class Mlp(nn.Module):` (see the sketch below).
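For reference, the two linear layers are `fc1` and `fc2` in the `Mlp` block; a minimal sketch matching the usual Swin/timm-style definition (the exact defaults in the repo may differ):

```python
import torch.nn as nn

class Mlp(nn.Module):
    """Two-layer MLP used inside a single transformer block."""
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 act_layer=nn.GELU, drop=0.0):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)   # first linear layer
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)  # second linear layer
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)   # expand
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)   # project back
        x = self.drop(x)
        return x
```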
Hi @choasup, how did `((2 ** (self.num_layers - 1)) ** 2)` get simplified into `(2 ** self.num_layers)`? I'm still confused about the FLOPs calculation for the norm layer here, as in #165...
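To make the confusion concrete, here is a minimal sketch comparing the two expressions for a few depths (variable name taken from the snippet above; the range is just for illustration). `(2 ** (num_layers - 1)) ** 2` is the area reduction after `num_layers - 1` patch-merging downsamples, while `2 ** num_layers` is the divisor the question asks about; the two agree only at `num_layers == 2`:

```python
# Compare the two divisors from the question for several stage counts.
for num_layers in range(1, 6):
    merged_area = (2 ** (num_layers - 1)) ** 2  # area reduction after num_layers - 1 merges
    divisor = 2 ** num_layers                   # expression the question asks about
    print(num_layers, merged_area, divisor, merged_area == divisor)
```

For example, at `num_layers = 4` this prints `64` vs `16`, so the "simplification" is not a pure algebraic identity, which seems to be the source of the confusion.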