
linear probe

Open · aaronsarna opened this issue 3 years ago · 7 comments

In Table 6 and Section 4.2 of your paper, you show results on linear probe (even though I recognize that is not the major contribution of this work). It says that you use an "intermediate layer of ViT-B" for the linear probe. Can you specify which layer is used? Also, is the probe just on the [CLS] token output (which would be surprising, as this token has no loss on it during pretraining), or does it do something like average over the patch token outputs?

aaronsarna avatar Feb 04 '22 13:02 aaronsarna

Hi, @aaronsarna . We use the 8th layer of ViT for linear evaluation. In addition, we do not use the [CLS] token. Instead, we apply average pooling over the image patch tokens.
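For anyone mapping this onto code, here is a minimal sketch of that feature extraction, assuming a standard ViT that exposes `patch_embed`, `cls_token`, `pos_embed` and `blocks` attributes (names are illustrative, not the actual SimMIM implementation):

```python
import torch

@torch.no_grad()
def probe_features(vit, images, layer_idx=8, has_cls_token=True):
    """Average-pooled patch-token features from the `layer_idx`-th ViT block.

    Assumes a ViT exposing `patch_embed`, `cls_token`, `pos_embed` and
    `blocks`; adapt the attribute names to your own model.
    """
    x = vit.patch_embed(images)                        # (B, N, C) patch tokens
    if has_cls_token:
        cls = vit.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)                 # prepend [CLS] as usual
    x = x + vit.pos_embed
    for blk in vit.blocks[:layer_idx]:                 # stop after the 8th block
        x = blk(x)
    if has_cls_token:
        x = x[:, 1:]                                   # drop [CLS]; only patch tokens are pooled
    return x.mean(dim=1)                               # average pooling over patch tokens
```

A linear classifier is then trained on these pooled features while the encoder stays frozen.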

impiga avatar Feb 06 '22 13:02 impiga

Thanks, that was helpful.

I've been working on reproducing your ViT results in JAX, and so far the most critical piece in getting it to work has been removing the LayerNorm at the end of the ViT during pretraining. I don't see that mentioned anywhere in the paper, which made it very hard to spot. Without that change, the linear probe performs essentially at chance. If you release a new version of the paper, it would be good to add that point.
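In case it helps others reproducing this, the change amounts to something like the following, assuming the encoder keeps its final LayerNorm in an attribute called `norm` (an illustrative name, not necessarily what any given ViT implementation uses):

```python
import torch.nn as nn

def strip_final_layernorm(vit):
    """Bypass the ViT's final LayerNorm during masked-image pretraining.

    Assumes the final LayerNorm is stored as `vit.norm`; swapping it for an
    Identity leaves the rest of the encoder untouched.
    """
    vit.norm = nn.Identity()
    return vit
```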

I'm also curious whether the linear probe you use is just a linear layer, or if you do what MAE does and add a BatchNorm before it? It does seem at least necessary to add a LayerNorm to the output of ViT layer 8 to get decent performance.
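To make the two variants concrete, here is a rough sketch of the probe heads being compared; the embedding size and class count are ViT-B / ImageNet-1K placeholders, and the BatchNorm settings follow MAE's published linear-probe recipe as far as I can tell, so double-check against their code:

```python
import torch.nn as nn

embed_dim, num_classes = 768, 1000  # ViT-B features, ImageNet-1K labels

# MAE-style head: feature whitening with a non-affine BatchNorm before the linear layer.
mae_style_probe = nn.Sequential(
    nn.BatchNorm1d(embed_dim, affine=False, eps=1e-6),
    nn.Linear(embed_dim, num_classes),
)

# Variant discussed above: a LayerNorm on the pooled layer-8 features instead.
layernorm_probe = nn.Sequential(
    nn.LayerNorm(embed_dim),
    nn.Linear(embed_dim, num_classes),
)
```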

aaronsarna avatar Feb 11 '22 17:02 aaronsarna

Hi @aaronsarna, I'm also curious about the linear probing. Could you share the settings of your final experiment? It seems that a BatchNorm before the linear layer doesn't give a good result. Thanks a lot for your reply.

oscar66oliver avatar Aug 05 '22 14:08 oscar66oliver

As I recall, the thing that helped the most for SimMIM linear probe was to mask attention to the mask tokens in the ViT during pretraining. If you don't do that then the unmasked images fed in during linear probe training are effectively out of distribution.
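A rough sketch of what that attention restriction could look like, assuming mask tokens have already been substituted at the masked patch positions and a per-image boolean mask marks them (illustrative code, not SimMIM's actual implementation):

```python
import torch

def masked_attention(q, k, v, is_masked):
    """Self-attention in which no token may attend to a mask-token position.

    q, k, v: (B, heads, N, head_dim); is_masked: (B, N) bool, True where the
    patch was replaced by the learnable mask token.
    """
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale            # (B, heads, N, N)
    key_mask = is_masked[:, None, None, :]              # broadcast over heads and queries
    attn = attn.masked_fill(key_mask, float("-inf"))    # block mask tokens as attention keys
    attn = attn.softmax(dim=-1)
    return attn @ v
```

Blocking the masked positions as keys means visible tokens only ever aggregate real image content during pretraining, which is closer to what the encoder sees when full, unmasked images are fed in at linear-probe time.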

aaronsarna avatar Aug 05 '22 14:08 aaronsarna

@aaronsarna Thanks for your reply. Do you mean the SimMIM baseline can't reach the linear probe performance reported in the paper without manually masking attention to the mask tokens?

oscar66oliver avatar Aug 05 '22 14:08 oscar66oliver

I wasn't able to reproduce it without that change. It's very possible I had a bug somewhere, though.

aaronsarna avatar Aug 05 '22 14:08 aaronsarna

Okay, thanks a lot for your help

oscar66oliver avatar Aug 05 '22 14:08 oscar66oliver