Why is an attention mask used in the text encoder?
I would like to understand why masking is used in the text encoder. This doesn't seem necessary for CLIP since it does not perform an autoregressive task. Maybe my understanding is incomplete. The relevant code is located at line 286 in model.py.
What up @ZOUHAN1. The paper released with the repo mentions on page five that "Masked self-attention was used in the text encoder to preserve the ability to initialize with a pre-trained language model or add language modeling as an auxiliary objective...". From my reading, the authors implemented masked attention mainly to keep the door open for initializing from, or experimenting with, pre-trained language models down the line.
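For context, the mask the question points to is a causal (autoregressive-style) mask: each token may attend only to itself and to earlier positions. Below is a minimal sketch of how such an additive mask is typically built in PyTorch; the function name is illustrative, not necessarily the one used in model.py.

```python
import torch

def build_causal_mask(context_length: int = 77) -> torch.Tensor:
    # Additive attention mask in the PyTorch convention: 0 where attention is allowed,
    # -inf where it is blocked, so each position only sees itself and earlier tokens.
    mask = torch.full((context_length, context_length), float("-inf"))
    mask.triu_(1)  # keep -inf strictly above the diagonal, zero on and below it
    return mask

print(build_causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```

A mask of this shape can be passed as the additive `attn_mask` to `torch.nn.MultiheadAttention`, which is exactly the pattern a pre-trained causal language model would expect.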
Hi @ZOUHAN1, even though it's not an AR task, masking is still needed to mask out the PAD tokens, i.e. the tokens with id = 0. If masking were not used, the text encoder would attend to the PAD tokens, which carry no meaning whatsoever.
Now you may ask: why are there PAD tokens in the first place? It's because the text encoder in CLIP expects a fixed sequence length of 77 tokens. If there aren't enough real tokens to reach length 77, the tokenizer simply pads the sequence with PAD tokens, as sketched below.
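To make the padding concrete, here is a minimal sketch of what a padded input and the corresponding padding mask look like. The token ids are made-up placeholders (not real CLIP BPE ids), and the boolean mask shown is the generic `key_padding_mask` convention used by `torch.nn.MultiheadAttention`, not necessarily what CLIP itself does.

```python
import torch

context_length = 77  # fixed sequence length the text encoder expects

# Illustrative only: pretend a caption tokenizes to these ids (made-up numbers),
# with start/end-of-text markers already included.
caption_tokens = [49406, 320, 1125, 539, 320, 2368, 49407]

# Pad with id 0 up to the fixed context length.
padded = caption_tokens + [0] * (context_length - len(caption_tokens))
text = torch.tensor(padded).unsqueeze(0)   # shape: (1, 77)

# Boolean padding mask in the usual PyTorch convention (True = ignore this position),
# e.g. what you would pass as key_padding_mask to nn.MultiheadAttention.
pad_mask = text == 0

print(text.shape, pad_mask.sum().item())   # torch.Size([1, 77]) 70
```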
This is true for any text transformer, not just CLIP. I hope this answers your question.