Special tokens treated as masking candidates
In the transformers library, the masking implementation excludes special tokens (e.g., [CLS], [SEP]) from the masking process. However, upon reviewing the implementation in sequence_packer.py,
https://github.com/AnswerDotAI/ModernBERT/blob/8c57a0f01c12c4953ead53d398a36f81a4ba9e38/src/sequence_packer.py#L284
it appears that these tokens are currently being treated as valid masking candidates.
Could you please confirm whether this behavior is intentional? If not, I suggest updating the masking logic to explicitly exclude special tokens, for instance by filtering them out of the candidate set before applying the mask, which would keep the behavior consistent with the transformers library. Adding unit tests that verify special tokens remain unmasked would also improve reliability.
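To illustrate the kind of filtering I have in mind, here is a minimal sketch in plain Python. The token ids (`101`/`102`/`103` for [CLS]/[SEP]/[MASK], as in BERT vocabularies) and the function names are hypothetical, not the actual sequence_packer.py API:

```python
import random

# Hypothetical special-token ids for illustration
# (e.g. [CLS]=101, [SEP]=102 in BERT-style vocabularies).
SPECIAL_IDS = {101, 102}

def masking_candidates(token_ids, special_ids=SPECIAL_IDS):
    """Return positions eligible for MLM masking, excluding special tokens."""
    return [i for i, tok in enumerate(token_ids) if tok not in special_ids]

def apply_mlm_mask(token_ids, mask_id=103, mlm_probability=0.3, rng=None):
    """Mask eligible positions with probability `mlm_probability`.

    Labels are -100 (ignored by the cross-entropy loss) everywhere
    except at masked positions, which keep the original token id.
    """
    rng = rng or random.Random(0)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i in masking_candidates(token_ids):
        if rng.random() < mlm_probability:
            labels[i] = inputs[i]
            inputs[i] = mask_id
    return inputs, labels
```

Because special-token positions never enter the candidate list, [CLS] and [SEP] can never be masked, and a unit test for this is a one-line assertion over the returned labels.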
Am I correct in my understanding, or is there something I might be missing?
Thank you for looking into this.
Hi! I noticed the same thing and found your issue. Have you tried training ModernBERT using the new setup where [CLS] and [SEP] tokens are not masked? If so, did you observe any unusual behavior in the loss curve?
I’m currently training the model on textual data that isn’t natural language, so I had to implement a custom masking strategy. When I exclude [CLS] and [SEP] from masking, the loss shows sudden spikes during training, but when I allow those tokens to be masked again, the loss curve becomes smooth and stable.