Special tokens treated as masking candidates
In the transformers library, the masking implementation excludes special tokens (e.g., [CLS], [SEP]) from the masking process. However, upon reviewing the implementation in sequence_packer.py,
https://github.com/AnswerDotAI/ModernBERT/blob/8c57a0f01c12c4953ead53d398a36f81a4ba9e38/src/sequence_packer.py#L284
it appears that these tokens are currently being treated as valid masking candidates.
Could you please confirm whether this behavior is intentional? If not, I suggest updating the masking logic to explicitly exclude special tokens, for instance by filtering them out of the candidate set before applying the mask, which would keep the behavior consistent with the transformers library. Adding unit tests that verify special tokens remain unmasked would also improve reliability.
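To illustrate the kind of filtering I have in mind, here is a minimal sketch in plain Python. The token ids (`101`/`102`/`103` for [CLS]/[SEP]/[MASK], as in BERT vocabularies) and the function names are hypothetical, not the actual sequence_packer.py API:

```python
import random

# Hypothetical special-token ids for illustration
# (e.g. [CLS]=101, [SEP]=102 in BERT-style vocabularies).
SPECIAL_IDS = {101, 102}

def masking_candidates(token_ids, special_ids=SPECIAL_IDS):
    """Return positions eligible for MLM masking, excluding special tokens."""
    return [i for i, tok in enumerate(token_ids) if tok not in special_ids]

def apply_mlm_mask(token_ids, mask_id=103, mlm_probability=0.3, rng=None):
    """Mask eligible positions with probability `mlm_probability`.

    Labels are -100 (ignored by the cross-entropy loss) everywhere
    except at masked positions, which keep the original token id.
    """
    rng = rng or random.Random(0)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i in masking_candidates(token_ids):
        if rng.random() < mlm_probability:
            labels[i] = inputs[i]
            inputs[i] = mask_id
    return inputs, labels
```

Because special-token positions never enter the candidate list, [CLS] and [SEP] can never be masked, and a unit test for this is a one-line assertion over the returned labels.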
Am I correct in my understanding, or is there something I might be missing?
Thank you for looking into this.
Hi! I noticed the same thing and found your issue. Have you tried training ModernBERT using the new setup where [CLS] and [SEP] tokens are not masked? If so, did you observe any unusual behavior in the loss curve?
I’m currently training the model on textual data that isn’t natural language, so I had to implement a custom masking strategy. When I exclude [CLS] and [SEP] from masking, the loss shows sudden spikes during training, but when I allow those tokens to be masked again, the loss curve becomes smooth and stable.