
Exclude Special Tokens from Masking in mlm_masking

Open jihobak opened this issue 10 months ago • 0 comments

This PR modifies the mlm_masking method to exclude special tokens from the masking process. Previously, the method applied masking uniformly to all tokens and did not treat special tokens any differently.

Changes

  • Added a special_tokens parameter to allow specifying token IDs (e.g., [CLS], [SEP]) to be excluded from masking.
  • Updated the eligible mask logic to filter out both special tokens and pad tokens before applying masking.
  • Refactored the masking process to ensure that only eligible tokens are considered for the 80/10/10 masking scheme.
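
The changes listed above can be illustrated with a minimal, list-based sketch. It is not the code from this PR: the function name mlm_masking_sketch, the vocab_size and ignore_index defaults, and the use of Python's random module are assumptions made for illustration, and the actual method presumably operates on arrays rather than Python lists.

```python
import random


def mlm_masking_sketch(token_ids, mask_prob, mask_token, pad_token,
                       special_tokens=(), vocab_size=30522,
                       ignore_index=-100, rng=None):
    """Return (masked_ids, labels) for one sequence of token IDs.

    Hypothetical illustration only; names and defaults are assumptions.
    """
    rng = rng or random.Random()
    # Special tokens and the pad token are never eligible for masking.
    excluded = set(special_tokens) | {pad_token}
    masked = list(token_ids)
    labels = [ignore_index] * len(token_ids)
    for i, tok in enumerate(token_ids):
        # Skip ineligible tokens, and eligible ones that are not selected.
        if tok in excluded or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the loss is computed only at selected positions
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token  # 80%: replace with the mask token
        elif roll < 0.9:
            # 10%: replace with a random ID. For simplicity this may draw
            # any ID in the vocabulary, including a special one.
            masked[i] = rng.randrange(vocab_size)
        # Remaining 10%: keep the original token unchanged.
    return masked, labels
```

With mask_prob=1.0 every eligible token is selected, so the special and pad positions are the only ones whose labels stay at ignore_index. That makes the exclusion easy to check in a unit test.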

Discussions

  • See issue #212 for the discussion of this change.
  • If needed, I will add unit tests for this function in a subsequent commit.

Tests

  • [ ] Is the new feature tested? (Not always necessary for all changes -- just adding to the checklist to keep track)
  • [ ] Have you run all the tests?
  • [ ] Do the tests all pass?
  • [ ] If not, have you included an explanation of which tests this PR breaks and/or why (below this checklist)?

jihobak • Mar 25 '25 09:03