ModernBERT
Exclude Special Tokens from Masking in mlm_masking
Modified the mlm_masking method to exclude special tokens from the masking process. Previously, the method applied masking to all tokens without checking for special tokens, so tokens such as [CLS] and [SEP] could be selected for masking.
Changes
- Added a special_tokens parameter to allow specifying token IDs (e.g., [CLS], [SEP]) to be excluded from masking.
- Updated the eligible mask logic to filter out both special tokens and pad tokens before applying masking.
- Refactored the masking process to ensure that only eligible tokens are considered for the 80/10/10 masking scheme.
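
For reviewers who want a quick picture of the intended behaviour, here is a minimal NumPy sketch of the logic described in the list above. It is not the code from this PR: the function name `mlm_masking_sketch`, its signature, the `vocab_size` argument, and the default values are all assumptions made for illustration. It also draws random replacement tokens from the whole vocabulary, which may differ from what the repository does.

```python
import numpy as np


def mlm_masking_sketch(seq, mask_prob, mask_token, vocab_size,
                       pad_token=-1, special_tokens=(),
                       ignore_index=-100, rng=None):
    """Illustrative only: hypothetical name and signature, not the PR's code."""
    rng = np.random.default_rng() if rng is None else rng
    seq = np.asarray(seq)

    # Eligible positions: everything except pad tokens and special tokens.
    excluded = np.concatenate([
        np.atleast_1d(pad_token),
        np.asarray(special_tokens, dtype=seq.dtype),
    ])
    eligible = ~np.isin(seq, excluded)

    # Select positions for prediction among eligible tokens only.
    selected = eligible & (rng.random(seq.shape) < mask_prob)

    # Labels hold the original token at selected positions, ignore_index elsewhere.
    labels = np.where(selected, seq, ignore_index)

    # 80/10/10 scheme, applied to selected positions only.
    roll = rng.random(seq.shape)
    masked = seq.copy()
    masked[selected & (roll < 0.8)] = mask_token            # 80%: mask token
    random_pos = selected & (roll >= 0.8) & (roll < 0.9)    # 10%: random token
    masked[random_pos] = rng.integers(0, vocab_size, size=int(random_pos.sum()))
    # Remaining 10% of selected positions keep their original token.

    return masked, labels
```

With `special_tokens=[101, 102]` and `pad_token=0`, positions holding those IDs are never altered in the returned sequence and always carry `ignore_index` in the labels, regardless of `mask_prob`.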
Discussions
- Issue #212 regarding this change.
- If needed, I will add unit tests for this function in a subsequent commit.
Tests
- [ ] Is the new feature tested? (Not always necessary for all changes -- just adding to the checklist to keep track)
- [ ] Have you run all the tests?
- [ ] Do the tests all pass?
- [ ] If not, have you included an explanation of which tests this PR breaks and/or why (below this checklist)?