ModernBERT
Fix: Use default_factory for splits and make char fields optional in DatasetConstants
Many users will likely run this script on their own datasets for training. However, the original DatasetConstants class had two main issues:
- The 'splits' attribute was declared as a class variable, so a single dictionary was shared among all instances. A change made through one instance would silently show up in every other instance.
- The char fields (chars_per_sample and chars_per_token) were required, so they had to be supplied even for runs that do not need them.
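The first issue can be reproduced with a minimal sketch. The class name `SharedSplitsExample` and the sample values are hypothetical and are not the script's actual code; the sketch only assumes that 'splits' was an un-annotated class-level dictionary, which is one way an attribute ends up as a class variable in a dataclass.

```python
from dataclasses import dataclass


@dataclass
class SharedSplitsExample:  # hypothetical stand-in, not the real DatasetConstants
    chars_per_sample: int
    chars_per_token: int
    # No type annotation, so this is a plain class attribute rather than a
    # dataclass field. Every instance sees the same dict object.
    splits = {}


a = SharedSplitsExample(chars_per_sample=100, chars_per_token=4)
b = SharedSplitsExample(chars_per_sample=200, chars_per_token=4)

# Mutating the dict through one instance...
a.splits["train"] = "train_data"

# ...is visible through the other, because both names point at one object.
leaked = b.splits
```

After this runs, `leaked` contains the `"train"` entry even though only `a` was modified.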
Changes
- Updated the 'splits' attribute to use `field(default_factory=dict)`, ensuring each DatasetConstants instance gets its own independent dictionary.
- Changed the type hints for `chars_per_sample` and `chars_per_token` to `Optional[int]` with default values of `None`, making these parameters optional and the script more flexible.
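A minimal sketch of what the updated class might look like is shown here. The name `DatasetConstantsSketch`, the field order, and the bare `dict` annotation are assumptions for illustration; the real class may declare additional fields or a more specific type for 'splits'.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetConstantsSketch:  # hypothetical stand-in for DatasetConstants
    # default_factory calls dict() once per instance, so nothing is shared.
    splits: dict = field(default_factory=dict)
    # Both char fields may now be omitted; None means "not provided".
    chars_per_sample: Optional[int] = None
    chars_per_token: Optional[int] = None


first = DatasetConstantsSketch()
second = DatasetConstantsSketch(chars_per_sample=1000)

# Mutating one instance's splits leaves the other untouched.
first.splits["train"] = "train_data"
```

Here `second.splits` stays empty and `first.chars_per_sample` is `None`, so code that consumes these fields needs to handle the `None` case explicitly.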
This change should save users the time they might otherwise lose debugging unexpected behavior when running the script on their own datasets.