ModernBERT

Fix: Use default_factory for splits and make char fields optional in DatasetConstants

Open jihobak opened this issue 10 months ago • 0 comments

I expect many users will use this script to experiment with training on their own datasets. However, the original DatasetConstants class had two main issues:

  1. The 'splits' attribute was declared as a class variable, so a single dictionary was shared among all instances. A split added through one instance would then show up in every other instance, which can cause unexpected side effects.
  2. The char fields (chars_per_sample and chars_per_token) were required, which is inconvenient for runs that do not need them.
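
The first issue can be illustrated with a minimal sketch. The class below is a hypothetical stand-in written for this example, not the original code from the script; it only assumes that 'splits' is a mutable dictionary defined at class level.

```python
class SharedSplits:
    # Hypothetical stand-in for the original class (not the script's actual code).
    # 'splits' is a class attribute, so there is exactly one dict for all instances.
    splits = {}

    def __init__(self, chars_per_sample, chars_per_token):
        self.chars_per_sample = chars_per_sample
        self.chars_per_token = chars_per_token


a = SharedSplits(chars_per_sample=2000, chars_per_token=4)
b = SharedSplits(chars_per_sample=1000, chars_per_token=4)

# Mutating the dict through one instance changes it for the other as well.
a.splits["train"] = "train-config"
leaked = "train" in b.splits
same_object = a.splits is b.splits
```

Here both `leaked` and `same_object` end up true, which is the kind of cross-instance side effect described in point 1.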

Changes

  • Updated the 'splits' attribute to use field(default_factory=dict), ensuring each DatasetConstants instance gets its own independent dictionary.
  • Changed the type hints for chars_per_sample and chars_per_token to Optional[int] with default values of None, making these parameters optional and the script more flexible.
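
A rough sketch of what the two changes amount to, assuming DatasetConstants is a dataclass. The value type of 'splits' is simplified to `str` here as a placeholder; the real script may use a dedicated split type and a different field order.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DatasetConstants:
    # Optional with a None default, so callers may omit them.
    chars_per_sample: Optional[int] = None
    chars_per_token: Optional[int] = None
    # default_factory builds a fresh dict for every instance.
    # Dict[str, str] is a placeholder value type for this sketch.
    splits: Dict[str, str] = field(default_factory=dict)


first = DatasetConstants()
second = DatasetConstants(chars_per_sample=2000, chars_per_token=4)

# Adding a split to one instance no longer affects the other.
first.splits["train"] = "train-config"
```

With this definition, `second.splits` stays empty after the assignment, and `first` can be constructed without any char statistics.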

This change should save time for users who would otherwise run into unexpected behavior when running the script on their own datasets.

jihobak • Mar 25 '25 11:03