Refactor evaluation and add evaluate command

Open bigximik opened this issue 9 months ago • 0 comments

✨ Description

Creates an Evaluator abstraction so that evaluators other than the loss evaluator can be added.
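To illustrate the idea, here is a minimal sketch of what such an abstraction could look like. It is not the code in this PR: the class names, the `run` method, and its signature are hypothetical, and losses are passed in as plain lists of floats purely to keep the example self-contained.

```python
from abc import ABC, abstractmethod


class Evaluator(ABC):
    """Hypothetical base class: an evaluator turns batches into named metrics."""

    @abstractmethod
    def run(self, batches: list[list[float]]) -> dict[str, float]:
        """Evaluate the given batches of per-sample losses."""


class LossEvaluator(Evaluator):
    """Averages per-sample losses over at most `iterations` batches."""

    def __init__(self, iterations: int) -> None:
        self.iterations = iterations

    def run(self, batches: list[list[float]]) -> dict[str, float]:
        used = batches[: self.iterations]  # ignore batches beyond the budget
        losses = [loss for batch in used for loss in batch]
        return {"loss": sum(losses) / len(losses)}


class MaxLossEvaluator(Evaluator):
    """Illustrative non-average evaluator (an assumption, not part of the PR)."""

    def run(self, batches: list[list[float]]) -> dict[str, float]:
        return {"max_loss": max(loss for batch in batches for loss in batch)}
```

With a shared base class, the training loop only needs to call `run` on each configured evaluator, and a new evaluator type is added by subclassing rather than by editing the loop.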

Adds an evaluate command that takes the same config as training and runs evaluation on the last checkpoint.

Includes some fixes.

Example: specifying multiple LossEvaluators

training:
  evaluators:
    the_stack:
      run_interval:
        interval: 50
      evaluator:
        type: loss
        iterations: 25
        dataset_name: the_stack
    fineweb:
      run_interval:
        interval: 100
      evaluator:
        type: loss
        iterations: 15
        dataset_name: fineweb
data:
  datasets:
    the_stack:
      type: file
      path: path/to/validation_the_stack_dataset.yaml
    fineweb:
      type: file
      path: path/to/validation_fineweb_dataset.yaml
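
As a rough illustration of how the two `run_interval` values in the config above interact, the sketch below lists which evaluators would be due at a given training iteration. It assumes an evaluator runs whenever the iteration is a positive multiple of its `interval`; the actual scheduling in this PR may differ (for example, it may support an offset), and `due_evaluators` is a hypothetical helper, not an existing function.

```python
# Intervals copied from the example config above.
EVALUATOR_INTERVALS = {"the_stack": 50, "fineweb": 100}


def due_evaluators(step: int, intervals: dict[str, int] = EVALUATOR_INTERVALS) -> list[str]:
    """Return the evaluators whose interval divides `step` (assumed rule)."""
    return [
        name
        for name, interval in intervals.items()
        if step > 0 and step % interval == 0
    ]
```

Under this assumption, `the_stack` is evaluated twice as often as `fineweb`, and both run together at every 100th iteration.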

🔍 Type of change

Select all that apply:

[x] 🐛 Bug fix (non-breaking change that addresses a specific issue)
[x] 🚀 New feature (non-breaking change that adds functionality)
[ ] ⚠️ Breaking change (a change that could affect existing functionality)
[ ] 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
[x] 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
[ ] 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
[x] 📝 Documentation change (updates documentation, including new content or typo fixes)
[ ] 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

[x] 📜 I have read and followed the contributing guidelines.
[x] 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
[x] 🎉 The functionality is complete, and I have tested the changes.
[x] 📝 I have updated the documentation if needed.
[x] ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
[x] 🧩 I have commented my code, especially in hard-to-understand areas.

Testing

[x] 🧪 I have added or updated tests to cover my changes.
[x] ✔️ New and existing tests pass locally with my changes.
[x] 🚦 I have tested these changes on GPUs and verified training stability.

May 14 '25 08:05 bigximik