Online dataset mixing based on validation metrics

Open bigximik opened this issue 11 months ago • 1 comments

🎯 Goal (What & Why)

Create a Blended Dataset that can use validation metrics to readjust the sampling probabilities of its subsets. This would make it possible to rebalance the training data on the fly.
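To make the goal more concrete, here is a minimal sketch of one way such a readjustment could work. It is not an existing Fast-LLM API: the `OnlineMixer` class, its parameters (`step_size`, `smoothing`), and the subset names are all hypothetical. The update rule is a plain multiplicative-weights step that shifts probability toward subsets with higher validation loss. That is one common choice, and not necessarily the rule we would end up using.

```python
import math
import random


class OnlineMixer:
    """Hypothetical helper that keeps sampling probabilities over named subsets."""

    def __init__(self, names, step_size=1.0, smoothing=0.05):
        self.names = list(names)
        self.step_size = step_size    # how strongly losses move the weights (assumed knob)
        self.smoothing = smoothing    # share of uniform mass mixed in, so no subset starves
        uniform = 1.0 / len(self.names)
        self.probs = {name: uniform for name in self.names}

    def update(self, val_losses):
        """Re-weight subsets from a dict of per-subset validation losses."""
        # Subtract the max loss before exponentiating for numerical stability;
        # the constant cancels out in the normalization below.
        max_loss = max(val_losses[name] for name in self.names)
        scaled = {
            name: self.probs[name]
            * math.exp(self.step_size * (val_losses[name] - max_loss))
            for name in self.names
        }
        total = sum(scaled.values())
        uniform = 1.0 / len(self.names)
        self.probs = {
            name: (1.0 - self.smoothing) * scaled[name] / total
            + self.smoothing * uniform
            for name in self.names
        }
        return self.probs

    def sample(self, rng=random):
        """Pick the subset to draw the next training sample from."""
        weights = [self.probs[name] for name in self.names]
        return rng.choices(self.names, weights=weights, k=1)[0]


# Usage: after each validation run, feed the per-subset losses back in.
mixer = OnlineMixer(["code", "web", "math"])
mixer.update({"code": 2.0, "web": 1.0, "math": 1.5})
next_subset = mixer.sample()
```

The smoothing term keeps every subset's probability above `smoothing / len(names)`, so a subset with low validation loss is still sampled occasionally and its loss can keep being tracked.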

🚀 Execution Plan

TBD

📌 Acceptance Criteria (Must-Haves for Completion)

TBD

🛠️ Project Management

[x] Assign the project to the Fast-LLM project.
[ ] Set the Estimate field (in days) in the GitHub project.
[ ] Use the Size field to categorize the PR size (Small/Medium/Large).
[ ] Assign an owner when opening the issue.

Created from @oleksost comments:

Another motivation for implementing multi-dataset validation is online mixing, as described here and here.

Roughly, the idea is that if we can track the loss separately on each of the mixed domains, we can dynamically adapt the mixing coefficients online.

Mar 24 '25 12:03 bigximik

@oleksost can you help flesh this out? Not sure what the intended scope of this is. It would depend on #151, wouldn't it?

Mar 24 '25 13:03 tscholak