Move Config Validations (e.g., Dataset Usage vs. Definitions) to `_validate` for Dry Run Checks
🎯 Goal (What & Why)
Most configuration validations are performed in `_validate`, but not all. Some checks, such as ensuring that every dataset a config uses is also defined, currently live in data-object setup and in the trainer. Moving as many of the remaining checks as possible into `_validate` allows them to run during `dry_run`.
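As a minimal sketch of the kind of check in question (the names `TrainingConfig`, `dataset_definitions`, and `used_datasets` are illustrative assumptions, not Fast-LLM's actual API), the dataset-usage check could live in `_validate` like this:

```python
class TrainingConfig:
    """Hypothetical config object; only the validation logic matters here."""

    def __init__(self, dataset_definitions: dict, used_datasets: list):
        self.dataset_definitions = dataset_definitions  # e.g. {"wiki": {...}}
        self.used_datasets = used_datasets              # e.g. ["wiki", "books"]

    def _validate(self) -> None:
        # Runs during dry_run, long before any data objects are built.
        undefined = [name for name in self.used_datasets
                     if name not in self.dataset_definitions]
        if undefined:
            raise ValueError(f"Datasets used but not defined: {undefined}")
```

Because the check only inspects the config itself, it needs no external files and is safe to run in any environment.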
🚀 Execution Plan
Step 1
- Identify all validation checks that can be moved to the appropriate `_validate` method.
Step 2
- Implement the changes so that these validations occur within `_validate`.
📌 Acceptance Criteria (Must-Haves for Completion)
- The feature must be functional and tested.
- The implementation must be documented in practical terms.
- The PR must include a performance/impact summary.
- No refactors unless directly necessary for feature completion.
🛠️ Project Management
- [x] Assign the project to the Fast-LLM project.
- [ ] Set the `Estimate` field (in days) in the GitHub project.
- [ ] Use the `Size` field to categorize the PR size (Small/Medium/Large).
- [ ] Assign an owner when opening the issue.
AFAIK all checks that can be done during validation are already done there, but some cannot be, because the necessary information is missing. The most important category is checks that depend on external files (datasets, a pretrained model): we can't enforce those strictly, because a dry-run must work in a different environment where those files might not exist (e.g. locally before launching a job). A compromise might be to show a warning during validation if the file can't be found.
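The warning compromise could look roughly like this (a sketch only; `check_path` and the `strict` flag are hypothetical names, not existing Fast-LLM helpers):

```python
import logging
import pathlib

logger = logging.getLogger(__name__)

def check_path(path: str, strict: bool = False) -> None:
    # During a dry-run in an environment without the datasets, only warn;
    # when strict checking is requested (e.g. at actual launch), fail hard.
    if not pathlib.Path(path).exists():
        message = f"Referenced file not found: {path}"
        if strict:
            raise FileNotFoundError(message)
        logger.warning(message)
```

Validation would call `check_path(..., strict=False)` so a local dry-run still completes, while the real launch path could opt into `strict=True`.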
[Edit: looks like the pretrained config is loaded during validation, so we can't do a dry-run locally. This is a bit problematic.]