Move Config Validations (e.g., Dataset Usage vs. Definitions) to `_validate` for Dry Run Checks
🎯 Goal (What & Why)
Most configuration validations are performed in `_validate`, but not all. Some checks, such as ensuring that every dataset a config uses is also defined, currently live in data-object setup and in the trainer. Moving as many of the remaining checks as possible into `_validate` allows them to run during `dry_run`.
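As a minimal sketch of the kind of check in question (the names `TrainingConfig`, `dataset_definitions`, and `used_datasets` are illustrative assumptions, not Fast-LLM's actual API), the dataset-usage check could live in `_validate` like this:

```python
class TrainingConfig:
    """Hypothetical config object; only the validation logic matters here."""

    def __init__(self, dataset_definitions: dict, used_datasets: list):
        self.dataset_definitions = dataset_definitions  # e.g. {"wiki": {...}}
        self.used_datasets = used_datasets              # e.g. ["wiki", "books"]

    def _validate(self) -> None:
        # Runs during dry_run, long before any data objects are built.
        undefined = [name for name in self.used_datasets
                     if name not in self.dataset_definitions]
        if undefined:
            raise ValueError(f"Datasets used but not defined: {undefined}")
```

Because the check only inspects the config itself, it needs no external files and is safe to run in any environment.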
🚀 Execution Plan
Step 1
- Identify all validation checks that can be moved to the appropriate `_validate` method.
Step 2
- Implement the changes so that these validations occur within `_validate`.
📌 Acceptance Criteria (Must-Haves for Completion)
- The feature must be functional and tested.
- The implementation must be documented in practical terms.
- The PR must include a performance/impact summary.
- No refactors unless directly necessary for feature completion.
🛠️ Project Management
- [x] Assign the project to the Fast-LLM project.
- [ ] Set the `Estimate` field (in days) in the GitHub project.
- [ ] Use the `Size` field to categorize the PR size (Small/Medium/Large).
- [ ] Assign an owner when opening the issue.
AFAIK all checks that can be done during validation are already done there, but some cannot be, because the necessary information is missing. The most important category is checks that depend on external files (datasets, a pretrained model): we can't enforce those strictly, because a dry-run must work in a different environment where those files might not exist (e.g. locally before launching a job). A compromise might be to show a warning during validation if the file can't be found.
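The warning compromise could look roughly like this (a sketch only; `check_path` and the `strict` flag are hypothetical names, not existing Fast-LLM helpers):

```python
import logging
import pathlib

logger = logging.getLogger(__name__)

def check_path(path: str, strict: bool = False) -> None:
    # During a dry-run in an environment without the datasets, only warn;
    # when strict checking is requested (e.g. at actual launch), fail hard.
    if not pathlib.Path(path).exists():
        message = f"Referenced file not found: {path}"
        if strict:
            raise FileNotFoundError(message)
        logger.warning(message)
```

Validation would call `check_path(..., strict=False)` so a local dry-run still completes, while the real launch path could opt into `strict=True`.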
[Edit: looks like the pretrained config is loaded during validation, so we can't do a dry-run locally. This is a bit problematic.]