Reset Number of Steps in WandB to the Latest Saved Checkpoint or Implement Distinguishable Experiment Run Logging in WandB

Open bigximik opened this issue 10 months ago • 0 comments

🎯 Goal (What & Why)

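Currently, if we restart an experiment (e.g., because a job was preempted), the last step recorded in WandB can be higher than the step the experiment actually resumes from. The experiment restarts either from scratch or from the latest saved checkpoint, so its step counter moves back, while the WandB run keeps the highest step it has already received. WandB only accepts steps that do not decrease, so all metrics logged after the restart are dropped until the experiment catches up with the last step submitted before the restart.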
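The effect can be reproduced without WandB using a toy model. The sketch below is not WandB code: `MonotonicStepLog` and `simulate_preempted_run` are hypothetical names, and the only behaviour modelled is the assumption that a log call with a step lower than the highest step already seen is dropped.

```python
class MonotonicStepLog:
    """Toy stand-in for a metrics backend that drops decreasing steps."""

    def __init__(self):
        self.last_step = -1
        self.accepted = []  # steps that were stored
        self.dropped = []   # steps that were rejected

    def log(self, step, metrics):
        # Assumption: a step lower than the highest step seen so far is dropped.
        if step < self.last_step:
            self.dropped.append(step)
            return False
        self.last_step = step
        self.accepted.append(step)
        return True


def simulate_preempted_run(checkpoint_step, preempted_at, total_steps):
    """Log steps up to a preemption, then restart from the last checkpoint."""
    log = MonotonicStepLog()
    # First attempt: runs until the job is preempted.
    for step in range(1, preempted_at + 1):
        log.log(step, {"loss": 1.0 / step})
    # Second attempt: resumes right after the checkpoint, below the logged step.
    for step in range(checkpoint_step + 1, total_steps + 1):
        log.log(step, {"loss": 1.0 / step})
    return log


log = simulate_preempted_run(checkpoint_step=100, preempted_at=150, total_steps=200)
print(f"dropped {len(log.dropped)} steps: {log.dropped[0]}..{log.dropped[-1]}")
```

With a checkpoint at step 100 and a preemption at step 150, every metric the restarted job recomputes between the checkpoint and the preemption point is lost in this model.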

To address this, we need to either:

Reset the step in WandB to the actual step upon restart, or
Log each restart as a separate run, in a way that allows WandB to distinguish the runs properly.
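A minimal sketch of the second option, assuming each restart is logged as a fresh WandB run that shares a group with the earlier attempts. The helper `wandb_init_kwargs`, the naming scheme, and the `attempt` counter are hypothetical and not part of Fast-LLM; `group`, `name`, and `config` are existing `wandb.init` arguments.

```python
def wandb_init_kwargs(experiment_name, attempt, resume_step):
    """Build wandb.init keyword arguments so that each restart is its own run.

    experiment_name: shared by all attempts, used as the WandB group.
    attempt: hypothetical restart counter (0 for the first launch).
    resume_step: step of the checkpoint this attempt starts from (0 if none).
    """
    return {
        # All attempts of one experiment appear together in the WandB UI.
        "group": experiment_name,
        # The run name tells attempts apart and shows where each one started.
        "name": f"{experiment_name}-attempt{attempt}-from-step{resume_step}",
        # Keep the restart information searchable in the run config.
        "config": {"attempt": attempt, "resume_step": resume_step},
    }


kwargs = wandb_init_kwargs("my-experiment", attempt=2, resume_step=100)
print(kwargs["name"])

# Intended use (requires the wandb package; the project name is a placeholder):
#   import wandb
#   run = wandb.init(project="my-project", **kwargs)
#   run.log({"loss": loss}, step=step)  # step continues from resume_step + 1
```

Because every attempt is a new run, its first logged step can be the checkpoint step plus one without colliding with steps logged by an earlier attempt. The trade-off is that curves are split across runs and have to be compared through the group view.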

🚀 Execution Plan

(This section may start as an incomplete draft but must be defined before implementation begins.)

Step 1: What is the smallest working version?

(Describe the simplest way to implement this feature with minimal effort.)

Step 2: What additional optimizations are possible (but optional)?

(List potential refinements that can be added in later PRs if needed.)

📌 Acceptance Criteria (Must-Haves for Completion)

The feature must be functional and tested.
The implementation must be documented in practical terms.
The PR must include a performance/impact summary.
No refactors unless directly necessary for feature completion.

🛠️ Project Management

[ ] Assign the issue to the Fast-LLM project.
[ ] Set the Estimate field (in days) in the GitHub project.
[ ] Use the Size field to categorize the PR size (Small/Medium/Large).
[ ] Assign an owner when opening the issue.

Mar 31 '25 07:03 bigximik