dvc Repeat an experiment

Sometimes experiments fail for reasons related to external factors, and they may have been launched from code using a random search grid, so repeating them is not easy. Two features would be handy:

Repeating failed experiments, overwriting the appropriate parameters (data, metrics, etc.):

dvc exp run --failed

Repeating an experiment given its name, overwriting its appropriate parameters:

dvc exp run -n petete --repeat

Jan 23 '23 14:01 pablo-campillo

Putting this as p3 for now due to too many competing priorities, but seems pretty important for working with queued experiments and sweeps.

Jan 23 '23 20:01 dberenbaum

I vote for this as well. Sometimes the experiment doesn't finish due to hardware issues and thus I would like to repeat/resume those conveniently!

Dec 13 '23 17:12 lefos99

I will add to this and try to explain why it is a very important feature for me at least. I tend to use spot machines to train models since they are cheaper. However, they may be terminated at any time and so multiple experiments fail since I run them in parallel. I do use grid search, but more often than not I will do multiple different sweeps at a time and that makes it difficult to repeat the experiment.

To expand on this, it would be awesome if DVC also differentiated between a failed experiment due to an internal error or a failed experiment due to something external, although I'm not sure how feasible this is.

Anyway, in hopes of helping anyone with this issue, I have a little script for helping me retry failed experiments. Be warned though, I haven't tested it much:

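#!/bin/bash

# Re-queue every experiment that `dvc queue status` reports as failed.
# The experiment name is taken from the second column of each "Failed" row.
dvc queue status | grep "Failed" | awk '{print $2}' | while read -r name; do
    dvc exp apply "$name"  # bring the failed experiment's state into the workspace
    dvc exp run --queue    # queue a new run from that workspace state
done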

Dec 22 '23 13:12 henrypickler
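
The retry script above could be made slightly more defensive. The following is only a sketch, not a tested or official workflow. It assumes each row of `dvc queue status` has the task ID first, the experiment name second, and the status last; that layout may differ between DVC versions, and a row with an empty name would be misread. The function names `failed_names` and `requeue_failed` are made up for illustration.

```shell
#!/bin/bash

# Print the names of failed tasks from `dvc queue status`-style text on stdin.
# Assumed row layout (an assumption, check it against your DVC version):
#   <task-id> <name> <created ...> <status>
# Matching on the last field avoids picking up rows that merely contain
# the word "Failed" somewhere else, e.g. inside an experiment name.
failed_names() {
    awk 'NF >= 3 && $NF == "Failed" { print $2 }'
}

# Re-queue every failed experiment (hypothetical helper).
requeue_failed() {
    dvc queue status | failed_names | while read -r name; do
        # Only queue a new run if the experiment could be applied.
        dvc exp apply "$name" && dvc exp run --queue
    done
}
```

Calling `requeue_failed` from the repository root would then replace the loop in the original script. Filtering on the status column rather than grepping the whole line is the main difference; the DVC commands are the same ones used above.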