
How to restart training from NN in last iteration ?

Open Xi-yuanWang opened this issue 5 years ago • 8 comments

dpgen trains the neural network from scratch in every iteration. I wonder how to restart training from the train stage of the last iteration.

Xi-yuanWang avatar Nov 02 '20 04:11 Xi-yuanWang

Modify record.dpgen in your directory.

njzjz avatar Nov 02 '20 17:11 njzjz

You can add the following keys (for example) to the training section of param.json. In this case, starting from iteration 10 (`training_reuse_iter`), your training will restart from the NN model of the previous iteration (iteration 9):

```json
"training_reuse_iter": 10,
"training_reuse_old_ratio": 0.2,
"training_reuse_start_lr": 1e-4,
"training_reuse_stop_batch": 200000,
"training_reuse_start_pref_e": 0.1,
"training_reuse_start_pref_f": 100
```

Manyi-Yang avatar Nov 03 '20 11:11 Manyi-Yang

Thanks, but it seems that I have to modify the param.json for every new iteration. Is there a more automatic way?

Xi-yuanWang avatar Nov 04 '20 03:11 Xi-yuanWang

No, you don't need to modify param.json for every new iteration. `"training_reuse_iter": 10` means that from iteration 10 onward, training will always restart from the NN model of the latest iteration.

Manyi-Yang avatar Nov 05 '20 10:11 Manyi-Yang

Thanks, then what does "training_reuse_old_ratio" mean?

Xi-yuanWang avatar Nov 05 '20 14:11 Xi-yuanWang

Since your training restarts from the old model, which was already trained on structures generated in former iterations, the new training should pay more attention to the new configurations. You can therefore use this option to add only a portion of the structures from former iterations to the new training set.
`"training_reuse_old_ratio": 0.2` means that in the new training set, only 20% of the structures come from old iterations.

Manyi-Yang avatar Nov 06 '20 09:11 Manyi-Yang

Thanks, but when restarting from the last iteration, DeePMD raises an error: "probability doesn't sum to 1". How should I deal with it?

Xi-yuanWang avatar Nov 09 '20 00:11 Xi-yuanWang
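For context on the error above: sampling probabilities over the training systems must form a valid distribution, and accumulated floating-point ratios can narrowly miss summing to 1. The snippet below is a generic illustration of that failure mode and of renormalization as a fix, not the actual check inside DeePMD:

```python
# Intended to sum to 1.0, but each 0.1 carries binary rounding error,
# so the accumulated sum may differ from 1.0 by a few ulps.
probs = [0.1] * 3 + [0.7]

# Renormalizing restores a valid probability distribution.
total = sum(probs)
probs = [p / total for p in probs]
assert abs(sum(probs) - 1.0) < 1e-12
```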

This error no longer occurs when restarting training from the train stage of the last iteration. If your problem hasn't been solved yet, could you provide your data so we can reproduce it?

HuangJiameng avatar Jul 12 '22 08:07 HuangJiameng