Launch training jobs

Open • FlorianBertonBrightClue opened this issue 4 years ago • 1 comment

There seems to be an issue when a training job is launched for the first time.

In base_training_job.py, line 203 checks whether the checkpoint subfolder exists and creates it if it does not. However, this directory is a child of log_folder/training_job_name, so creating it also creates that parent folder.

Then, line 217 checks whether the log folder log_folder/training_job_name exists, in order to decide whether the training job should be initialized and its parameters saved, or whether it should resume from a checkpoint.

The problem is that this folder is guaranteed to exist, because it was just created at line 203. At this point the boolean __found_job_folder is True, which implies that a ".yml" file should be present, and that is not the case.

As a result, __initialize_training_job() tries to load the parameters (line 747) instead of saving them, and an error is raised in __load_training_parameters().
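
The ordering problem described above can be reproduced in isolation. The following is only a minimal sketch, not the actual code from base_training_job.py: the function names, the "checkpoints" subfolder name, and the use of os.makedirs are assumptions made for illustration. It shows why the existence check always succeeds on a first launch, and one possible way to avoid that by recording the result before any folder is created.

```python
import os
import tempfile


def found_job_folder_buggy(log_folder, training_job_name):
    """Hypothetical reduction of the order of operations described in the issue."""
    job_folder = os.path.join(log_folder, training_job_name)
    # Assumed subfolder name, for illustration only.
    checkpoint_folder = os.path.join(job_folder, "checkpoints")
    # Creating the checkpoint subfolder also creates its parent, the job folder.
    os.makedirs(checkpoint_folder, exist_ok=True)
    # The job folder now always exists, so this is True even on the first launch.
    return os.path.isdir(job_folder)


def found_job_folder_fixed(log_folder, training_job_name):
    """One possible reordering: check for the job folder before creating anything."""
    job_folder = os.path.join(log_folder, training_job_name)
    checkpoint_folder = os.path.join(job_folder, "checkpoints")
    # Record whether a previous run left a job folder behind.
    found = os.path.isdir(job_folder)
    os.makedirs(checkpoint_folder, exist_ok=True)
    return found


def demo():
    """Run both variants on fresh job names inside a temporary log folder."""
    with tempfile.TemporaryDirectory() as log_folder:
        buggy = found_job_folder_buggy(log_folder, "job_a")
        fixed = found_job_folder_fixed(log_folder, "job_b")
    return buggy, fixed
```

On a fresh log folder, the first variant reports the job folder as found and the second does not. An alternative, also only a suggestion, would be to test for the presence of the parameter file itself rather than the folder.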

FlorianBertonBrightClue, Feb 24 '21 09:02

I think this is a bug and have tried to fix it. Please refer to PR https://github.com/MIT-SPARK/PD-MeshNet/pull/10

HarmonJiang, Jan 20 '22 16:01