
[Feature] Sort out the training process file storage structure

Open BayMaxBHL opened this issue 1 year ago • 0 comments

What is the feature?

During training, the files are commonly saved with the following structure:

- config dump -> work_dir
- log.txt -> _log_dir (work_dir + timestamp)
- checkpoint (best) -> work_dir
- checkpoint (iter/epoch) and the txt files -> work_dir
- vis (TensorBoard) -> work_dir

In addition, I customize hooks to save the project code and the validation visualizations. I have tried several ways to get all the files from one training run saved into a single folder, but each one runs into problems.

Method 1: Before creating the runner, change work_dir to work_dir + experiment_name (timestamp). However, the timestamp may be inconsistent across multiple GPUs, so you have to call dist_init before creating the runner, modify the path, and only then create the runner.
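A minimal sketch of the idea behind Method 1, using only the standard library: one process decides the timestamp and every rank builds the same directory from it. In real multi-GPU training the agreed timestamp would be shared via a collective call (e.g. something like mmengine's `broadcast_object_list`); here that broadcast is modeled by simply passing the already-agreed string in, and `resolve_work_dir` is a hypothetical helper, not an mmengine API.

```python
import os
import time


def resolve_work_dir(base_dir: str, experiment_name: str, rank: int,
                     shared_timestamp: str) -> str:
    # Every rank calls this with the SAME shared_timestamp, so all
    # processes agree on a single directory.  In a real setup, rank 0
    # would generate the timestamp and broadcast it to the other ranks
    # before the Runner is constructed.
    return os.path.join(base_dir, f"{experiment_name}_{shared_timestamp}")


# Rank 0 decides the timestamp once; the other ranks reuse it.
timestamp = time.strftime("%Y%m%d_%H%M%S")
paths = [resolve_work_dir("work_dirs/XX", "experiment_A", r, timestamp)
         for r in range(4)]
assert len(set(paths)) == 1  # all ranks resolve to one directory
```

Doing this before the runner is created means `work_dir` already contains the experiment-specific suffix, so nothing inside the runner needs to be patched.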

Method 2: Inherit from Runner and unify the save path to _log_dir (work_dir + timestamp). Because _log_dir is hard-coded, the config dump and checkpoint saving have to be rewritten to achieve this with minimal changes. However, when the checkpoints (iter/epoch) and the txt files are saved, the txt files still end up in work_dir, and as a result the logic that keeps only the last three checkpoints no longer works.
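To make the goal of Method 2 concrete, here is a toy stand-in class, not mmengine's actual Runner API: it shows what "unifying everything on _log_dir" would look like if the config dump and checkpoint paths were all derived from one attribute. The method names `config_path` and `checkpoint_path` are hypothetical, used only to illustrate the routing.

```python
import os


class TinyRunner:
    """Toy illustration of the path split described in Method 2.

    In mmengine the timestamped _log_dir is created inside Runner's
    __init__, while checkpoints are written to work_dir by a hook.
    Deriving every artifact path from _log_dir, as below, is what
    the subclassing approach tries to achieve.
    """

    def __init__(self, work_dir: str, timestamp: str):
        self.work_dir = work_dir
        # The timestamped subdirectory; routing all saves through it
        # keeps one training run's files together.
        self._log_dir = os.path.join(work_dir, timestamp)

    def config_path(self) -> str:
        # Config dump goes into the timestamped directory, not work_dir.
        return os.path.join(self._log_dir, "config.py")

    def checkpoint_path(self, iter_num: int) -> str:
        # Checkpoints likewise land next to the logs.
        return os.path.join(self._log_dir, f"iter_{iter_num}.pth")
```

The pain point in the real code is that the checkpoint hook does not go through such a single attribute, which is why the txt files escape to work_dir.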

Although the two methods above can indirectly accomplish the goal, the patchwork feels very uncomfortable. From the saving logic, work_dir should effectively be self.work_dir + self._experiment_name. For example, I may run experiment XX many times, and I would like the save paths to be XX plus experiment A, experiment B, and so on. Runner's init does accept an experiment_name, but it does not have this effect. I understand that the saved files may deliberately be scattered across different paths, but some of those paths are hard-coded and can only be changed by inheriting from Runner, while others are set in Runner's init and cannot be modified even with hooks. It drives my obsessive-compulsive side crazy.

Any other context?

No response

BayMaxBHL avatar Oct 20 '24 03:10 BayMaxBHL