first stage of train.py for ESRGAN is slow and single threaded

Open gtnbssn opened this issue 4 years ago • 5 comments

Hi,

Not sure if this is a bug; it may just be a general question.

When using extract_images.py, it was possible to run it on multiple threads and save a lot of time.

I am starting the training now. The original dataset has almost 4000 images, so the resulting dataset with the subimages is close to 100GB.

When I start the training, there is a first, pretty long phase (many hours in my case) during which only one python process is running. Is there a way to make this phase multithreaded to save time on this part of the training? This part does not seem to use the GPU either, so I need to wait until it is finished to see whether the GPU is indeed working, or even to spot other issues with my config file.

Thank you so much for this amazing repo and all the work!

gtnbssn avatar May 26 '21 03:05 gtnbssn

Do you mean when prefetching the data? I ran into the same problem: the log shows "model[] is created" and then nothing else is printed.

wyywyyyyw avatar May 29 '21 02:05 wyywyyyyw

Aaaah, so there is indeed an option to use CUDA instead of the CPU for prefetch! I'll try this out in a couple of days. Maybe this will help!

It is the prefetch_mode: key in the YAML file, if I understand correctly.

gtnbssn avatar Jun 03 '21 07:06 gtnbssn

@gtnbssn Did you train on a custom dataset?

Samjith888 avatar Jun 17 '21 12:06 Samjith888

Yes.

gtnbssn avatar Jun 22 '21 06:06 gtnbssn

IMO, the time-consuming operation is here:

https://github.com/xinntao/BasicSR/blob/5c757162b348a09d236e00c2cc04463c0a8bba45/basicsr/data/data_sampler.py#L33

The operation there is intended to keep training reproducible.
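
To make the cost concrete, here is a rough sketch of what such an index-generation step might look like. It is not the code at the linked line: the function name, the use of Python's `random` module, and the seeding by epoch are all assumptions made for illustration. The point it shows is that a seeded shuffle over the enlarged index range is reproducible, but it runs on a single CPU thread and its work grows with dataset size times enlarge ratio.

```python
import math
import random


def enlarged_indices(dataset_size, ratio, num_replicas, rank, epoch):
    """Hypothetical stand-in for a sampler's index generation (illustration only)."""
    # Samples per process, rounded up so every process gets the same count.
    num_samples = math.ceil(dataset_size * ratio / num_replicas)
    total_size = num_samples * num_replicas

    # Assumption: the seed depends only on the epoch, so reruns give the same order.
    rng = random.Random(epoch)
    perm = list(range(total_size))
    rng.shuffle(perm)  # single-threaded, O(total_size) time and memory

    # Wrap enlarged indices back into the real dataset range,
    # then take this process's interleaved share.
    wrapped = [i % dataset_size for i in perm]
    return wrapped[rank:total_size:num_replicas]


# Same arguments always give the same order (the reproducibility property).
first = enlarged_indices(10, 3, 2, 0, epoch=0)
second = enlarged_indices(10, 3, 2, 0, epoch=0)
print(first == second, len(first))
```

With a large dataset and an enlarge ratio of 1000, `total_size` in this sketch reaches tens of millions of entries, which is why lowering the ratio shortens this phase.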

If you have a large dataset, try reducing dataset_enlarge_ratio in the configuration file, keeping "Require iter number per epoch" just greater than the total "iters" value in the logging output.

An example snippet of logging output:

2021-06-20 16:47:42,337 INFO: Training statistics:
        Number of train images: 64612
        Dataset enlarge ratio: 1000
        Batch size per gpu: 16
        World size (gpu number): 4
        Require iter number per epoch: 1009563
        Total epochs: 1; iters: 600000.
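
The numbers in this log can be reproduced with a little arithmetic, which also gives a way to pick a smaller ratio. The sketch below assumes that "Require iter number per epoch" is the image count times the enlarge ratio, divided by the global batch size and rounded up; that formula matches the log above, but the helper names are made up for illustration and are not part of BasicSR.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def iters_per_epoch(num_images, enlarge_ratio, batch_size_per_gpu, world_size):
    # Assumed formula: enlarged dataset size over the global batch size, rounded up.
    return ceil_div(num_images * enlarge_ratio, batch_size_per_gpu * world_size)


def sufficient_enlarge_ratio(num_images, total_iters, batch_size_per_gpu, world_size):
    # Integer ratio that makes one epoch at least total_iters long.
    return ceil_div(total_iters * batch_size_per_gpu * world_size, num_images)


# Values taken from the log above.
print(iters_per_epoch(64612, 1000, 16, 4))              # 1009563, as logged
ratio = sufficient_enlarge_ratio(64612, 600000, 16, 4)
print(ratio)                                            # 595
print(iters_per_epoch(64612, ratio, 16, 4))             # 600690, just above 600000
```

So for this particular run, a dataset_enlarge_ratio of about 595 instead of 1000 would still cover all 600000 iterations in one epoch while shrinking the index list that has to be generated.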

wwhio avatar Jun 23 '21 07:06 wwhio