BasicSR: first stage of train.py for ESRGAN is slow and single-threaded

Hi,

Not sure if this is a bug; it may just be a general question.

When using extract_images.py, it was possible to run it on multiple threads and save a lot of time.

I am starting the training now. The original dataset has almost 4000 images, so the resulting dataset with the subimages is close to 100GB.

When I start the training, there is a first, pretty long phase (many hours in my case) during which only one python process is running. Is there a way to make this phase multithreaded to save time on this part of the training? It also seems that this part does not use the GPU, so I need to wait until it is finished to see whether the GPU is indeed working, or to spot other issues with my config file.

Thank you so much for this amazing repo and all the work!

May 26 '21 03:05 gtnbssn

Do you mean when prefetching the data? I ran into the same problem: the log shows that model[] is created, and then nothing else is printed.

May 29 '21 02:05 wyywyyyyw

Aaaah, so there is indeed an option to use CUDA instead of the CPU for prefetching! I'll try this out in a couple of days. Maybe this will help!

It is the prefetch_mode key in the yaml file, if I understand correctly.

Jun 03 '21 07:06 gtnbssn

@gtnbssn Did you train on a custom dataset?

Jun 17 '21 12:06 Samjith888

Yes.

Jun 22 '21 06:06 gtnbssn

IMO, the time-consuming operation is here:

https://github.com/xinntao/BasicSR/blob/5c757162b348a09d236e00c2cc04463c0a8bba45/basicsr/data/data_sampler.py#L33

The operation there is intended to preserve reproducibility.
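
To illustrate why that step can get slow, here is a rough sketch of the general idea of a seeded, enlarged index list. This is not copied from BasicSR: the function name and arguments are made up for illustration, and it uses Python's standard random module as a stand-in for whatever the linked line uses. The point is only that the shuffled list has dataset_size * ratio entries, and it is built in a single process.

```python
import math
import random


def enlarged_indices(dataset_size, ratio, num_replicas, rank, epoch):
    """Hypothetical helper: seeded shuffle over an enlarged index space."""
    # Samples each process (GPU) draws in one enlarged epoch.
    num_samples = math.ceil(dataset_size * ratio / num_replicas)
    total_size = num_samples * num_replicas

    # Seeding with the epoch makes the order reproducible across runs.
    rng = random.Random(epoch)
    indices = list(range(total_size))  # this list grows with the ratio
    rng.shuffle(indices)

    # Fold the enlarged indices back onto real dataset indices.
    indices = [i % dataset_size for i in indices]

    # Each rank takes an interleaved slice of the shuffled list.
    return indices[rank:total_size:num_replicas]


if __name__ == "__main__":
    first = enlarged_indices(10, 3, 2, 0, epoch=0)
    again = enlarged_indices(10, 3, 2, 0, epoch=0)
    print(len(first), first == again)
```

With the numbers from a real run (tens of thousands of images and a ratio of 1000), the list would hold tens of millions of entries, which is consistent with the long single-process phase described above.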

If you have a large dataset, try reducing dataset_enlarge_ratio in the configuration file, keeping "Require iter number per epoch" just greater than the total iters reported in the logging output.

An example snippet of logging output:

2021-06-20 16:47:42,337 INFO: Training statistics:
        Number of train images: 64612
        Dataset enlarge ratio: 1000
        Batch size per gpu: 16
        World size (gpu number): 4
        Require iter number per epoch: 1009563
        Total epochs: 1; iters: 600000.
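
As a quick sanity check on the numbers above, here is a small sketch for picking a smaller ratio. The formula is my assumption, inferred from the log (it reproduces the 1009563 shown there); the function names are made up and are not part of BasicSR.

```python
def iters_per_epoch(num_images, enlarge_ratio, batch_size_per_gpu, world_size):
    """Assumed formula: ceil(images * ratio / (batch size * GPUs))."""
    total_batch = batch_size_per_gpu * world_size
    # Integer ceiling division, avoids floating-point rounding.
    return -(-num_images * enlarge_ratio // total_batch)


def small_enough_ratio(num_images, total_iters, batch_size_per_gpu, world_size):
    """Smallest ratio with images * ratio / total batch >= total_iters."""
    total_batch = batch_size_per_gpu * world_size
    return -(-total_iters * total_batch // num_images)


if __name__ == "__main__":
    # Values taken from the logging snippet above.
    print(iters_per_epoch(64612, 1000, 16, 4))       # 1009563
    ratio = small_enough_ratio(64612, 600000, 16, 4)
    print(ratio)                                     # 595
    print(iters_per_epoch(64612, ratio, 16, 4))      # 600690
```

So for this run, a ratio around 595 instead of 1000 would still cover the 600000 iters in one epoch while shrinking the enlarged index list by about 40%.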

Jun 23 '21 07:06 wwhio