Training speed.
I have the same hardware as yours, i.e. 8 Tesla V100 16 GB GPUs.
I tried to train STARK-ST101 on GOT-10K using the default setting, i.e. baseline_R101_got10k_only:
python tracking/train.py --script stark_st1 --config baseline_R101_got10k_only --save_dir . --mode multiple --nproc_per_node 8
python tracking/train.py --script stark_st2 --config baseline_R101_got10k_only --save_dir . --mode multiple --nproc_per_node 8 --script_prv stark_st1 --config_prv baseline_R101_got10k_only
One epoch takes about 2 hours, which is quite slow. ST1 runs for 500 epochs, so the whole training process would take more than 30 days, which is unacceptable. You said the whole training takes only about 2 days (https://github.com/researchmm/Stark/issues/20#issuecomment-882494556), so I suspect there is something wrong with the code.
@iminfine Hi, we train STARK with 8x 16GB Tesla V100 GPUs and the training indeed only takes about 2 days. This has also been verified by other teams. The extremely long training time you report may be caused by (1) slow data IO. Please check the IO time and the forward/backward time; if the IO time is clearly longer than the forward/backward time, you should speed up data IO. (2) CPUs that cannot keep up with your GPUs. Even though your GPUs are Tesla V100, if the CPUs are too weak, training will still be slow.
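For reference, here is a generic way to separate the two costs mentioned above: a minimal sketch that times data loading against the forward/backward pass in a plain PyTorch loop. It is not the Stark training code itself; `model`, `train_loader`, `criterion` and `optimizer` are placeholders for whatever your training script actually builds, and the loader is assumed to hold at least `num_iters` batches.

```python
# Minimal timing sketch (not the actual Stark trainer): measure how long
# data loading takes versus the forward/backward pass, averaged over a
# few iterations. All objects passed in are placeholders/assumptions.
import time
import torch


def profile_io_vs_compute(model, train_loader, criterion, optimizer,
                          device="cuda", num_iters=50):
    model.train()
    io_time = compute_time = 0.0

    loader_iter = iter(train_loader)
    for _ in range(num_iters):
        # --- data IO: waiting for DataLoader workers + host-to-device copy ---
        t0 = time.time()
        inputs, targets = next(loader_iter)
        inputs = inputs.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        torch.cuda.synchronize()  # make the copy time visible to time.time()
        io_time += time.time() - t0

        # --- forward / backward / optimizer step ---
        t1 = time.time()
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        compute_time += time.time() - t1

    print(f"avg data IO time    : {io_time / num_iters:.3f} s/iter")
    print(f"avg forward/backward: {compute_time / num_iters:.3f} s/iter")
```

If the IO time dominates, the usual first steps are increasing `num_workers` and enabling `pin_memory=True` in the DataLoader, or moving the datasets to a faster local disk/SSD.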
Hi, training STARK on our machine also takes too long, about 5~6 days. Could you please share the data IO time and forward/backward time from your training runs for reference?