improved-diffusion

Training time for ImageNet and CIFAR on 4xV100 GPUs

Open Michaelsqj opened this issue 3 years ago • 2 comments

Hi there! May I ask how long training normally takes on CIFAR-10 and ImageNet? I'm using 4x 16GB V100 GPUs.

I used the following settings:

export OPENAI_LOGDIR="improved_diffusion"
MODEL_FLAGS="--image_size 64 --num_channels 192 --num_res_blocks 3 --learn_sigma True --class_cond True"
DIFFUSION_FLAGS="--diffusion_steps 4000 --noise_schedule cosine --rescale_learned_sigmas False --rescale_timesteps False"
TRAIN_FLAGS="--lr 3e-4 --batch_size 256 --microbatch 16"
NUM_GPUS=4
DATA_DIR="cifar_train/"
CUDA_VISIBLE_DEVICES=0,1,2,3 mpiexec -n $NUM_GPUS python scripts/image_train.py --data_dir $DATA_DIR $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

and it prints only the following:

Logging to improved_diffusion/
creating model and diffusion...
creating data loader...
training...

and stays like this forever without new output.
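One thing that may be worth checking, offered as a rough sketch and not a diagnosis: with `--batch_size 256 --microbatch 16`, each optimizer step is presumably split into several forward/backward passes, so the first log line could take a while to appear. The arithmetic below assumes that `--batch_size` is split into chunks of `--microbatch` within each process and that a non-positive `--microbatch` disables splitting. The `log_interval=10` value is an assumed setting used only for illustration, and the function names are hypothetical helpers, not part of the repository.

```python
import math


def microbatches_per_step(batch_size: int, microbatch: int) -> int:
    """Forward/backward passes needed for one optimizer step.

    Assumption: a non-positive microbatch means "no splitting",
    i.e. the whole batch is processed in a single pass.
    """
    if microbatch <= 0:
        return 1
    return math.ceil(batch_size / microbatch)


def passes_before_first_log(batch_size: int, microbatch: int, log_interval: int) -> int:
    """Passes completed before the first periodic log line would be printed.

    Assumption: logs are written once every `log_interval` optimizer steps.
    """
    return microbatches_per_step(batch_size, microbatch) * log_interval


if __name__ == "__main__":
    # Values taken from TRAIN_FLAGS in the post above.
    per_step = microbatches_per_step(batch_size=256, microbatch=16)
    print("passes per optimizer step:", per_step)

    # log_interval=10 is an assumed value, not taken from the post.
    total = passes_before_first_log(batch_size=256, microbatch=16, log_interval=10)
    print("passes before first log line:", total)
```

If these assumptions hold, a long silent period after `training...` would not by itself mean the job has hung. Watching GPU utilization with `nvidia-smi`, or temporarily lowering `--batch_size`, may help tell a slow run apart from a stalled one.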

Michaelsqj • Jan 18 '23 00:01

Hello, how did you set up your hyperparameters, or where can I find them? I'm trying to set up mine to train the model. Thanks.

Shaaii22 • Jan 29 '23 16:01

Hello, I ran into the same problem. Have you solved it yet? Thanks.

V1oletM • Sep 11 '23 02:09