Distilling doesn't work as expected.
Hello,
Since my last question, #24, I have tried 512x512 training for both the teacher and student models. The teacher model works fine at 512x512, but the student training does not. I wonder if I can get some hints as to why.
Tfake image
Sfake image (epoch 274/1000)

Training options:
!python train.py --dataroot database/face2smile \
--model cycle_gan \
--log_dir logs/cycle_gan/face2smile/teacher_512 \
--netG inception_9blocks \
--real_stat_A_path real_stat_512/face2smile_A.npz \
--real_stat_B_path real_stat_512/face2smile_B.npz \
--batch_size 4 \
--num_threads 32 \
--gpu_ids 0,1,2,3 \
--norm_affine \
--norm_affine_D \
--channels_reduction_factor 6 \
--kernel_sizes 1 3 5 \
--save_latest_freq 10000 --save_epoch_freq 5 \
--epoch_base 176 --iter_base 223395 \
--nepochs 324 --nepochs_decay 500 \
--preprocess scale_width --load_size 512

!python distill.py --dataroot database/face2smile \
--dataset_mode unaligned \
--distiller inception \
--gan_mode lsgan \
--log_dir logs/cycle_gan/face2smile/student_512 \
--restore_teacher_G_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_G_A.pth \
--restore_pretrained_G_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_G_A.pth \
--restore_D_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_D_A.pth \
--real_stat_path real_stat_512/face2smile_B.npz \
--teacher_netG inception_9blocks --student_netG inception_9blocks \
--pretrained_ngf 64 --teacher_ngf 64 --student_ngf 20 \
--ndf 64 \
--num_threads 32 \
--eval_batch_size 4 \
--batch_size 32 \
--gpu_ids 0,1,2,3 \
--norm_affine \
--norm_affine_D \
--channels_reduction_factor 6 \
--kernel_sizes 1 3 5 \
--lambda_distill 1.0 \
--lambda_recon 5 \
--prune_cin_lb 16 \
--target_flops 2.6e9 \
--distill_G_loss_type ka \
--preprocess scale_width --load_size 512 \
--save_epoch_freq 2 --save_latest_freq 1000 \
--nepochs 500 --nepochs_decay 500 \
--norm_student batch \
--padding_type_student zero \
--norm_affine_student \
--norm_track_running_stats_student
Maybe you could try to increase the reconstruction loss for student training. For example, increase --lambda_recon to 100 and give it a try.
@alanspike Thanks. Let me give it a try!
Hi @alanspike, I tried the options below, but sadly the student's Sfake looks like the original input image rather than a translated result.
--lambda_distill 1.0 \
--lambda_recon 100 \
I took this second set of values from the Jupyter notebook tutorial:
--lambda_distill 2.8 \
--lambda_recon 1000 \
This is the Sfake at epoch 526/1000.
This is the input image.
Thanks for sharing the results. It's a bit strange, since lambda_recon is used to increase the weight of the reconstruction loss between the student model and the teacher model (as shown here), so the images generated by the student should become more similar to the teacher's. Just to confirm: the output from the teacher model is normal, right? Also, did you observe a larger reconstruction loss after increasing lambda_recon?
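To make the role of that weight concrete, here is a minimal sketch of how I think of the student objective (illustrative names only, not the exact code in distill.py): --lambda_recon only rescales the L1 term between the student's and the teacher's outputs.

```python
import torch
import torch.nn.functional as F

def student_G_loss(student_fake, teacher_fake, d_out_fake, distill_term,
                   lambda_gan=1.0, lambda_distill=1.0, lambda_recon=5.0):
    # LSGAN generator term: push D(student_fake) towards the "real" label.
    loss_gan = F.mse_loss(d_out_fake, torch.ones_like(d_out_fake))
    # Reconstruction term between student and teacher outputs; a larger
    # lambda_recon should pull the student's images closer to the teacher's.
    loss_recon = F.l1_loss(student_fake, teacher_fake)
    return (lambda_gan * loss_gan
            + lambda_distill * distill_term
            + lambda_recon * loss_recon)
```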
Hello @alanspike, thanks for checking in.
Yes, the teacher model works fine. This is the Tfake image from the same epoch.

Below are the logs with different options. I noticed that G_recon increased while the D values dropped dramatically, almost to zero. Do you think increasing the learning rate could help?
--lambda_distill 1.0 --lambda_recon 5:
(epoch: 274, iters: 43200, time: 0.964) G_gan: 0.928 G_distill: -15.980 G_recon: 0.393 D_fake: 0.014 D_real: 0.010
--lambda_distill 1.0 --lambda_recon 100:
(epoch: 342, iters: 54000, time: 1.009) G_gan: 0.996 G_distill: -15.916 G_recon: 5.616 D_fake: 0.004 D_real: 0.005
--lambda_distill 2.8 --lambda_recon 1000:
(epoch: 527, iters: 83200, time: 0.966) G_gan: 0.991 G_distill: -44.196 G_recon: 53.165 D_fake: 0.001 D_real: 0.001
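If the logged G_recon already includes the lambda_recon weight (an assumption on my side, I have not checked the logging code), then the underlying reconstruction error barely changes across the three runs:

```python
# Back-of-the-envelope: logged G_recon divided by lambda_recon, assuming the
# log reports the weighted term (not verified against the code).
runs = {5: 0.393, 100: 5.616, 1000: 53.165}   # lambda_recon -> logged G_recon
for lam, g_recon in runs.items():
    print(f"lambda_recon={lam:>4}: raw recon ~= {g_recon / lam:.3f}")
# ~0.079, ~0.056, ~0.053 -> the raw reconstruction error barely moves.
```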
Could you try setting the weight of the adversarial loss to zero and see whether the reconstruction loss decreases?
@alanspike
By the weight of the adversarial loss, do you mean --lambda_B (since I am training A to B) or --lambda_identity?
Could you maybe set this loss to zero and comment out the training of the discriminator here? I'm not sure about the cause, so I wonder whether we could remove the discriminator (i.e., no adversarial training) and use only the reconstruction and distillation losses, to see whether G_recon decreases during training. If G_recon can decrease to a reasonable value, then we should be able to get the smiling face.
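As a toy illustration of what I mean (the model, data, and loop below are made up; the point is only that with the adversarial weight at zero and no discriminator step, the teacher alone drives the student, so G_recon should fall if distillation itself is healthy):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the student generator and the (frozen) teacher output.
student = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
opt_G = torch.optim.Adam(student.parameters(), lr=2e-4)

lambda_recon = 5.0                        # adversarial term disabled entirely
for _ in range(100):
    x = torch.rand(2, 3, 64, 64)
    with torch.no_grad():
        teacher_fake = x.flip(-1)         # placeholder for the teacher generator
    student_fake = student(x)
    loss_G = lambda_recon * F.l1_loss(student_fake, teacher_fake)
    opt_G.zero_grad()                     # no loss_D, no optimizer_D.step()
    loss_G.backward()
    opt_G.step()
```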
Thank you @alanspike! I am running the distiller as you suggested.
While waiting for the result, I'd like to share some images from the last distillation run.
--lambda_distill 1.1 \
--lambda_recon 10 \
(epoch: 338, iters: 53400, time: 0.941) G_gan: 0.992 G_distill: -17.534 G_recon: 0.668 D_fake: 0.006 D_real: 0.004
End of epoch 338 / 1000 Time Taken: 167.23 sec
###(Evaluate epoch: 338, iters: 53405, time: 63.128) fid: 73.587 fid-mean: 73.792 fid-best: 72.101
Saving the model at the end of epoch 338, iters 53405
learning rate = 0.0002000

I see these patterns in every distillation run, regardless of the options.
@alanspike Hello,
I ran the distiller with --lambda_gan set to zero and the training of the discriminator commented out (lines 182-185), with learning rate 0.0002, decay after epoch 500, running until epoch 1000.
I could see G_recon decreasing, but it oscillated and fell very slowly; even after 1000 epochs it only reached a minimum of around 0.437.
I guess the learning rate is too small, or I need to run more epochs for G_recon to decrease further, or both.
I'd like to hear your opinion on this. Below are some images from the latest epoch.
Maybe the student network obtained with the default target FLOPs is too small for the larger resolution. Could you try compressing with a larger FLOPs budget?
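As a rough back-of-the-envelope (assuming the 2.6e9 budget above was tuned for 256x256 inputs), convolution FLOPs grow roughly linearly with H*W, so a 512x512 run would need about four times that budget:

```python
# Rough scaling of the FLOPs budget with resolution (estimate only).
budget_256 = 2.6e9                    # --target_flops used in the run above
scale = (512 * 512) / (256 * 256)     # conv FLOPs ~ proportional to H * W
print(f"suggested --target_flops ~= {budget_256 * scale:.2e}")   # ~1.04e10
```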
@alanspike Sure, thank you for the suggestion. Let me run with a larger FLOPs budget and update the results here.
