Distilling doesn't work as expected.
Hello,
Since my last question, #24, I have tried 512x512 training for both the teacher and student models. The teacher model works fine at 512x512, but the student training does not. I wonder if I can get some hints as to why.
Tfake image
Sfake image (epoch 274/1000)

Training options:
!python train.py --dataroot database/face2smile \
--model cycle_gan \
--log_dir logs/cycle_gan/face2smile/teacher_512 \
--netG inception_9blocks \
--real_stat_A_path real_stat_512/face2smile_A.npz \
--real_stat_B_path real_stat_512/face2smile_B.npz \
--batch_size 4 \
--num_threads 32 \
--gpu_ids 0,1,2,3 \
--norm_affine \
--norm_affine_D \
--channels_reduction_factor 6 \
--kernel_sizes 1 3 5 \
--save_latest_freq 10000 --save_epoch_freq 5 \
--epoch_base 176 --iter_base 223395 \
--nepochs 324 --nepochs_decay 500 \
--preprocess scale_width --load_size 512

!python distill.py --dataroot database/face2smile \
--dataset_mode unaligned \
--distiller inception \
--gan_mode lsgan \
--log_dir logs/cycle_gan/face2smile/student_512 \
--restore_teacher_G_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_G_A.pth \
--restore_pretrained_G_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_G_A.pth \
--restore_D_path logs/cycle_gan/face2smile/teacher_512/checkpoints/170_net_D_A.pth \
--real_stat_path real_stat_512/face2smile_B.npz \
--teacher_netG inception_9blocks --student_netG inception_9blocks \
--pretrained_ngf 64 --teacher_ngf 64 --student_ngf 20 \
--ndf 64 \
--num_threads 32 \
--eval_batch_size 4 \
--batch_size 32 \
--gpu_ids 0,1,2,3 \
--norm_affine \
--norm_affine_D \
--channels_reduction_factor 6 \
--kernel_sizes 1 3 5 \
--lambda_distill 1.0 \
--lambda_recon 5 \
--prune_cin_lb 16 \
--target_flops 2.6e9 \
--distill_G_loss_type ka \
--preprocess scale_width --load_size 512 \
--save_epoch_freq 2 --save_latest_freq 1000 \
--nepochs 500 --nepochs_decay 500 \
--norm_student batch \
--padding_type_student zero \
--norm_affine_student \
--norm_track_running_stats_student
Maybe you could try to increase the reconstruction loss for student training. For example, increase --lambda_recon to 100 and give it a try.
@alanspike Thanks. Let me give it a try!
Hi @alanspike, I tried the options below, but sadly the student's Sfake looks like the original input image rather than a translated result.
--lambda_distill 1.0 \
--lambda_recon 100 \
I took this second set of values from the Jupyter notebook tutorial:
--lambda_distill 2.8 \
--lambda_recon 1000 \
This is the Sfake at epoch 526/1000.
This is the input image.
Thanks for sharing the results. It's a bit strange, since lambda_recon is used to increase the weight of the reconstruction loss between the student model and the teacher model (as shown here), so the images generated by the student should become more similar to the teacher's. Just to confirm: the output from the teacher model is normal, right? Also, did you observe a larger reconstruction loss after increasing lambda_recon?
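To make the role of that weight concrete, here is a minimal sketch of how I think of the student objective (illustrative names only, not the exact code in distill.py): --lambda_recon only rescales the L1 term between the student's and the teacher's outputs.

```python
import torch
import torch.nn.functional as F

def student_G_loss(student_fake, teacher_fake, d_out_fake, distill_term,
                   lambda_gan=1.0, lambda_distill=1.0, lambda_recon=5.0):
    # LSGAN generator term: push D(student_fake) towards the "real" label.
    loss_gan = F.mse_loss(d_out_fake, torch.ones_like(d_out_fake))
    # Reconstruction term between student and teacher outputs; a larger
    # lambda_recon should pull the student's images closer to the teacher's.
    loss_recon = F.l1_loss(student_fake, teacher_fake)
    return (lambda_gan * loss_gan
            + lambda_distill * distill_term
            + lambda_recon * loss_recon)
```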
Hello @alanspike, thanks for checking in.
Yes, the teacher model works fine. This is the Tfake image from the same epoch.

Below are the logs with different options. I noticed that G_recon increased while the D values dropped dramatically, almost to zero. Do you think increasing the learning rate could help?
--lambda_distill 1.0 --lambda_recon 5:
(epoch: 274, iters: 43200, time: 0.964) G_gan: 0.928 G_distill: -15.980 G_recon: 0.393 D_fake: 0.014 D_real: 0.010
--lambda_distill 1.0 --lambda_recon 100:
(epoch: 342, iters: 54000, time: 1.009) G_gan: 0.996 G_distill: -15.916 G_recon: 5.616 D_fake: 0.004 D_real: 0.005
--lambda_distill 2.8 --lambda_recon 1000:
(epoch: 527, iters: 83200, time: 0.966) G_gan: 0.991 G_distill: -44.196 G_recon: 53.165 D_fake: 0.001 D_real: 0.001
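If the logged G_recon already includes the lambda_recon weight (an assumption on my side, I have not checked the logging code), then the underlying reconstruction error barely changes across the three runs:

```python
# Back-of-the-envelope: logged G_recon divided by lambda_recon, assuming the
# log reports the weighted term (not verified against the code).
runs = {5: 0.393, 100: 5.616, 1000: 53.165}   # lambda_recon -> logged G_recon
for lam, g_recon in runs.items():
    print(f"lambda_recon={lam:>4}: raw recon ~= {g_recon / lam:.3f}")
# ~0.079, ~0.056, ~0.053 -> the raw reconstruction error barely moves.
```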
Could you try setting the weight of the adversarial loss to zero and see whether the reconstruction loss decreases?
@alanspike
By the weight of the adversarial loss, do you mean --lambda_B (since I am training A to B) or --lambda_identity?
Could you maybe set this loss to zero and comment out the training of the discriminator here? I'm not sure about the cause, so I wonder whether we could remove the discriminator (i.e., no adversarial training) and use only the reconstruction and distillation losses, to see whether G_recon decreases during training. If G_recon can decrease to a reasonable value, then we should be able to get the smiling face.
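As a toy illustration of what I mean (the model, data, and loop below are made up; the point is only that with the adversarial weight at zero and no discriminator step, the teacher alone drives the student, so G_recon should fall if distillation itself is healthy):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the student generator and the (frozen) teacher output.
student = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
opt_G = torch.optim.Adam(student.parameters(), lr=2e-4)

lambda_recon = 5.0                        # adversarial term disabled entirely
for _ in range(100):
    x = torch.rand(2, 3, 64, 64)
    with torch.no_grad():
        teacher_fake = x.flip(-1)         # placeholder for the teacher generator
    student_fake = student(x)
    loss_G = lambda_recon * F.l1_loss(student_fake, teacher_fake)
    opt_G.zero_grad()                     # no loss_D, no optimizer_D.step()
    loss_G.backward()
    opt_G.step()
```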
Thank you @alanspike! I am running the distiller as you suggested.
While waiting for the result, I'd like to share some images from the last distillation run.
--lambda_distill 1.1 \
--lambda_recon 10 \
(epoch: 338, iters: 53400, time: 0.941) G_gan: 0.992 G_distill: -17.534 G_recon: 0.668 D_fake: 0.006 D_real: 0.004
End of epoch 338 / 1000 Time Taken: 167.23 sec
###(Evaluate epoch: 338, iters: 53405, time: 63.128) fid: 73.587 fid-mean: 73.792 fid-best: 72.101
Saving the model at the end of epoch 338, iters 53405
learning rate = 0.0002000

I see these patterns in every distillation run, regardless of the options.
@alanspike Hello,
I ran the distiller with --lambda_gan set to zero and the training of the discriminator commented out (lines 182-185), with learning rate 0.0002, decay after epoch 500, running until epoch 1000.
I could see G_recon decreasing, but it oscillated and fell very slowly; even after 1000 epochs it only reached a minimum of around 0.437.
I guess the learning rate is too small, or I need to run more epochs for G_recon to decrease further, or both.
I'd like to hear your opinion on this. Below are some images from the latest epoch.
Maybe the student network obtained with the default target FLOPs is too small for the larger resolution. Could you try compressing with a larger FLOPs budget?
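As a rough back-of-the-envelope (assuming the 2.6e9 budget above was tuned for 256x256 inputs), convolution FLOPs grow roughly linearly with H*W, so a 512x512 run would need about four times that budget:

```python
# Rough scaling of the FLOPs budget with resolution (estimate only).
budget_256 = 2.6e9                    # --target_flops used in the run above
scale = (512 * 512) / (256 * 256)     # conv FLOPs ~ proportional to H * W
print(f"suggested --target_flops ~= {budget_256 * scale:.2e}")   # ~1.04e10
```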
@alanspike Sure, thank you for the suggestion. Let me run with a larger FLOPs budget and update the results here.
