Some questions regarding the code
Hi Shangyu,
Great work! I have some questions about the code, listed below:
- Is `new_meta_hidden_state_dict` empty here?
- Is `meta_opt_flag` unused here?
- What is the setup for training `net` only after `meta_net` is well trained? Is there a script for that?
Looking forward to your reply.
Thanks
Hi @haichaoyu,
Thanks for using my code and finding these problems!
- This is indeed a bug in `new_meta_hidden_state_dict`, as you mentioned. I have fixed it; please refer to here for the change.
- `meta_opt_flag` is never used; I have commented it out.
- It is necessary to train the meta net cooperatively with the base quantized net. My method cannot train the quantized net with an already well-trained meta net.
Best regards, Shangyu
Hi @csyhhu,
I encountered another error here: `layer.calibration` does not exist. Maybe because `forward` in `net` is never called?
Hi @haichaoyu,
Sorry for the bug. I have fixed it; please pull the latest version of the project.
Best regards, Shangyu
Hi Shangyu,
Thanks for your quick response. The code now runs successfully. Two more questions, if I may:
- Considering the LSTMFC `meta_optimizer`: in your code, the hidden states are detached here and `meta_net`'s output gradients are detached here. So losses will not back-propagate through the hidden states, and will back-propagate through the output only once, here. This way, the LSTM cannot actually learn the so-called long-short-term memory at all. Does this affect training?
- Here, `self.calibrated_grads + (self.weight.grad.data - self.calibrated_grads.data).detach()` (call this Eq. 1) is used as the update to the weight, but during the real update here, only `calibrated_grads` is used instead, which is not the quantity used for back-propagation in Eq. 1. Is this intentional?
Best, Haichao
Hi @haichaoyu ,
Thanks again for using my code.
- For the first question: indeed, only the output is used for back-propagation. The hidden state is used only for inference when generating the next iteration's gradient. The reason is that if the hidden state were incorporated into back-propagation, a "max sequence length" would need to be set to avoid memory explosion, since I treat each training iteration as a "token" for the LSTM. So yes, it may be hard to learn the long-short-term memory this way.
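To make the truncation concrete, here is a minimal standalone sketch (not the repo's actual code; `TinyLSTMMetaOptimizer` is a made-up stand-in for the LSTMFC meta-optimizer) showing the hidden state and output being detached between iterations, so each step back-propagates through its own output only:

```python
import torch
import torch.nn as nn

class TinyLSTMMetaOptimizer(nn.Module):
    """Hypothetical stand-in for an LSTM-based gradient generator."""
    def __init__(self, hidden_size=8):
        super().__init__()
        self.lstm = nn.LSTMCell(1, hidden_size)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, grad, state):
        h, c = self.lstm(grad, state)
        return self.fc(h), (h, c)

meta = TinyLSTMMetaOptimizer()
state = (torch.zeros(1, 8), torch.zeros(1, 8))
grad = torch.randn(1, 1)

for step in range(3):
    out, state = meta(grad, state)
    loss = out.pow(2).sum()  # dummy meta-loss for illustration
    loss.backward()          # reaches meta's parameters via `out` only
    # Detach so the next iteration's graph does not extend back through
    # this iteration's hidden state (each iteration is one "token"):
    state = tuple(s.detach() for s in state)
    grad = out.detach()      # output is likewise detached before reuse
```

Without the two `detach()` calls, the graph would grow with every training iteration, which is why a max sequence length would otherwise be required.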
- The real update actually happens after obtaining the refined gradient here. After `optimizee.get_refine_gradient()`, the meta gradient is processed by an optimization algorithm (e.g. Adam) and then used to update the weight. As you noticed, it is not exactly the same quantity, due to a time gap: the `self.weight.grad.data` used in the forward pass comes from the previous iteration's `self.calibrated_grads` -> `optimizee.get_refine_gradient()`. But it does not harm training. I have discussed the same issue here, and I will try to fix it later.
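The value/gradient split in Eq. 1 is a straight-through-style construction. A minimal standalone sketch (not the repo code; `raw_grad` and `calibrated_grads` are stand-in tensors for `self.weight.grad.data` and the meta-net output `self.calibrated_grads`):

```python
import torch

# raw_grad plays the role of self.weight.grad.data (already detached data);
# calibrated_grads plays the role of the meta-net's differentiable output.
raw_grad = torch.tensor([0.5])
calibrated_grads = torch.tensor([0.3], requires_grad=True)

# Eq. 1: value equals raw_grad, but autograd only "sees" calibrated_grads,
# since the correction term (raw_grad - calibrated_grads) is detached.
refined = calibrated_grads + (raw_grad - calibrated_grads.detach())

refined.sum().backward()
# refined == raw_grad in value; d(refined)/d(calibrated_grads) == 1,
# so the meta net receives gradients as if it had produced raw_grad.
```

This is why the forward pass can use the raw gradient's value while back-propagation still trains the meta net through `calibrated_grads`.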
Best regards, Shangyu