Why not use calibrated_grads directly?
Hello, I am very interested in your paper; thank you for the implementation. But I have some questions about your code.
in this line:
https://github.com/csyhhu/MetaQuant/blob/3169e0b11e179011b1ffd3bd8ac49fb5656d7442/meta_utils/meta_quantized_module.py#L86
```python
self.meta_weight = self.weight - \
    lr * (self.calibrated_grads
          + (self.weight.grad.data - self.calibrated_grads.data).detach())
```
Why not use self.calibrated_grads directly? Instead, you used the refined gradients: self.weight.grad.
Furthermore, the weights have already been updated in the main function using the refined gradients, so I am confused about why the refined gradients are used here again!
Hi @ShuaiZ1037, thanks for your interest.
For your first question: basically, what we want here is to update the meta weights using the previous weights and the "refined gradients" (after calibration as pre-processing and refinement in the optimizer). But this comes with the following issues:
- If self.calibrated_grads is used directly, the refinement (for Adam) will not be incorporated.
- If the refined gradients (self.weight.grad) are used directly, the meta network cannot be incorporated into the computational graph (through calibrated_grads), since the refined gradients would block that connection.
Therefore, I use a somewhat roundabout way to achieve both:
- The refined gradients are what actually drive the update of meta_weight.
- calibrated_grads is incorporated into the computation so that the meta network can be updated. That is the reason for line 86: calibrated_grads is added into the computational graph, while its value is cancelled out by "- self.calibrated_grads.data". What actually contributes to the update of meta_weight is the refined gradient (self.weight.grad.data).
To see the distinction, think of the optimizer as SGD, where the value of self.calibrated_grads is the same as self.weight.grad.data; with Adam, the two values differ.
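Here is a minimal, self-contained sketch of the trick with toy scalars (not the repo's actual module): the detach() cancels the value of the calibrated gradient, so the forward value comes from the refined gradient, while backward still flows through the calibrated one.

```python
import torch

# Toy version of line 86: forward value from `refined`,
# backward gradient through `calibrated` (and hence, in the
# real code, into the meta network).
calibrated = torch.tensor(0.5, requires_grad=True)  # meta-net output (toy)
refined = torch.tensor(0.3)                         # optimizer-refined gradient (toy)
weight = torch.tensor(1.0)
lr = 0.1

meta_weight = weight - lr * (calibrated + (refined - calibrated.detach()).detach())

print(meta_weight.item())      # ~0.97 == weight - lr * refined
meta_weight.backward()
print(calibrated.grad.item())  # -0.1 == -lr, so the meta net receives a gradient
```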
For your second question: in line 86 I only add self.calibrated_grads into the computation for the meta network's update, but I still need to "actually" update the values in the base network. That is to say, line 86 does not update the real weights of the base network, so I have to perform the update in the main function using the refined gradients.
Indeed, it is a little bit tricky here. I hope this answers your question; let me know if it is still confusing.
Best regards, Shangyu
Hi @csyhhu, thank you for your enthusiastic and quick response. I get the motivation for using the two gradients:
- calibrated_grads for updating the meta-network.
- The refined gradients for incorporating the optimizer's refinement.
But is there any mismatch problem?
- self.weight.grad.data (the gradient from the last step) vs. self.calibrated_grads (the predicted gradient of this step). This can be explained as a kind of gradient accumulation. However, is there any mismatch problem in the backward pass, as described below?
- The gradient of self.weight.grad is computed from the loss ---> self.calibrated_grads ---> meta-network. Empirically, the results in your paper show its effectiveness.
Thanks, Shuai
Hi @ShuaiZ1037,
I don't think it can be regarded as a mismatch, since this is how standard optimization methods work:
Step 1: The weights receive the "natural gradient", which is obtained by chain-rule backpropagation.
Step 2: The "natural gradient" is refined by the optimization algorithm (such as Adam) to get self.weight.grad.
Step 3: The actual weight update uses the refined gradient self.weight.grad.
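For concreteness, here is how the three steps look for an ordinary PyTorch parameter, with the refinement written out by hand (an illustrative momentum example, not the repo's actual optimizer wrapper):

```python
import torch

# Three-step update with the refinement made explicit (toy momentum-SGD).
w = torch.nn.Parameter(torch.randn(4))
momentum_buf = torch.zeros_like(w)
lr, momentum = 0.1, 0.9

loss = (w ** 2).sum()
loss.backward()                            # Step 1: natural gradient lands in w.grad

momentum_buf.mul_(momentum).add_(w.grad)   # Step 2: refine the natural gradient
w.grad.data.copy_(momentum_buf)            # keep the refined value in w.grad

with torch.no_grad():
    w -= lr * w.grad                       # Step 3: update with the refined gradient
w.grad = None
```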
My method simply feeds the "natural gradient" into the meta network to generate the meta gradient and the corresponding self.calibrated_grads, which can be regarded as a modification of Step 1. I then follow Steps 2 and 3 to finish the rest.
If you are referring to the fact that self.weight.grad comes from the loss of self.calibrated_grads instead of the true loss of the base model, that is correct, and it is indeed a gradient mismatch. However, the loss of self.calibrated_grads also derives from the loss of the base model.
Best regards, Shangyu
@ShuaiZ1037 Regarding the question "Why not use calibrated_grads directly?": if SGD is used, calibrated_grads can indeed be used directly. For other optimization methods that require refining the gradient, calibrated_grads needs to be further processed to follow the procedure of the corresponding optimization algorithm.
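To make the difference concrete, here is a toy comparison with my own numbers (Adam shown as a single bias-corrected step with its default hyperparameters):

```python
import torch

# SGD leaves the gradient value unchanged, so calibrated_grads could be
# used directly; Adam rescales it with running moments, so the refined
# value differs from the raw calibrated gradient.
g = torch.tensor([0.5])          # toy calibrated gradient

sgd_refined = g.clone()          # SGD "refinement" is the identity

beta1, beta2, eps = 0.9, 0.999, 1e-8
m = (1 - beta1) * g              # first-moment estimate at step 1
v = (1 - beta2) * g ** 2         # second-moment estimate at step 1
m_hat, v_hat = m / (1 - beta1), v / (1 - beta2)
adam_refined = m_hat / (v_hat.sqrt() + eps)

print(sgd_refined)               # tensor([0.5000])
print(adam_refined)              # tensor([1.0000]); roughly sign(g) at step 1
```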
@csyhhu Hi, thank you very much for your kind reply. I understand what you mean; the problem may be with how I phrased things earlier.
Sorry, let me explain again in more detail:
1. The first point: at the end of iteration t, Adam or SGD produces the refined gradient, which is then used to update each parameter. In iteration t+1, this same refined gradient is used again to compute self.meta_weight, which is then used for the convolution. This is a minor point; the key issue comes next.
2. In the line 86 expression self.meta_weight = self.weight - lr * (self.calibrated_grads + (self.weight.grad.data - self.calibrated_grads.data).detach()), I understand that the purpose is to propagate the gradient of the base model's loss back to the meta-net so the meta-net can be updated. If self.calibrated_grads were used directly, the gradient of the loss would naturally flow back to the meta-net (though, as you said, the gradient would then not be refined by Adam).
But with the code as written in your implementation:

```python
self.meta_weight = self.weight - \
    lr * (self.calibrated_grads
          + (self.weight.grad.data - self.calibrated_grads.data).detach())
```
then, in effect, the forward computation uses self.weight.grad.data to obtain the final loss of the base model, while in backpropagation the gradient of the loss that corresponds to self.weight.grad.data is effectively assigned to self.calibrated_grads and then passed to the meta-net.
However, self.weight.grad is the gradient generated and refined by the meta-net in the previous iteration, while self.calibrated_grads is the gradient generated by the meta-net in the current iteration.
So my confusion is: can the gradient of the loss with respect to self.weight.grad.data really be assigned to self.calibrated_grads?
I ran your code, and the results are as good as reported in your paper. I hope I have expressed my question clearly. If my confusion stems from some basic mistake or a misunderstanding of your paper, please point it out!
Thanks! Shuai
@ShuaiZ1037 Thanks again for your interest and patience.
self.weight.grad is indeed the gradient generated and refined by the meta-net in the previous iteration, but self.calibrated_grads also corresponds to the previous iteration's gradient, because self.calibrated_grads is generated from the pre_quantized_weight produced in the previous iteration.
In the line self.meta_weight = self.weight - lr * (self.calibrated_grads + (self.weight.grad.data - self.calibrated_grads.data).detach()), both key gradients are therefore produced from the previous iteration's data. This is also consistent with gradient descent: the weights used to compute this iteration's loss are formed from the previous iteration's weights plus the gradient produced in the previous iteration.
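In symbols (my own notation, not from the paper):

$$W_t = W_{t-1} - \mathrm{lr} \cdot g_{t-1}$$

where $g_{t-1}$ is the (refined) gradient produced in iteration t-1.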
You may notice that, at line 252, the very first iteration of training does not update the network parameters; updates only start afterwards, so there is a kind of "delayed update" behavior.
But this should not matter much for your confusion. Still, I don't quite understand your question: "Can the gradient of the loss with respect to self.weight.grad.data be assigned to self.calibrated_grads?" If you are asking whether self.weight.grad.data itself is assigned to self.calibrated_grads, there is no such operation: self.calibrated_grads comes from pre_quantized_weight. See this function: https://github.com/csyhhu/MetaQuant/blob/3169e0b11e179011b1ffd3bd8ac49fb5656d7442/meta_utils/helpers.py#L10
If this still does not resolve your question, please feel free to keep asking!
Best regards, Shangyu
@csyhhu Thank you for your enthusiastic and quick reply.
As you described: self.calibrated_grads is synthesized by the meta-net in the current iteration (t) and corresponds to the pre_quantized_weight of the previous iteration (t-1); likewise, self.weight.grad is the gradient generated and refined by the meta-net in the previous iteration (t-1), corresponding to the pre_quantized_weight of iteration t-2. So the two gradients are not the same thing. (I am not sure whether my understanding here is correct.)
What I meant by "Can the gradient of the loss with respect to self.weight.grad.data be assigned to self.calibrated_grads?" is the following:
The code:

```python
self.meta_weight = self.weight - \
    lr * (self.calibrated_grads
          + (self.weight.grad.data - self.calibrated_grads.data).detach())
```

aims to pass the gradient of the base model to the meta-net. In the forward pass, self.weight.grad.data participates in the computation, but in the backward pass the gradient with respect to the loss (loosely speaking) that would belong to (self.weight.grad.data).grad (which does not actually exist, because of the detach) is transferred, i.e. assigned, to (self.calibrated_grads).grad and then propagated to the meta-net.
Essentially, my confusion is that the two gradients are not the same thing; they are the gradients of two different iteration steps. Used as in the code above, the forward pass uses one gradient while the backward pass uses the other. Wouldn't the gradient in backpropagation then be mismatched? You have explained the purpose of doing this, but after carefully reading the paper and the code I still do not see why it is correct, or rather why this treatment is valid.
If my understanding is wrong, please point it out! Thanks again for your patient answers!
Thanks! Shuai
@ShuaiZ1037 Thank you for the explanation.
I see your point now. After thinking it over, there does indeed seem to be a mismatch between self.weight.grad.data and self.calibrated_grads: in the current iteration's update, self.weight.grad.data should come from the current self.calibrated_grads, not from the previous iteration's self.calibrated_grads.
The correct fix should be to move lines 239-249 as a whole up above line 226; that would resolve the mismatch.
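Roughly, the reordered loop would look like this self-contained toy (all names are illustrative stand-ins, not the repo's actual variables; SGD case, so the refined value equals the calibrated one):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
meta_net = nn.Linear(1, 1, bias=False)   # stand-in for the gradient meta-network
meta_opt = torch.optim.Adam(meta_net.parameters(), lr=1e-3)

weight = torch.randn(4)                  # toy base-model weight
natural_grad = torch.randn(4)            # natural gradient from the previous step
lr = 0.1

for step in range(3):
    # Moved up: synthesize this iteration's calibrated gradient FIRST,
    # so it matches the refined gradient used right below.
    calibrated = meta_net(natural_grad.unsqueeze(-1)).squeeze(-1)
    refined = calibrated.detach()        # SGD: refinement is the identity

    # Forward value uses `refined`; backward flows to the meta-net.
    meta_weight = weight - lr * (calibrated + (refined - calibrated.detach()).detach())

    loss = (meta_weight ** 2).sum()      # toy loss on the meta weight
    meta_opt.zero_grad()
    loss.backward()                      # trains the meta-net via `calibrated`
    meta_opt.step()

    weight = weight - lr * refined       # the "main function" weight update
    natural_grad = 2 * weight            # natural gradient of (w ** 2).sum()
```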
I expect the difference to be small, but I will run the related experiments.
Thank you very much for pointing this out!
Best regards, Shangyu
@csyhhu Thank you very much for your answer. With your explanation, my earlier questions are resolved. Thanks again for your quick and patient replies today!
Best regards, Shuai