I found a mismatch between your code and your paper.
In your paper, I find
(1) we get the local learning rate for each learnable parameter by α = l × ||w||₂ / (||∇w||₂ + β × ||∇w||₂);
But in your code,
rate = gw_ratio * w_norm / (wgrad_norm + weight_decay * w_norm);
The code and the equation don't match. Is it a typo in the paper?
I think it should be α = l × ||w||₂ / (||∇w||₂ + β × ||w||₂).
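For reference, a minimal sketch of how I read the two expressions (the function and variable names are just for illustration, they are not from the repo):

```cpp
// Hypothetical helpers contrasting the two readings of the LARS local rate.
// w_norm       = ||w||_2, L2 norm of the layer's weights
// wgrad_norm   = ||grad_w||_2, L2 norm of the layer's gradient
// gw_ratio     = the trust coefficient l
// weight_decay = beta
float lars_rate_as_in_paper(float gw_ratio, float w_norm, float wgrad_norm,
                            float weight_decay) {
  // Paper, as printed: beta multiplies the *gradient* norm.
  return gw_ratio * w_norm / (wgrad_norm + weight_decay * wgrad_norm);
}

float lars_rate_as_in_code(float gw_ratio, float w_norm, float wgrad_norm,
                           float weight_decay) {
  // Code: beta multiplies the *weight* norm, which is what I expected.
  return gw_ratio * w_norm / (wgrad_norm + weight_decay * w_norm);
}
```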
Code is correct. We are working on the paper update :)
@borisgin Another mismatch.
In https://github.com/borisgin/nvcaffe-0.16/blob/caffe-0.16/models/bvlc_alexnet/solver_8K.prototxt
I found you set rampup_interval: 600, so the warm-up for an 8k batch size is about 4 epochs.
But your paper (Table 6) says the warm-up is 8 epochs.
Is something wrong in the paper?
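(My arithmetic, assuming the standard ImageNet training set of roughly 1.28M images: 600 iterations × 8192 images per iteration ≈ 4.9M images, i.e. about 3.8 ≈ 4 epochs.)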
Yesterday I experimented with short ramp-up
Boris
@borisfom
Maybe another mismatch: wgrad_norm in your code is computed from "g + beta * w" (i.e. after regularization has been applied), which is not exactly the same as the paper's "g".
And maybe the weight_decay in "rate = gw_ratio * w_norm / (wgrad_norm + weight_decay * w_norm)" should be global weight_decay * local_decay?
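A minimal sketch of what I am proposing (local_decay is a hypothetical per-layer factor, not something from the repo):

```cpp
// Hypothetical variant: scale the global weight decay by a per-layer factor
// before it enters the LARS denominator.
float lars_rate_with_local_decay(float gw_ratio, float w_norm, float wgrad_norm,
                                 float global_weight_decay, float local_decay) {
  const float effective_decay = global_weight_decay * local_decay;
  return gw_ratio * w_norm / (wgrad_norm + effective_decay * w_norm);
}
```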
That's an interesting idea! How would you tune the local decay automatically? In our experiments we used only a global weight decay, which was fixed.
Thanks. I use a fixed local decay. What do you think about the first mismatch I mentioned above, please?
For the GPU branch, the weight update, momentum, and regularization are fused into one kernel, so the Regularize() function skips this stage.
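Conceptually the fused step looks something like the following (just a rough CPU sketch for illustration, not the actual nvcaffe kernel):

```cpp
// Rough sketch of a fused SGD step: L2 regularization, momentum, and the
// weight update applied in a single pass over the parameters.
void fused_sgd_step(float* w, const float* grad, float* momentum_buf, int n,
                    float local_rate, float momentum, float weight_decay) {
  for (int i = 0; i < n; ++i) {
    const float g = grad[i] + weight_decay * w[i];                  // regularization
    momentum_buf[i] = momentum * momentum_buf[i] + local_rate * g;  // momentum
    w[i] -= momentum_buf[i];                                        // weight update
  }
}
```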
Wow, I didn't notice that. Thanks!
Hi @borisgin, did you use any automatic local decay when training ImageNet with the LARS solver? I found an automatic local decay feature in your code, but there is no mention of it in the paper. Thanks.
My branch has a lot of experimental knobs which I don't put into the official nvcaffe branch, since they have not proven themselves yet. Also, some features were developed after we submitted the paper :). The code supports both momentum and weight decay adjustment policies (currently only "poly" or "fixed"), which I used for my experiments with the Neumann optimizer.
@borisgin How are your experiments with the Neumann optimizer going? Can I see the code for the Neumann optimizer?
I experimented with a simplified Neumann optimizer (without the external loop). I found that it behaves very similarly to standard SGD with momentum, so we decided not to add this optimizer to nvidia/caffe.
Hello, when training with a larger batch size on multiple GPUs, is this batch size the total batch size across all GPUs or the per-GPU batch size?