I found a mismatch between your code and your paper.
In your paper, I find
(1) we get the local learning rate for each learnable parameter by α = l × ||w||₂ / (||∇w||₂ + β × ||∇w||₂);
But in your code,
rate = gw_ratio * w_norm / (wgrad_norm + weight_decay * w_norm);
The code and the equation don't match. Is it a typo in the paper?
I think it should be α = l × ||w||₂ / (||∇w||₂ + β × ||w||₂).
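For reference, a minimal sketch of how I read the two expressions (the function and variable names are just for illustration, they are not from the repo):

```cpp
// Hypothetical helpers contrasting the two readings of the LARS local rate.
// w_norm       = ||w||_2, L2 norm of the layer's weights
// wgrad_norm   = ||grad_w||_2, L2 norm of the layer's gradient
// gw_ratio     = the trust coefficient l
// weight_decay = beta
float lars_rate_as_in_paper(float gw_ratio, float w_norm, float wgrad_norm,
                            float weight_decay) {
  // Paper, as printed: beta multiplies the *gradient* norm.
  return gw_ratio * w_norm / (wgrad_norm + weight_decay * wgrad_norm);
}

float lars_rate_as_in_code(float gw_ratio, float w_norm, float wgrad_norm,
                           float weight_decay) {
  // Code: beta multiplies the *weight* norm, which is what I expected.
  return gw_ratio * w_norm / (wgrad_norm + weight_decay * w_norm);
}
```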
Code is correct. We are working on the paper update :)
@borisgin Another mismatch.
In https://github.com/borisgin/nvcaffe-0.16/blob/caffe-0.16/models/bvlc_alexnet/solver_8K.prototxt
I found you set rampup_interval: 600, so the warm-up for an 8k batch size is about 4 epochs.
But your paper (Table 6) says the warm-up is 8 epochs.
Is something wrong in the paper?
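(My arithmetic, assuming the standard ImageNet training set of roughly 1.28M images: 600 iterations × 8192 images per iteration ≈ 4.9M images, i.e. about 3.8 ≈ 4 epochs.)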
Yesterday I experimented with short ramp-up
Boris
@borisfom
Maybe another mismatch: wgrad_norm in your code is computed from "g + beta * w" (i.e. after regularization has been applied), which is not exactly the same as the paper's "g".
And maybe the weight_decay in "rate = gw_ratio * w_norm / (wgrad_norm + weight_decay * w_norm)" should be global weight_decay * local_decay?
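A minimal sketch of what I am proposing (local_decay is a hypothetical per-layer factor, not something from the repo):

```cpp
// Hypothetical variant: scale the global weight decay by a per-layer factor
// before it enters the LARS denominator.
float lars_rate_with_local_decay(float gw_ratio, float w_norm, float wgrad_norm,
                                 float global_weight_decay, float local_decay) {
  const float effective_decay = global_weight_decay * local_decay;
  return gw_ratio * w_norm / (wgrad_norm + effective_decay * w_norm);
}
```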
That's an interesting idea! How would you tune the local decay automatically? In our experiments we used only a global weight decay, which was fixed.
Thanks. I use a fixed local decay. What do you think about the first mismatch I mentioned above, please?
For the GPU branch, the weight update, momentum, and regularization are fused into one kernel, so the Regularize() function skips this stage.
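Conceptually the fused step looks something like the following (just a rough CPU sketch for illustration, not the actual nvcaffe kernel):

```cpp
// Rough sketch of a fused SGD step: L2 regularization, momentum, and the
// weight update applied in a single pass over the parameters.
void fused_sgd_step(float* w, const float* grad, float* momentum_buf, int n,
                    float local_rate, float momentum, float weight_decay) {
  for (int i = 0; i < n; ++i) {
    const float g = grad[i] + weight_decay * w[i];                  // regularization
    momentum_buf[i] = momentum * momentum_buf[i] + local_rate * g;  // momentum
    w[i] -= momentum_buf[i];                                        // weight update
  }
}
```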
Wow, I didn't notice that. Thanks!
Hi @borisgin, did you use any automatic local decay when training ImageNet with the LARS solver? I found an automatic local decay feature in your code, but there is no mention of it in the paper. Thanks.
My branch has a lot of experimental knobs which I don't put into the official nvcaffe branch, since they have not proven themselves yet. Also, some features were developed after we submitted the paper :). The code supports both momentum and weight decay adjustment policies (currently only "poly" or "fixed"), which I used for my experiments with the Neumann optimizer.
@borisgin How are your experiments with the Neumann optimizer going? Can I see the code for the Neumann optimizer?
I experimented with a simplified Neumann optimizer (without the external loop). I found that it behaves very similarly to standard SGD with momentum, so we decided not to add this optimizer to nvidia/caffe.
Hello, when training with a larger batch size on multiple GPUs, is this batch size the total batch size across all GPUs or the per-GPU batch size?