Why `rescale_grad: 1.0 / len(ctx) if len(ctx) > 0 else 1.0`?
I am confused by this setting, because MXNet runs forward and backward on each device and then sums and averages the parameters, so I worry this extra division will damage the results in the multi-device case. Please answer, thank you! @zhreshold
Normally we would rescale by batch size; however, in my experiments the behavior didn't scale correctly when the batch size was changed. The division by `len(ctx)` is a hack to compensate for the normalization I used in the `MakeLoss` layer.
Yes, I know, your `MakeLoss` layer uses "valid" normalization, so the gradient is already rescaled by 1/[num of valid] = 1/[sum of valid samples over the batch], which means it is already normalized by the batch size. Besides, MXNet averages the arg params across devices, **so the division by `len(ctx)` is not necessary**.
In the `MakeLoss` layer, gradients are normalized inside each device separately, so each device divides by its local batch size; the effective batch size per device is the global batch size divided by `len(ctx)`.
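A numeric sketch of this point (plain Python, no MXNet; the sample values and device count are made up for illustration): if each device averages over its own shard of the batch, a cross-device **sum** of gradients over-counts by a factor of `len(ctx)`, and multiplying by `1/len(ctx)` restores the global average.

```python
# Hypothetical illustration: per-device "valid" normalization + summation.

def per_device_grad(samples):
    # Each device averages the gradient over its local shard only,
    # mimicking per-device normalization inside the loss layer.
    return sum(samples) / len(samples)

num_ctx = 4                                  # assumed number of devices
batch = [1.0] * 8                            # 8 samples, gradient 1.0 each
shards = [batch[i::num_ctx] for i in range(num_ctx)]  # split across devices

# Every device emits an already-normalized gradient of 1.0;
# summing across devices gives 4.0 instead of the global average 1.0.
summed = sum(per_device_grad(s) for s in shards)

# rescale_grad = 1.0 / len(ctx) undoes the over-count.
rescaled = summed * (1.0 / num_ctx)

print(summed, rescaled)  # 4.0 1.0
```

If the aggregation step *averaged* gradients across devices instead of summing them, this extra factor would not be needed, which is exactly the point of disagreement in this thread.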
But MXNet averages the arg params across devices, which already implies a division by `len(ctx)`.