Why `rescale_grad: 1.0 / len(ctx) if len(ctx) > 0 else 1.0`?
I am confused by this setting, because MXNet runs forward and backward on each device and then sums and averages the parameters, so I worry this extra division will damage the results in the multi-device case. Please answer, thank you! @zhreshold
Normally we would rescale by batch size; however, in my experiments the behavior didn't scale correctly when the batch size was changed. The division by `len(ctx)` is a hack to compensate for the normalization I used in the `MakeLoss` layer.
Yes, I know, your `MakeLoss` layer uses "valid" normalization, so the gradient is already rescaled by 1/[num of valid] = 1/[sum of valid samples over the batch], which means it is already normalized by the batch size. Besides, MXNet averages the arg params across devices, **so the division by `len(ctx)` is not necessary**.
In the `MakeLoss` layer, gradients are normalized inside each device separately, so each device divides by its local batch size; the effective batch size per device is the global batch size divided by `len(ctx)`.
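A numeric sketch of this point (plain Python, no MXNet; the sample values and device count are made up for illustration): if each device averages over its own shard of the batch, a cross-device **sum** of gradients over-counts by a factor of `len(ctx)`, and multiplying by `1/len(ctx)` restores the global average.

```python
# Hypothetical illustration: per-device "valid" normalization + summation.

def per_device_grad(samples):
    # Each device averages the gradient over its local shard only,
    # mimicking per-device normalization inside the loss layer.
    return sum(samples) / len(samples)

num_ctx = 4                                  # assumed number of devices
batch = [1.0] * 8                            # 8 samples, gradient 1.0 each
shards = [batch[i::num_ctx] for i in range(num_ctx)]  # split across devices

# Every device emits an already-normalized gradient of 1.0;
# summing across devices gives 4.0 instead of the global average 1.0.
summed = sum(per_device_grad(s) for s in shards)

# rescale_grad = 1.0 / len(ctx) undoes the over-count.
rescaled = summed * (1.0 / num_ctx)

print(summed, rescaled)  # 4.0 1.0
```

If the aggregation step *averaged* gradients across devices instead of summing them, this extra factor would not be needed, which is exactly the point of disagreement in this thread.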
But MXNet averages the arg params across devices, which already implies a division by `len(ctx)`.