API changes of bps.mxnet.trainer after https://github.com/bytedance/byteps/pull/225
After this PR, `trainer.allreduce_grads` will compute the average instead of the sum. Also, I think this line (https://github.com/bytedance/byteps/blob/master/byteps/mxnet/init.py#L322) is wrong: it ignores the `self._scale = self._optimizer.rescale_grad` assignment in the constructor. Moreover, `self._scale` is designed to be multiplied onto the gradients (as in `gluon.trainer` and the previous version of `bps.mxnet.trainer`), not used as a divisor as in this line (https://github.com/bytedance/byteps/blob/master/byteps/mxnet/init.py#L337).
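A minimal sketch of the two conventions (plain Python, not the actual trainer code; the function names are made up and `scale` stands for `self._scale`):

```python
def apply_scale_gluon(grad, scale):
    # gluon.trainer and the previous bps.mxnet.trainer: the gradient is
    # multiplied by self._scale (i.e., rescale_grad) before the update
    return grad * scale

def apply_scale_linked_line(grad, scale):
    # the linked line instead divides by self._scale, inverting the
    # effect of the user's rescale_grad
    return grad / scale

print(apply_scale_gluon(8.0, 2.0))        # 16.0
print(apply_scale_linked_line(8.0, 2.0))  # 4.0
```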
Then the BERT example here (https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/run_pretraining.py#L384) is wrong when using the current byteps master, as the gradients will be divided by num_workers twice.
Is there a particular reason for making these changes solely for gradient compression? I didn't read this PR in detail and may have missed something. @eric-haibin-lin @szhengac
@ymjiang Hi, did you test the accuracy of the BERT model trained with https://github.com/byteps/examples/blob/master/mxnet/bert-large ? It seems that in this script, the NSP loss is normalized (by batch_size) on each worker, but the MLM loss is not normalized (by num_masks), see https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/pretraining_utils.py#L333-L343. The MLM loss will then likely overwhelm the NSP loss (see https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/run_pretraining.py#L177). However, the NSP loss plays an important role in training BERT, see section 5.1 in https://arxiv.org/pdf/1810.04805.pdf. cc @eric-haibin-lin
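To illustrate the imbalance with made-up numbers (none of these values or variable names come from the actual script):

```python
batch_size = 32
num_masks = 512            # hypothetical total masked tokens in the batch

nsp_loss_sum = 22.4        # summed NSP loss over the batch (illustrative)
mlm_loss_sum = 3800.0      # summed MLM loss over all masked positions

nsp_loss = nsp_loss_sum / batch_size            # normalized, 0.7
mlm_loss_unnormalized = mlm_loss_sum            # not divided by num_masks
mlm_loss_normalized = mlm_loss_sum / num_masks  # ~7.4

# Without normalization the MLM term dwarfs the NSP term in the total loss:
print(nsp_loss + mlm_loss_unnormalized)  # ~3800.7, NSP contributes ~0.02%
print(nsp_loss + mlm_loss_normalized)    # ~8.1, both terms matter
```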
Seems like there are two questions.
The first one is about API changes after the gradient compression PR. @vycezhong Can you please take a look?
The second question is about MLM loss. The example repo is using an old version of gluon-nlp (v0.8) so it is possible that some metrics are not handled properly. If it is fixed in the latest gluon-nlp, are you interested in creating a PR for this? Thank you.
@ZiyueHuang Yes, it will ignore `self._scale = self._optimizer.rescale_grad`. We did it on purpose. This change makes no difference for full-precision training, but for compression, the gradients need to be averaged before being compressed.
Why not multiply num_workers back onto the gradients at the end of _allreduce_grads? That would let this API compute the sum, consistent with the previous API (gluon.trainer and hvd.trainer also compute the sum in allreduce_grads).
Why ignore self._optimizer.rescale_grad? This is a parameter given by the user, e.g., trainer = bps.Trainer(optimizer_params={'rescale_grad': 2}). Why not self._scale *= batch_size instead of self._scale = batch_size? Also, the current implementation leads to different behavior between trainer.step and trainer.update. For example,
```python
trainer = bps.Trainer(optimizer_params={'rescale_grad': 2})
trainer.step(1)
```

will be different from

```python
trainer = bps.Trainer(optimizer_params={'rescale_grad': 2})
trainer.allreduce_grads()
trainer.update(1)
```
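To make the difference concrete, here is a toy simulation (plain Python, not the real trainer) of the two call paths as I understand them after the PR; num_workers and the gradient values are made up:

```python
num_workers = 4
rescale_grad = 2.0
grads = [8.0, 8.0, 8.0, 8.0]  # one scalar gradient per worker (illustrative)

def step_path(grads):
    # step: self._scale is overwritten, so the user's rescale_grad is
    # ignored; the gradients are simply averaged across workers
    return sum(grads) / num_workers

def allreduce_then_update_path(grads):
    # allreduce_grads averages, then update divides by rescale_grad
    # (instead of multiplying, as gluon.Trainer would)
    avg = sum(grads) / num_workers
    return avg / rescale_grad

print(step_path(grads))                   # 8.0
print(allreduce_then_update_path(grads))  # 4.0
# gluon.Trainer semantics would multiply: avg * rescale_grad == 16.0
```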
Let me summarize the API behavior before and after this PR; feel free to correct me if I make any mistakes :)
- `allreduce_grads` computes the average instead of the sum. Note that this will break some packages which depend on BytePS, e.g., gluon-nlp-BERT in your recently released example. @ymjiang
- If the user calls `step`, the input parameter `rescale_grad` now has no effect, but this is not the case if the user calls `allreduce_grads` then `update`. In the previous version (also in `gluon.trainer` and `hvd.trainer`), the user's `rescale_grad` always takes effect.
- If the user calls `allreduce_grads` then `update`, the gradients are now divided by the input parameter `rescale_grad`, instead of being multiplied by it as in the previous version (also in `gluon.trainer` and `hvd.trainer`).
Are all these changes intended, @eric-haibin-lin @szhengac? If so, why not describe them in the PR description? They seem confusing to me. Or am I missing something?
@ZiyueHuang
I remember it previously also computed the average. https://github.com/bytedance/byteps/blob/b8948f0927/byteps/mxnet/init.py#L201
And I agree with your suggestion of using self._scale *= batch_size. It looks right.
But is calling allreduce_grads then update a supported use case in MXNet? I never thought about it.
That line (https://github.com/bytedance/byteps/blob/b8948f0927/byteps/mxnet/init.py#L201) only takes effect in step or update; bps.trainer.allreduce_grads itself computed the sum.
allreduce_grads then update is allowed by MXNet API and heavily used in gluon-nlp.
I never thought of this use case... Let me see how to fix it.
This use case is documented in the doc of gluon.Trainer, and it is also used by byteps/example/gluon-nlp-BERT.
I think the most appropriate way is to multiply num_workers back onto the gradients at the end of _allreduce_grads, so that this API computes the sum. I discussed this with @eric-haibin-lin yesterday; he currently thinks that letting allreduce_grads compute the average is more semantically appropriate and that breaking the API in a 0.x version is OK. I hold the opposite opinion. But we both agree that this major API change should be discussed in the community.
Another reason to compute the average instead of the sum is that it prevents overflow when training with FP16 and a large number of workers.
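A quick demonstration of that overflow risk (illustrative numbers; float16 overflows past 65504, so a sum across many workers can hit inf while the running average stays in range):

```python
import numpy as np

num_workers = 128
grad = np.float16(600.0)  # made-up per-worker gradient magnitude

# Summing in float16 overflows: 600 * 128 = 76800 > 65504
total = np.float16(0.0)
for _ in range(num_workers):
    total = np.float16(total + grad)
print(total)  # inf

# Accumulating the average instead stays representable (close to 600)
avg = np.float16(0.0)
for _ in range(num_workers):
    avg = np.float16(avg + grad / np.float16(num_workers))
print(avg)
```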