API changes of bps.mxnet.trainer after https://github.com/bytedance/byteps/pull/225
After this PR, `trainer.allreduce_grads` will compute the average instead of the sum. Also, I think this line (https://github.com/bytedance/byteps/blob/master/byteps/mxnet/init.py#L322) is wrong: it ignores the `self._scale = self._optimizer.rescale_grad` assignment in the constructor. Moreover, `self._scale` is designed to be multiplied onto the gradients (as in `gluon.trainer` and the previous version of `bps.mxnet.trainer`), not used as a divisor as in this line (https://github.com/bytedance/byteps/blob/master/byteps/mxnet/init.py#L337).
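A minimal sketch of the two conventions (plain Python, not the actual trainer code; the function names are made up and `scale` stands for `self._scale`):

```python
def apply_scale_gluon(grad, scale):
    # gluon.trainer and the previous bps.mxnet.trainer: the gradient is
    # multiplied by self._scale (i.e., rescale_grad) before the update
    return grad * scale

def apply_scale_linked_line(grad, scale):
    # the linked line instead divides by self._scale, inverting the
    # effect of the user's rescale_grad
    return grad / scale

print(apply_scale_gluon(8.0, 2.0))        # 16.0
print(apply_scale_linked_line(8.0, 2.0))  # 4.0
```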
Then the BERT example here (https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/run_pretraining.py#L384) is wrong when using the current byteps master, as the gradients will be divided by num_workers twice.
Is there a particular reason for making these changes solely for gradient compression? I didn't read this PR in detail and may have missed something. @eric-haibin-lin @szhengac
@ymjiang Hi, did you test the accuracy of the BERT model trained with https://github.com/byteps/examples/blob/master/mxnet/bert-large ? It seems that in this script, the NSP loss is normalized (by batch_size) on each worker, but the MLM loss is not normalized (by num_masks), see https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/pretraining_utils.py#L333-L343. The MLM loss will then likely overwhelm the NSP loss (see https://github.com/byteps/examples/blob/master/mxnet/bert-large/gluon-nlp/scripts/bert/run_pretraining.py#L177). However, the NSP loss plays an important role in training BERT, see section 5.1 in https://arxiv.org/pdf/1810.04805.pdf. cc @eric-haibin-lin
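To illustrate the imbalance with made-up numbers (none of these values or variable names come from the actual script):

```python
batch_size = 32
num_masks = 512            # hypothetical total masked tokens in the batch

nsp_loss_sum = 22.4        # summed NSP loss over the batch (illustrative)
mlm_loss_sum = 3800.0      # summed MLM loss over all masked positions

nsp_loss = nsp_loss_sum / batch_size            # normalized, 0.7
mlm_loss_unnormalized = mlm_loss_sum            # not divided by num_masks
mlm_loss_normalized = mlm_loss_sum / num_masks  # ~7.4

# Without normalization the MLM term dwarfs the NSP term in the total loss:
print(nsp_loss + mlm_loss_unnormalized)  # ~3800.7, NSP contributes ~0.02%
print(nsp_loss + mlm_loss_normalized)    # ~8.1, both terms matter
```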
Seems like there are two questions.
The first one is about API changes after the gradient compression PR. @vycezhong Can you please take a look?
The second question is about MLM loss. The example repo is using an old version of gluon-nlp (v0.8) so it is possible that some metrics are not handled properly. If it is fixed in the latest gluon-nlp, are you interested in creating a PR for this? Thank you.
@ZiyueHuang Yes, it will ignore `self._scale = self._optimizer.rescale_grad`. We did it on purpose. This change makes no difference for full-precision training, but for compression, the gradients need to be averaged before being compressed.
Why not multiply num_workers back onto the gradients at the end of _allreduce_grads? That would let this API compute the sum, consistent with the previous API (gluon.trainer and hvd.trainer also compute the sum in allreduce_grads).
Why ignore self._optimizer.rescale_grad? This is a parameter given by the user, e.g., trainer = bps.Trainer(optimizer_params={'rescale_grad': 2}). Why not self._scale *= batch_size instead of self._scale = batch_size? Also, the current implementation leads to different behavior between trainer.step and trainer.update. For example,
```python
trainer = bps.Trainer(optimizer_params={'rescale_grad': 2})
trainer.step(1)
```

will be different from

```python
trainer = bps.Trainer(optimizer_params={'rescale_grad': 2})
trainer.allreduce_grads()
trainer.update(1)
```
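To make the difference concrete, here is a toy simulation (plain Python, not the real trainer) of the two call paths as I understand them after the PR; num_workers and the gradient values are made up:

```python
num_workers = 4
rescale_grad = 2.0
grads = [8.0, 8.0, 8.0, 8.0]  # one scalar gradient per worker (illustrative)

def step_path(grads):
    # step: self._scale is overwritten, so the user's rescale_grad is
    # ignored; the gradients are simply averaged across workers
    return sum(grads) / num_workers

def allreduce_then_update_path(grads):
    # allreduce_grads averages, then update divides by rescale_grad
    # (instead of multiplying, as gluon.Trainer would)
    avg = sum(grads) / num_workers
    return avg / rescale_grad

print(step_path(grads))                   # 8.0
print(allreduce_then_update_path(grads))  # 4.0
# gluon.Trainer semantics would multiply: avg * rescale_grad == 16.0
```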
Let me summarize the API behavior before and after this PR; feel free to correct me if I make any mistakes :)
- `allreduce_grads` computes the average instead of the sum. Note that this will break some packages which depend on BytePS, e.g., gluon-nlp-BERT in your recently released example. @ymjiang
- If the user calls `step`, the input parameter `rescale_grad` now has no effect, but this is not the case if the user calls `allreduce_grads` then `update`. In the previous version (also in `gluon.trainer` and `hvd.trainer`), the user's `rescale_grad` always takes effect.
- If the user calls `allreduce_grads` then `update`, the gradients are now divided by the input parameter `rescale_grad`, instead of being multiplied by it as in the previous version (also in `gluon.trainer` and `hvd.trainer`).
Are all these changes intended, @eric-haibin-lin @szhengac? If so, why not describe them in the PR description? They seem confusing to me. Or am I missing something?
@ZiyueHuang
I remember it previously also computed the average. https://github.com/bytedance/byteps/blob/b8948f0927/byteps/mxnet/init.py#L201
And I agree with your suggestion of using self._scale *= batch_size. It looks right.
But is calling allreduce_grads then update a supported use case in MXNet? I never thought about it.
That line (https://github.com/bytedance/byteps/blob/b8948f0927/byteps/mxnet/init.py#L201) only takes effect in step or update; bps.trainer.allreduce_grads itself computed the sum.
allreduce_grads then update is allowed by MXNet API and heavily used in gluon-nlp.
I never thought of this use case... Let me see how to fix it.
This use case is documented in the doc of gluon.Trainer, and it is also used by byteps/example/gluon-nlp-BERT.
I think the most appropriate way is to multiply num_workers back onto the gradients at the end of _allreduce_grads, so that this API computes the sum. I discussed this with @eric-haibin-lin yesterday; he currently thinks that letting allreduce_grads compute the average is more semantically appropriate and that breaking the API in a 0.x version is OK. I hold the opposite opinion. But we both agree that this major API change should be discussed in the community.
Another reason to compute the average instead of the sum is that it prevents overflow when training with FP16 and a large number of workers.
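A quick demonstration of that overflow risk (illustrative numbers; float16 overflows past 65504, so a sum across many workers can hit inf while the running average stays in range):

```python
import numpy as np

num_workers = 128
grad = np.float16(600.0)  # made-up per-worker gradient magnitude

# Summing in float16 overflows: 600 * 128 = 76800 > 65504
total = np.float16(0.0)
for _ in range(num_workers):
    total = np.float16(total + grad)
print(total)  # inf

# Accumulating the average instead stays representable (close to 600)
avg = np.float16(0.0)
for _ in range(num_workers):
    avg = np.float16(avg + grad / np.float16(num_workers))
print(avg)
```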