Ziyue Huang
## Description ## The AttentionCell for sliding-window self-attention, including support for multi-headed dilation and the causal attention mode, as described in Longformer: The Long-Document Transformer. cc @sxjscience @szhengac...
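Below is a minimal NumPy sketch of the attention pattern in question (sliding window, dilation, causal mode); it is illustrative only and is not the GluonNLP AttentionCell implementation. In the multi-headed dilation case, each head would use its own `dilation` value.

```python
# Sketch of the sliding-window attention mask described in Longformer:
# each query attends to `window` (dilated) positions on either side, and
# optionally only to past positions (causal mode). Illustrative only.
import numpy as np

def sliding_window_mask(seq_len, window, dilation=1, causal=False):
    """mask[i, j] is True iff query i may attend to key j."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    diff = k - q
    # keep keys within the (dilated) window and on the dilation grid
    mask = (np.abs(diff) <= window * dilation) & (diff % dilation == 0)
    if causal:
        mask &= diff <= 0   # only the current and past positions
    return mask

print(sliding_window_mask(seq_len=8, window=2, dilation=2, causal=True).astype(int))
```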
Currently, several sparse-attention schemes (e.g. block-sparse, sliding window) rely on handcrafted kernels, and it takes considerable effort to implement new schemes (for research or other purposes)....
## Description ## Based on this branch (https://github.com/ZiyueHuang/byteps/tree/mx2), we can perform distributed training for the ELECTRA model (and other models). Tested on both a single worker and two workers, each...
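For reference, a rough sketch of the Horovod-style setup each worker process would run, assuming the `byteps.mxnet` API (`init` / `local_rank` / `broadcast_parameters` / `DistributedTrainer`); it is not the actual ELECTRA training script.

```python
# Per-worker setup sketch (assumed byteps.mxnet API; not the ELECTRA script).
import mxnet as mx
import byteps.mxnet as bps

bps.init()                                # one process per GPU
ctx = mx.gpu(bps.local_rank())

net = mx.gluon.nn.Dense(2)                # stand-in for the real model
net.initialize(ctx=ctx)

params = net.collect_params()
bps.broadcast_parameters(params, root_rank=0)   # sync initial weights across workers
trainer = bps.DistributedTrainer(params, 'adam', {'learning_rate': 1e-4})

# The usual Gluon loop follows: forward/backward under autograd.record(),
# then trainer.step(batch_size); gradients are synchronized through BytePS.
```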
After this PR, `trainer.allreduce_grads` will compute the average instead of the sum. Also, I think this line (https://github.com/bytedance/byteps/blob/master/byteps/mxnet/__init__.py#L322) is wrong, as it will ignore the `self._scale = self._optimizer.rescale_grad` setting in the...
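To make the intended scaling concrete, here is a small library-free sketch (a hypothetical helper, not the BytePS code): gradients are allreduce-summed, averaged over workers, and the optimizer's `rescale_grad` is still applied rather than overwritten.

```python
def scaled_allreduce(worker_grads, num_workers, rescale_grad):
    """Average raw gradients across workers, then apply the optimizer's rescale_grad."""
    summed = [sum(gs) for gs in zip(*worker_grads)]   # what an allreduce-sum yields
    averaged = [g / num_workers for g in summed]      # average, as after this PR
    return [g * rescale_grad for g in averaged]       # keep rescale_grad (e.g. 1/batch_size)

# two workers, rescale_grad = 1 / batch_size with batch_size = 4
print(scaled_allreduce([[4.0, 2.0], [2.0, 6.0]], num_workers=2, rescale_grad=0.25))
# -> [0.75, 1.0]
```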
Support BytePS for MXNet 2.0. cc @eric-haibin-lin
Is [KVWorker::RunCallback](https://github.com/dmlc/ps-lite/blob/master/include/ps/kv_app.h#L643-L654) thread-safe? According to the [iterator invalidation rules](https://stackoverflow.com/questions/6438086/iterator-invalidation-rules-for-c-containers), rehashing invalidates the iterators of an `unordered_map`, and rehashing may happen while the callback (`it->second();`) is executing. I didn't encounter any error...
Fix https://github.com/dmlc/ps-lite/issues/189. cc @eric-haibin-lin
`torch.distributed.algorithms._checkpoint.OffloadWrapper` seems to offload the parameters as well (not only the activations) to CPU, because the autograd Function also saves parameters for backward. This can be verified by the script below, see...
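Not the original script referenced above, but a minimal sketch of the same kind of check: `torch.autograd.graph.saved_tensors_hooks` records every tensor autograd saves for backward, and the `Linear` layer's weight shows up among them, which is why offloading the saved tensors to CPU also moves parameters.

```python
import torch

saved = []

def pack(t):
    saved.append(t)   # called once for each tensor autograd saves for backward
    return t

def unpack(t):
    return t

layer = torch.nn.Linear(4, 4)
x = torch.randn(2, 4, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    out = layer(x).sum()

# The weight's storage appears among the tensors saved for backward.
print(any(t.data_ptr() == layer.weight.data_ptr() for t in saved))  # True
```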