Ziyue Huang
## Description ## The AttentionCell for sliding-window self-attention, including support for multi-headed dilation and the causal attention mode, as described in Longformer: The Long-Document Transformer. cc @sxjscience @szhengac...
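Below is a minimal NumPy sketch of the attention pattern in question (sliding window, dilation, causal mode); it is illustrative only and is not the GluonNLP AttentionCell implementation. In the multi-headed dilation case, each head would use its own `dilation` value.

```python
# Sketch of the sliding-window attention mask described in Longformer:
# each query attends to `window` (dilated) positions on either side, and
# optionally only to past positions (causal mode). Illustrative only.
import numpy as np

def sliding_window_mask(seq_len, window, dilation=1, causal=False):
    """mask[i, j] is True iff query i may attend to key j."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    diff = k - q
    # keep keys within the (dilated) window and on the dilation grid
    mask = (np.abs(diff) <= window * dilation) & (diff % dilation == 0)
    if causal:
        mask &= diff <= 0   # only the current and past positions
    return mask

print(sliding_window_mask(seq_len=8, window=2, dilation=2, causal=True).astype(int))
```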
Currently, several sparse-attention schemes (e.g. block-sparse, sliding window) rely on handcrafted kernels, and it takes considerable effort to implement new schemes (for research or other purposes)....
## Description ## Based on this branch (https://github.com/ZiyueHuang/byteps/tree/mx2), we can perform distributed training for the ELECTRA model (and other models). Tested on both a single worker and two workers, each...
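For reference, a rough sketch of the Horovod-style setup each worker process would run, assuming the `byteps.mxnet` API (`init` / `local_rank` / `broadcast_parameters` / `DistributedTrainer`); it is not the actual ELECTRA training script.

```python
# Per-worker setup sketch (assumed byteps.mxnet API; not the ELECTRA script).
import mxnet as mx
import byteps.mxnet as bps

bps.init()                                # one process per GPU
ctx = mx.gpu(bps.local_rank())

net = mx.gluon.nn.Dense(2)                # stand-in for the real model
net.initialize(ctx=ctx)

params = net.collect_params()
bps.broadcast_parameters(params, root_rank=0)   # sync initial weights across workers
trainer = bps.DistributedTrainer(params, 'adam', {'learning_rate': 1e-4})

# The usual Gluon loop follows: forward/backward under autograd.record(),
# then trainer.step(batch_size); gradients are synchronized through BytePS.
```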
After this PR, `trainer.allreduce_grads` will compute the average instead of the sum. Also, I think this line (https://github.com/bytedance/byteps/blob/master/byteps/mxnet/__init__.py#L322) is wrong, as it will ignore the `self._scale = self._optimizer.rescale_grad` setting in the...
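To make the intended scaling concrete, here is a small library-free sketch (a hypothetical helper, not the BytePS code): gradients are allreduce-summed, averaged over workers, and the optimizer's `rescale_grad` is still applied rather than overwritten.

```python
def scaled_allreduce(worker_grads, num_workers, rescale_grad):
    """Average raw gradients across workers, then apply the optimizer's rescale_grad."""
    summed = [sum(gs) for gs in zip(*worker_grads)]   # what an allreduce-sum yields
    averaged = [g / num_workers for g in summed]      # average, as after this PR
    return [g * rescale_grad for g in averaged]       # keep rescale_grad (e.g. 1/batch_size)

# two workers, rescale_grad = 1 / batch_size with batch_size = 4
print(scaled_allreduce([[4.0, 2.0], [2.0, 6.0]], num_workers=2, rescale_grad=0.25))
# -> [0.75, 1.0]
```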
Support BytePS for MXNet 2.0. cc @eric-haibin-lin
Is [KVWorker::RunCallback](https://github.com/dmlc/ps-lite/blob/master/include/ps/kv_app.h#L643-L654) thread-safe? According to the [iterator invalidation rules](https://stackoverflow.com/questions/6438086/iterator-invalidation-rules-for-c-containers), rehashing invalidates the iterators of an `unordered_map`, and rehashing may happen while the callback (`it->second();`) is executing. I didn't encounter any error...
Fix https://github.com/dmlc/ps-lite/issues/189. cc @eric-haibin-lin
`torch.distributed.algorithms._checkpoint.OffloadWrapper` seems to offload the parameters as well (not only the activations) to CPU, because the autograd Function also saves parameters for backward. This can be verified by the script below, see...
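Not the original script referenced above, but a minimal sketch of the same kind of check: `torch.autograd.graph.saved_tensors_hooks` records every tensor autograd saves for backward, and the `Linear` layer's weight shows up among them, which is why offloading the saved tensors to CPU also moves parameters.

```python
import torch

saved = []

def pack(t):
    saved.append(t)   # called once for each tensor autograd saves for backward
    return t

def unpack(t):
    return t

layer = torch.nn.Linear(4, 4)
x = torch.randn(2, 4, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    out = layer(x).sum()

# The weight's storage appears among the tensors saved for backward.
print(any(t.data_ptr() == layer.weight.data_ptr() for t in saved))  # True
```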