Shuming Ma
Hi @KennyShang @maydaygmail Could you provide more details about your implementation (e.g., the **alpha, beta** values you actually used, learning rate, batch size, warmup, Adam's betas)? BTW: "_Up scale the residual x...
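For context, here is a hedged sketch of the DeepNorm residual from the DeepNet paper, using the encoder-only formulas for alpha and beta (the depth below is illustrative, and this is not the repo's exact code):

```python
import torch.nn as nn

# DeepNorm residual: x <- LayerNorm(alpha * x + sublayer(x));
# beta is the init gain for selected sublayer weights.
# For an N-layer encoder-only model the paper sets:
N = 12                      # illustrative depth
alpha = (2 * N) ** 0.25     # residual up-scaling factor
beta = (8 * N) ** -0.25     # initialization gain

def deepnorm_residual(x, sublayer, norm: nn.LayerNorm):
    # "Up scale the residual x" by alpha before the post-LayerNorm
    return norm(alpha * x + sublayer(x))
```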
Access to the LCSTS dataset requires approval from the authors (https://arxiv.org/pdf/1506.05865.pdf), so we cannot publicly release the dataset. I suggest sending an email to the authors of...
A simple ```nn.Embedding(vocab_size, embedding_size)``` will work. Or you can refer to our example on [language modeling](https://github.com/microsoft/torchscale/blob/main/examples/fairseq/models/language_modeling.py#L215).
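For reference, a minimal usage sketch (the sizes are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, embedding_size = 32000, 768          # illustrative sizes
embed_tokens = nn.Embedding(vocab_size, embedding_size)

tokens = torch.randint(0, vocab_size, (2, 16))   # (batch, seq_len) token ids
x = embed_tokens(tokens)                         # -> (2, 16, embedding_size)
```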
We didn't claim that more dilation is always better (consider the extreme case where the segment length starts from 1). We suggest a segment length of no less than 2048 in...
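To illustrate the intuition, here is a hedged sketch of a geometric schedule for segment lengths and dilation ratios that starts at 2048 (the function and the list format are illustrative, not TorchScale's actual config API):

```python
def dilated_schedule(start=2048, max_length=65536, factor=2):
    # Segment lengths and dilation ratios grow geometrically, so the
    # shortest segment never falls below the suggested 2048.
    segment_lengths, dilated_ratios = [], []
    length, ratio = start, 1
    while length <= max_length:
        segment_lengths.append(length)
        dilated_ratios.append(ratio)
        length *= factor
        ratio *= factor
    return segment_lengths, dilated_ratios

# dilated_schedule() -> ([2048, 4096, 8192, 16384, 32768, 65536],
#                        [1, 2, 4, 8, 16, 32])
```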
You can try `--memory-efficient-fp16 --checkpoint-activations`, which can significantly reduce the memory consumption.
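Conceptually, `--checkpoint-activations` trades compute for memory by recomputing activations during the backward pass. A minimal PyTorch sketch of the underlying technique (not the fairseq code path):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(1024, 1024)
x = torch.randn(8, 1024, requires_grad=True)

# Activations of `layer` are not stored; they are recomputed on backward.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```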
Actually, I am not familiar with recommendation models. If Transformers4Rec adopts the standard Transformer architecture as its backbone, I think it's possible to replace the backbone with TorchScale to enjoy...
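For reference, instantiating a TorchScale encoder as a drop-in backbone looks roughly like this (the vocab size is illustrative; see the TorchScale README for the full set of config options):

```python
from torchscale.architecture.config import EncoderConfig
from torchscale.architecture.encoder import Encoder

config = EncoderConfig(vocab_size=64000)  # illustrative vocab size
model = Encoder(config)
```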
Hi, thanks for your interest in our work. Unfortunately, we don't have any plans to release the checkpoints finetuned on the downstream tasks. Yet, both the pretrained DeltaLM and the...
Hi @yugaljain1999, DeltaLM supports the same languages as InfoXLM. You can find those languages in the appendix of [InfoXLM paper](https://arxiv.org/abs/2007.07834).
The code is based on Python 3.5 with PyTorch 0.3.1.
After preprocessing, the source texts are in one file and the target summaries are in another. Each line in a file is one example, where each token is replaced by its numeric id, with the ids separated by spaces.
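For illustration, reading such files might look like this (the file names and ids are hypothetical):

```python
# Each line is one example whose tokens are integer ids separated by
# spaces, e.g. "17 284 9 53".
with open("train.source") as src_f, open("train.target") as tgt_f:
    for src_line, tgt_line in zip(src_f, tgt_f):
        src_ids = [int(t) for t in src_line.split()]
        tgt_ids = [int(t) for t in tgt_line.split()]
```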