Shuming Ma
Hi @KennyShang @maydaygmail Could you provide more details about your implementation (e.g., the **alpha, beta** values you actually used, learning rate, batch size, warmup, Adam's betas)? BTW: "_Up scale the residual x...
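For context, here is a hedged sketch of the DeepNorm residual from the DeepNet paper, using the encoder-only formulas for alpha and beta (the depth below is illustrative, and this is not the repo's exact code):

```python
import torch.nn as nn

# DeepNorm residual: x <- LayerNorm(alpha * x + sublayer(x));
# beta is the init gain for selected sublayer weights.
# For an N-layer encoder-only model the paper sets:
N = 12                      # illustrative depth
alpha = (2 * N) ** 0.25     # residual up-scaling factor
beta = (8 * N) ** -0.25     # initialization gain

def deepnorm_residual(x, sublayer, norm: nn.LayerNorm):
    # "Up scale the residual x" by alpha before the post-LayerNorm
    return norm(alpha * x + sublayer(x))
```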
Access to the LCSTS dataset requires approval from the authors (https://arxiv.org/pdf/1506.05865.pdf), so we cannot publicly release the dataset. I suggest sending an email to the authors of...
A simple ```nn.Embedding(vocab_size, embedding_size)``` will work. Or you can refer to our example on [language modeling](https://github.com/microsoft/torchscale/blob/main/examples/fairseq/models/language_modeling.py#L215).
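For reference, a minimal usage sketch (the sizes are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, embedding_size = 32000, 768          # illustrative sizes
embed_tokens = nn.Embedding(vocab_size, embedding_size)

tokens = torch.randint(0, vocab_size, (2, 16))   # (batch, seq_len) token ids
x = embed_tokens(tokens)                         # -> (2, 16, embedding_size)
```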
We didn't claim that more dilation is always better (consider the extreme case where the segment length starts from 1). We suggest a segment length of no less than 2048 in...
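To illustrate the intuition, here is a hedged sketch of a geometric schedule for segment lengths and dilation ratios that starts at 2048 (the function and the list format are illustrative, not TorchScale's actual config API):

```python
def dilated_schedule(start=2048, max_length=65536, factor=2):
    # Segment lengths and dilation ratios grow geometrically, so the
    # shortest segment never falls below the suggested 2048.
    segment_lengths, dilated_ratios = [], []
    length, ratio = start, 1
    while length <= max_length:
        segment_lengths.append(length)
        dilated_ratios.append(ratio)
        length *= factor
        ratio *= factor
    return segment_lengths, dilated_ratios

# dilated_schedule() -> ([2048, 4096, 8192, 16384, 32768, 65536],
#                        [1, 2, 4, 8, 16, 32])
```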
You can try `--memory-efficient-fp16 --checkpoint-activations`, which can significantly reduce the memory consumption.
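Conceptually, `--checkpoint-activations` trades compute for memory by recomputing activations during the backward pass. A minimal PyTorch sketch of the underlying technique (not the fairseq code path):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(1024, 1024)
x = torch.randn(8, 1024, requires_grad=True)

# Activations of `layer` are not stored; they are recomputed on backward.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```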
Actually, I am not familiar with recommendation models. If Transformers4Rec adopts the standard Transformer architecture as its backbone, I think it's possible to replace the backbone with TorchScale to enjoy...
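For reference, instantiating a TorchScale encoder as a drop-in backbone looks roughly like this (the vocab size is illustrative; see the TorchScale README for the full set of config options):

```python
from torchscale.architecture.config import EncoderConfig
from torchscale.architecture.encoder import Encoder

config = EncoderConfig(vocab_size=64000)  # illustrative vocab size
model = Encoder(config)
```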
Hi, thanks for your interest in our work. Unfortunately, we don't have any plans to release the checkpoints finetuned on the downstream tasks. Yet, both the pretrained DeltaLM and the...
Hi @yugaljain1999, DeltaLM supports the same languages as InfoXLM. You can find those languages in the appendix of [InfoXLM paper](https://arxiv.org/abs/2007.07834).
The code is based on Python 3.5 with PyTorch 0.3.1.
After preprocessing, the source texts are in one file and the target summaries are in another. Each line in a file is one example, where each token is replaced by its numeric id, with the ids separated by spaces.
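For illustration, reading such files might look like this (the file names and ids are hypothetical):

```python
# Each line is one example whose tokens are integer ids separated by
# spaces, e.g. "17 284 9 53".
with open("train.source") as src_f, open("train.target") as tgt_f:
    for src_line, tgt_line in zip(src_f, tgt_f):
        src_ids = [int(t) for t in src_line.split()]
        tgt_ids = [int(t) for t in tgt_line.split()]
```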