BERT4doc-Classification
For Layer-wise Decreasing Layer Rate
Thanks for your hard work! I have two questions. First, for Layer-wise Decreasing Layer Rate, did you also use warm-up or polynomial_decay? That is, are the warm-up rate and the Layer-wise Decreasing Layer Rate applied simultaneously? Second, for BERT-large, how did you set the learning rate and decay factor, which the paper doesn't give?
Sorry for the late answer.
- We also use warm-up with the layer-wise decreasing layer rate, which means the two are used simultaneously.
- We did not run learning-rate experiments on BERT-large, but we empirically observe that BERT-large gives results similar to BERT-base.
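
To make "used simultaneously" concrete, here is a minimal sketch of how the two might be combined: each layer gets its own base learning rate, and a single warm-up multiplier scales all of them at every step. This is not the repository's code. The function names, the linear-warm-up-then-linear-decay shape, the 12-layer count, and the values `2e-5`, `0.95`, and `0.1` are all assumptions chosen for illustration.

```python
def layerwise_lrs(top_lr, decay, num_layers):
    """Per-layer base learning rates; index 0 is the lowest layer.

    The top layer gets top_lr, and each layer below it gets the rate of
    the layer above multiplied by decay (assumed to be <= 1).
    """
    return [top_lr * decay ** (num_layers - 1 - k) for k in range(num_layers)]


def warmup_multiplier(step, total_steps, warmup_proportion):
    """Schedule multiplier in [0, 1].

    Assumed shape: linear ramp from 0 to 1 during warm-up, then linear
    decay back to 0 by the final step.
    """
    warmup_steps = int(total_steps * warmup_proportion)
    if warmup_steps > 0 and step < warmup_steps:
        return step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return max(0.0, (total_steps - step) / remaining)


def lr_at(step, layer, base_lrs, total_steps, warmup_proportion=0.1):
    """Effective learning rate: layer-wise base rate times the schedule."""
    return base_lrs[layer] * warmup_multiplier(step, total_steps, warmup_proportion)


# Hypothetical settings: 12 encoder layers, top-layer rate 2e-5, decay 0.95.
base_lrs = layerwise_lrs(top_lr=2e-5, decay=0.95, num_layers=12)
for layer in (0, 11):
    print(layer, lr_at(step=50, layer=layer, base_lrs=base_lrs, total_steps=1000))
```

Because the schedule is a shared multiplier, the ratio between any two layers' learning rates stays fixed throughout training; only the overall scale changes during warm-up and decay.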