Jordan Clive
The references in /evaluation/dart_reference are not for the current version. Can you replace them with the new references and share the tokenization script applied to the predictions? I am really not clear from the paper: 'computed as cosine similarity with annealing between the encodings h_x and h_y. It starts at 1 and ends at √d, linearly increasing over the first...
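My current reading of that quote, as a hedged sketch only: the cosine similarity is multiplied by a scale that is annealed linearly from 1 to √d over some number of initial steps (the horizon is truncated in the quote, so `anneal_steps` below is an assumed name, not from the paper):

```python
import math
import torch
import torch.nn.functional as F

def annealed_cosine_similarity(h_x: torch.Tensor, h_y: torch.Tensor,
                               step: int, anneal_steps: int) -> torch.Tensor:
    """Cosine similarity between encodings h_x and h_y, scaled by a factor
    that increases linearly from 1 to sqrt(d) over the first anneal_steps."""
    d = h_x.size(-1)
    progress = min(step / anneal_steps, 1.0)        # clamp once annealing ends
    scale = 1.0 + progress * (math.sqrt(d) - 1.0)   # 1 -> sqrt(d), linear
    return scale * F.cosine_similarity(h_x, h_y, dim=-1)
```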
Probably want no bias terms in the linear layers, to stop the model bloating (see the sketch below).
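A minimal sketch of what this would mean, assuming "no biases" refers to the bias terms of the projection layers (layer names and sizes here are illustrative):

```python
import torch.nn as nn

hidden = 768
# No bias vector is stored or trained for this projection.
proj = nn.Linear(hidden, hidden, bias=False)
```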
Need to check that I've implemented label smoothing the same way the authors smoothed their objective, since the objective function includes negative sampling.
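For reference, this is the standard label-smoothed cross-entropy I am checking against; whether it matches how the authors smooth an objective that includes negative sampling is exactly what still needs verifying, so treat it as an assumption rather than their formulation:

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits: torch.Tensor, target: torch.Tensor,
                       epsilon: float = 0.1) -> torch.Tensor:
    """logits: (batch, vocab); target: (batch,) of class indices."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.mean(dim=-1)   # uniform prior over the vocabulary
    return ((1.0 - epsilon) * nll + epsilon * smooth).mean()
```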
Currently using torch's GELU; the paper uses fast GELU (see the sketch below).
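Assuming "fast GELU" means the usual tanh approximation (as in GPT-2-style code), the switch would look roughly like this; recent PyTorch exposes it directly, otherwise it can be written out by hand:

```python
import math
import torch
import torch.nn as nn

gelu_exact = nn.GELU()                   # current choice: erf-based GELU
gelu_fast = nn.GELU(approximate='tanh')  # tanh approximation (PyTorch >= 1.12)

def fast_gelu(x: torch.Tensor) -> torch.Tensor:
    """Explicit tanh approximation, equivalent to approximate='tanh'."""
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi)
                                       * (x + 0.044715 * x.pow(3))))
```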
Implement BPE from scratch with UNK tokens hashed (although this may achieve worse results on downstream tasks, as it is perhaps not as general as bpemb's 25000.model).
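The distinctive part is the UNK hashing; a hedged sketch of that idea, where out-of-vocabulary pieces are mapped deterministically into a small set of UNK buckets instead of a single <unk> id (the bucket count and hashing scheme are assumptions, not bpemb's behaviour):

```python
import hashlib

NUM_UNK_BUCKETS = 100

def hashed_unk_id(piece: str, vocab: dict) -> int:
    """Return the vocab id for piece, hashing OOV pieces into UNK buckets."""
    if piece in vocab:
        return vocab[piece]
    digest = int(hashlib.md5(piece.encode("utf-8")).hexdigest(), 16)
    return len(vocab) + digest % NUM_UNK_BUCKETS  # bucket ids follow the base vocab
```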
This PR adds LoRA and prefix-tuning as modelling options (training and sampling code). Both have shown strong performance and can outperform fine-tuning. They can also protect against catastrophic forgetting...
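For reviewers, a minimal LoRA-style linear layer of the kind this PR adds; the rank and alpha values below are illustrative defaults, not the settings used in the PR:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""

    def __init__(self, in_features: int, out_features: int,
                 r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)      # pretrained weight stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank trainable update
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```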