Jordan Clive
The references in /evaluation/dart_reference are not for the current version. Can you replace them with the new references and share the tokenization script applied to the predictions? I am really not clear from the paper: 'computed as cosine similarity with annealing between the encodings h_x and h_y. It starts at 1 and ends at √d, linearly increasing over the first...
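My current reading of that quote, as a hedged sketch only: the cosine similarity is multiplied by a scale that is annealed linearly from 1 to √d over some number of initial steps (the horizon is truncated in the quote, so `anneal_steps` below is an assumed name, not from the paper):

```python
import math
import torch
import torch.nn.functional as F

def annealed_cosine_similarity(h_x: torch.Tensor, h_y: torch.Tensor,
                               step: int, anneal_steps: int) -> torch.Tensor:
    """Cosine similarity between encodings h_x and h_y, scaled by a factor
    that increases linearly from 1 to sqrt(d) over the first anneal_steps."""
    d = h_x.size(-1)
    progress = min(step / anneal_steps, 1.0)        # clamp once annealing ends
    scale = 1.0 + progress * (math.sqrt(d) - 1.0)   # 1 -> sqrt(d), linear
    return scale * F.cosine_similarity(h_x, h_y, dim=-1)
```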
Probably want no bias terms in the linear layers, to stop the model bloating (see the sketch below).
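A minimal sketch of what this would mean, assuming "no biases" refers to the bias terms of the projection layers (layer names and sizes here are illustrative):

```python
import torch.nn as nn

hidden = 768
# No bias vector is stored or trained for this projection.
proj = nn.Linear(hidden, hidden, bias=False)
```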
Need to check that I've implemented label smoothing the same way the authors smoothed their objective, since the objective function includes negative sampling.
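For reference, this is the standard label-smoothed cross-entropy I am checking against; whether it matches how the authors smooth an objective that includes negative sampling is exactly what still needs verifying, so treat it as an assumption rather than their formulation:

```python
import torch
import torch.nn.functional as F

def label_smoothed_nll(logits: torch.Tensor, target: torch.Tensor,
                       epsilon: float = 0.1) -> torch.Tensor:
    """logits: (batch, vocab); target: (batch,) of class indices."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)
    smooth = -log_probs.mean(dim=-1)   # uniform prior over the vocabulary
    return ((1.0 - epsilon) * nll + epsilon * smooth).mean()
```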
Currently using torch's GELU; the paper uses fast GELU (see the sketch below).
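Assuming "fast GELU" means the usual tanh approximation (as in GPT-2-style code), the switch would look roughly like this; recent PyTorch exposes it directly, otherwise it can be written out by hand:

```python
import math
import torch
import torch.nn as nn

gelu_exact = nn.GELU()                   # current choice: erf-based GELU
gelu_fast = nn.GELU(approximate='tanh')  # tanh approximation (PyTorch >= 1.12)

def fast_gelu(x: torch.Tensor) -> torch.Tensor:
    """Explicit tanh approximation, equivalent to approximate='tanh'."""
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi)
                                       * (x + 0.044715 * x.pow(3))))
```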
Implement BPE from scratch with UNK tokens hashed (although this may achieve worse results on downstream tasks, as it is perhaps not as general as bpemb's 25000.model).
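The distinctive part is the UNK hashing; a hedged sketch of that idea, where out-of-vocabulary pieces are mapped deterministically into a small set of UNK buckets instead of a single <unk> id (the bucket count and hashing scheme are assumptions, not bpemb's behaviour):

```python
import hashlib

NUM_UNK_BUCKETS = 100

def hashed_unk_id(piece: str, vocab: dict) -> int:
    """Return the vocab id for piece, hashing OOV pieces into UNK buckets."""
    if piece in vocab:
        return vocab[piece]
    digest = int(hashlib.md5(piece.encode("utf-8")).hexdigest(), 16)
    return len(vocab) + digest % NUM_UNK_BUCKETS  # bucket ids follow the base vocab
```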
This PR adds LoRA and prefix-tuning as modelling options (training and sampling code). Both have shown strong performance and can outperform fine-tuning. They can also protect against catastrophic forgetting...
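For reviewers, a minimal LoRA-style linear layer of the kind this PR adds; the rank and alpha values below are illustrative defaults, not the settings used in the PR:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""

    def __init__(self, in_features: int, out_features: int,
                 r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)      # pretrained weight stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank trainable update
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```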