"Rank classification" in evaluation for multiple choices
Hi,
Thanks for the repo! I was wondering if you could point me to the lines of code that implement the "rank classification" idea used for evaluating the multiple-choice style tasks?
The paper describes it as follows on page 6:
For tasks that involve choosing the correct completion from several options (e.g. multiple choice question answering), we follow Brown et al. (2020) and use rank classification to evaluate our model: we compute the log-likelihood of each of the target options under the fine-tuned model and select the option with the highest log-likelihood as the prediction. For simplicity, we do not apply length normalization to the log-likelihoods of the target options.
Thank you!
Ah, I think I found it, in the forward function of the customized EncoderDecoderModel class: https://github.com/bigscience-workshop/t-zero/blob/25c0761427f3894a8ec5a062a075b96037fb1492/t0/model.py#L56
However, I was wondering if you could provide a short tutorial on how we can apply the same idea to easily evaluate other LMs (say, a fine-tuned BART), so that the comparisons are fair.
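For context, here is my rough understanding of the scoring step as a minimal sketch (not taken from the repo; the function name and toy numbers are mine). In practice the per-token log-likelihoods would come from the model, e.g. by gathering `log_softmax` of the decoder logits at the gold target token ids for each candidate option, but the ranking itself is just a sum and an argmax, with no length normalization per the paper:

```python
import math

def rank_classify(option_token_logprobs):
    """Rank classification sketch: score each candidate completion by the
    sum of its per-token log-likelihoods under the model, and predict the
    option with the highest total score. Following the T0 paper, no length
    normalization is applied to the totals."""
    scores = [sum(logps) for logps in option_token_logprobs]
    # argmax over the summed log-likelihoods
    pred = max(range(len(scores)), key=lambda i: scores[i])
    return pred, scores

# toy example: per-token log-probs for three candidate answers
# (in a real setup these come from the model's decoder logits)
options = [
    [-0.2, -0.5],        # option 0: total log-likelihood -0.7
    [-0.1, -0.3, -0.4],  # option 1: total log-likelihood -0.8
    [-1.0],              # option 2: total log-likelihood -1.0
]
pred, scores = rank_classify(options)
```

Note that option 2 has the highest *per-token* average, but without length normalization option 0 wins on total log-likelihood, which is exactly the simplification the paper mentions. Is this the right picture, and would the same scoring loop transfer directly to a fine-tuned BART?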