
Get model evaluation working on the reward model trainer

bth5032 opened this issue 3 years ago · 8 comments

Based on #313, we are having issues with model evaluation in the reward model trainer (code in model/ranking). It seems that the evaluation results are not being computed or logged for some reason.

Preliminary research ~~mentioned in this post https://github.com/LAION-AI/Open-Assistant/pull/313#issuecomment-1370430372 (to summarize: it may have something to do with the "evaluation_strategy" argument in TrainerArgs)~~ was incorrect; see the comment below.
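For context, the struck-out hypothesis was about the Hugging Face `TrainingArguments` evaluation settings. A minimal, illustrative sketch of those knobs (not the repo's actual config) looks something like this:

```python
from transformers import TrainingArguments

# Illustrative only: with evaluation_strategy="steps" the HF Trainer runs its
# evaluation loop every eval_steps and logs the resulting metrics; with "no"
# it never evaluates, which would produce exactly the symptom described above.
training_args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="steps",
    eval_steps=500,
    logging_steps=100,
    per_device_eval_batch_size=8,
)
```

As noted, this turned out not to be the cause here, but it's the first thing to rule out.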

It is extremely important that we have a normalized and robust test suite for our models to help select between them for large-scale training. I've discussed this issue with @huyouare on Discord and he seems to want to take it over. I'll leave it to him to introduce himself.

pinging @theblackcat102

bth5032 avatar Jan 05 '23 01:01 bth5032

Update: it seems this issue is due to the rankgen model not returning labels here.

To fix this issue, simply return the labels (the label is always 0; this is not a problem because we always run positive and negative batches through the model and construct the logits after the fact, so no information is passed to the model) and run training with

```
python trainer.py configs/rankgen-t5-base.yml
```

to validate that accuracy is being logged.
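For anyone picking this up, here is a rough sketch of the shape of that fix, assuming a Hugging Face Trainer-style evaluation loop and a pairwise ranking head. The class and function names are illustrative, not the repo's actual code, and `trainer.py` may wire things differently:

```python
import numpy as np
import torch
import torch.nn.functional as F


def compute_metrics(eval_pred):
    # predictions: (batch, 2) logits with the positive answer in column 0;
    # label_ids: all zeros, i.e. column 0 should always win.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}


class PairwiseRankingModel(torch.nn.Module):
    def __init__(self, scorer):
        super().__init__()
        self.scorer = scorer  # anything mapping tokenized inputs to a (batch, 1) score

    def forward(self, pos_inputs, neg_inputs, labels=None):
        pos_scores = self.scorer(**pos_inputs)                 # (batch, 1)
        neg_scores = self.scorer(**neg_inputs)                 # (batch, 1)
        logits = torch.cat([pos_scores, neg_scores], dim=-1)   # (batch, 2)
        loss = -F.logsigmoid(pos_scores - neg_scores).mean()
        if labels is None:
            # Constant labels: the positive answer is always index 0.
            labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        # Returning the labels alongside loss/logits is what lets the evaluation
        # loop hand them to compute_metrics, so accuracy actually gets logged.
        return {"loss": loss, "logits": logits, "labels": labels}
```

The key point is only that the (constant) labels make it through the eval loop; they carry no information the model could exploit.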

The code can be further refactored to make it a bit cleaner.

I'm just going to do this myself, since it seems @huyouare never came along and I had set this up as a starter task for them.

bth5032 avatar Jan 10 '23 03:01 bth5032

Hey, I'm a research engineer currently working on a mix of ml research and software engineering with a focus on language modelling. I'd like to get involved and was wondering if this issue is free?

ghost avatar Jan 31 '23 15:01 ghost

> Hey, I'm a research engineer currently working on a mix of ml research and software engineering with a focus on language modelling. I'd like to get involved and was wondering if this issue is free?

I've changed my username to this one.

jackapbutler avatar Feb 01 '23 11:02 jackapbutler

@jackapbutler Yes, to the best of my knowledge this task is still open, but I'm not sure if the code in the repo is up to date. @theblackcat102 would know better.

Also, I'll mention that I did try the fix I described in my last post, but it didn't seem to work, so just a heads up there.

bth5032 avatar Feb 02 '23 02:02 bth5032

Cool @bth5032 and @theblackcat102, I'm happy to take a look at this and flag any issues I find in the process (assuming the problem still exists).

jackapbutler avatar Feb 09 '23 15:02 jackapbutler

Hi, currently I'm spending most of my time on #1621 and don't think I'll be able to get to this soon, so I thought I would un-assign myself; others can chime in if they want to take a look 👍

jackapbutler avatar Feb 20 '23 17:02 jackapbutler

I can pick this up.

maw501 avatar Feb 20 '23 17:02 maw501

PR above for this. ^^^

BTW: I noticed we're breaking ties somewhat arbitrarily for WebGPT, yet the dataset has many of them. Was this deliberate?
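For reference, a sketch of what dropping ties instead could look like, assuming the `openai/webgpt_comparisons` layout with `score_0`/`score_1` columns (the repo's own WebGPT loader may name things differently):

```python
from datasets import load_dataset

# Sketch: drop tied comparisons rather than breaking ties arbitrarily.
ds = load_dataset("openai/webgpt_comparisons", split="train")
non_tied = ds.filter(lambda row: row["score_0"] != row["score_1"])
print(f"kept {len(non_tied)} of {len(ds)} comparisons after dropping ties")
```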

maw501 avatar Feb 21 '23 15:02 maw501