GPT-J: Dataset issues
I have analyzed download_cnndm.py and dataset.py:
- the validation split is used, not the test split. Why?
- there are 13368 examples, which is the expected count ([dataset card](https://huggingface.co/datasets/cnn_dailymail#data-splits))
- Problem A: 399 examples are longer than 1919 tokens. The tokenizer currently trims the end, which means that in about 3% of examples we lose the trailing `\n\n### Response:` marker. This may impact the final accuracy score.
- Problem B: the reference summary of 250 examples is shorter than 30 tokens.
- Problem C: the reference summary of 591 examples is longer than 128 tokens.
- Problem D: two summaries are longer than the full prompt.

In total, 1171 examples (almost 9%) do not meet our constraints.
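A possible mitigation for Problem A would be to shorten only the article body, so the prompt always keeps its `### Response:` suffix instead of being cut off at the end. Below is a minimal sketch of that idea; `build_prompt` and `truncate_article` are hypothetical helpers, and a whitespace split stands in for the GPT-J tokenizer (the real script would pass the tokenizer's encode function):

```python
def build_prompt(instruction, article):
    # Simplified version of the Alpaca-style prompt used in this benchmark.
    return ("### Instruction:\n" + instruction +
            "\n\n### Input:\n" + article + "\n\n### Response:")

def truncate_article(instruction, article, max_tokens, tokenize=str.split):
    # Drop words from the end of the article until the whole prompt fits
    # the token budget; the "### Response:" marker is always preserved.
    words = article.split()
    while words and len(tokenize(build_prompt(instruction, " ".join(words)))) > max_tokens:
        words = words[:-1]
    return build_prompt(instruction, " ".join(words))
```

This keeps the instruction and response marker intact and only sacrifices the tail of the article, which is usually the least informative part for summarization.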
I see that there were some attempts to address this issue. Lines 38-50 of download_cnndm.py are dead code. Could you remove them?
Also, the default value of num_examples, set to 4869 in main.py#L29, is very confusing. Where does this number come from? Shouldn't we evaluate accuracy on the full dataset?
Dataset histograms:
The code used for generating my statistics:
```python
import json
from transformers import AutoTokenizer

def generate_prompt(sample):
    # Alpaca-style prompt used by the benchmark.
    return (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n" + sample['instruction'] + "\n\n### Input:\n" + sample['input'] + "\n\n### Response:"
    )

with open("cnn_eval.json") as f:
    dataset = json.load(f)

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    model_max_length=2048,
    padding_side="left",
    use_fast=False,
)
tokenizer.pad_token = tokenizer.eos_token  # WHY !?

# Print one "prompt_len, summary_len" pair per example.
for sample in dataset:
    summary_len = len(tokenizer(sample['output'])['input_ids'])
    prompt_len = len(tokenizer(generate_prompt(sample))['input_ids'])
    print("{}, {}".format(prompt_len, summary_len))
```
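The `(prompt_len, summary_len)` pairs printed above can be turned directly into the violation counts for Problems A-D. A sketch (the `count_violations` helper is hypothetical; the thresholds mirror the numbers quoted in this issue):

```python
def count_violations(pairs, max_prompt=1919, min_summary=30, max_summary=128):
    # pairs: (prompt_len, summary_len) per example, as printed by the script above.
    a = sum(p > max_prompt for p, s in pairs)   # Problem A: prompt too long
    b = sum(s < min_summary for p, s in pairs)  # Problem B: summary too short
    c = sum(s > max_summary for p, s in pairs)  # Problem C: summary too long
    d = sum(s > p for p, s in pairs)            # Problem D: summary longer than prompt
    return a, b, c, d
```

Running this over the full validation split should reproduce the 399 / 250 / 591 / 2 counts reported above.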
---

We have always used the validation set, not the test set, for MLPerf Inference benchmarking. I have removed the redundant code from download_cnndm.py and also updated max_examples in main.py.