GPT-J: Dataset issues
I have analyzed download_cnndm.py and dataset.py:
- the validation split is used, not the test split. Why?
- there are 13368 examples, which is the expected count ([dataset card](https://huggingface.co/datasets/cnn_dailymail#data-splits))
- Problem A: 399 examples are longer than 1919 tokens. The tokenizer currently trims the end, which means that in about 3% of examples we lose the trailing `\n\n### Response:` marker. This may impact the final accuracy score.
- Problem B: the reference summary of 250 examples is shorter than 30 tokens.
- Problem C: the reference summary of 591 examples is longer than 128 tokens.
- Problem D: two summaries are longer than the full prompt.

In total, 1171 examples (almost 9%) do not meet our constraints.
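A possible mitigation for Problem A would be to shorten only the article body, so the prompt always keeps its `### Response:` suffix instead of being cut off at the end. Below is a minimal sketch of that idea; `build_prompt` and `truncate_article` are hypothetical helpers, and a whitespace split stands in for the GPT-J tokenizer (the real script would pass the tokenizer's encode function):

```python
def build_prompt(instruction, article):
    # Simplified version of the Alpaca-style prompt used in this benchmark.
    return ("### Instruction:\n" + instruction +
            "\n\n### Input:\n" + article + "\n\n### Response:")

def truncate_article(instruction, article, max_tokens, tokenize=str.split):
    # Drop words from the end of the article until the whole prompt fits
    # the token budget; the "### Response:" marker is always preserved.
    words = article.split()
    while words and len(tokenize(build_prompt(instruction, " ".join(words)))) > max_tokens:
        words = words[:-1]
    return build_prompt(instruction, " ".join(words))
```

This keeps the instruction and response marker intact and only sacrifices the tail of the article, which is usually the least informative part for summarization.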
I see that there were some attempts to address this issue. Lines 38-50 of download_cnndm.py are dead code. Could you remove them?
Also, the default value of num_examples, set to 4869 in main.py#L29, is very confusing. Where does this number come from? Shouldn't we evaluate accuracy on the full dataset?
Dataset histograms:
The code used for generating my statistics:
```python
import json
from transformers import AutoTokenizer

def generate_prompt(sample):
    # Alpaca-style prompt used by the benchmark.
    return (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n" + sample['instruction'] + "\n\n### Input:\n" + sample['input'] + "\n\n### Response:"
    )

with open("cnn_eval.json") as f:
    dataset = json.load(f)

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    model_max_length=2048,
    padding_side="left",
    use_fast=False,
)
tokenizer.pad_token = tokenizer.eos_token  # WHY !?

# Print one "prompt_len, summary_len" pair per example.
for sample in dataset:
    summary_len = len(tokenizer(sample['output'])['input_ids'])
    prompt_len = len(tokenizer(generate_prompt(sample))['input_ids'])
    print("{}, {}".format(prompt_len, summary_len))
```
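The `(prompt_len, summary_len)` pairs printed above can be turned directly into the violation counts for Problems A-D. A sketch (the `count_violations` helper is hypothetical; the thresholds mirror the numbers quoted in this issue):

```python
def count_violations(pairs, max_prompt=1919, min_summary=30, max_summary=128):
    # pairs: (prompt_len, summary_len) per example, as printed by the script above.
    a = sum(p > max_prompt for p, s in pairs)   # Problem A: prompt too long
    b = sum(s < min_summary for p, s in pairs)  # Problem B: summary too short
    c = sum(s > max_summary for p, s in pairs)  # Problem C: summary too long
    d = sum(s > p for p, s in pairs)            # Problem D: summary longer than prompt
    return a, b, c, d
```

Running this over the full validation split should reproduce the 399 / 250 / 591 / 2 counts reported above.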
---

We have always used the validation set, not the test set, for MLPerf Inference benchmarking. I have removed the redundant code from download_cnndm.py and also updated max_examples in main.py.