Provide the shuffled index_mapping npy files for ease of reproducing training data

Open ziqi-zhang opened this issue 1 year ago • 1 comment

Hi,

I was wondering whether you could provide the index_mapping files that are generated by GPT2Dataset. From the construction of GPT2Dataset, I can see that there are three npy index files:

    doc_idx_filename = _filename + "_doc_idx.npy"
    sample_idx_filename = _filename + "_sample_idx.npy"
    shuffle_idx_filename = _filename + "_shuffle_idx.npy"

Could you provide a copy of these files so that I don't need to regenerate them?

I am asking because I want to study the influence of the original training data by chunk. I have prepared the pythia-dedup dataset, but I failed to build the environment. After reading the GPT2Dataset code, I found that with these index files I can reproduce the original Pythia training data.
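A minimal sketch of how the three index files could be combined to recover samples in training order. It assumes the Megatron-style layout that GPT2Dataset appears to follow: `shuffle_idx` permutes sample numbers, `sample_idx` stores (position in `doc_idx`, token offset) boundaries with consecutive samples sharing one token, and `doc_idx` gives the shuffled document order. The `docs` list, the toy arrays, and both function names are hypothetical stand-ins for illustration; the real dataset reads tokens from an indexed binary file rather than an in-memory list.

```python
import numpy as np


def load_index_files(prefix):
    """Load the three index arrays stored next to a dataset prefix.

    The file suffixes match the ones quoted above; `prefix` is whatever
    `_filename` evaluates to for a given run (not shown in this thread).
    """
    # mmap_mode="r" avoids reading large index arrays fully into memory.
    doc_idx = np.load(prefix + "_doc_idx.npy", mmap_mode="r")
    sample_idx = np.load(prefix + "_sample_idx.npy", mmap_mode="r")
    shuffle_idx = np.load(prefix + "_shuffle_idx.npy", mmap_mode="r")
    return doc_idx, sample_idx, shuffle_idx


def resolve_sample(i, docs, doc_idx, sample_idx, shuffle_idx):
    """Return the tokens of the i-th sample in training order (assumed layout)."""
    s = int(shuffle_idx[i])  # undo the sample-level shuffle
    first_doc, first_off = (int(x) for x in sample_idx[s])
    last_doc, last_off = (int(x) for x in sample_idx[s + 1])
    if first_doc == last_doc:
        # The whole sample lies inside a single document.
        return docs[doc_idx[first_doc]][first_off:last_off + 1]
    # Otherwise stitch together the tail of the first document, any whole
    # documents in between, and the head of the last document.
    parts = [docs[doc_idx[first_doc]][first_off:]]
    for d in range(first_doc + 1, last_doc):
        parts.append(docs[doc_idx[d]])
    parts.append(docs[doc_idx[last_doc]][:last_off + 1])
    return np.concatenate(parts)


# Toy data: three tokenized documents and hand-built index arrays
# for a sequence length of 3 (each sample holds 3 + 1 tokens).
docs = [np.array([0, 1, 2, 3, 4]), np.array([10, 11, 12]), np.array([20, 21, 22, 23])]
toy_doc_idx = np.array([2, 0, 1])
toy_sample_idx = np.array([[0, 0], [0, 3], [1, 2], [2, 0]])
toy_shuffle_idx = np.array([2, 0, 1])

first_training_sample = resolve_sample(0, docs, toy_doc_idx, toy_sample_idx, toy_shuffle_idx)
# first_training_sample is [2, 3, 4, 10]: it spans the end of one document
# and the start of the next.
```

With the real files, `load_index_files` would supply the three arrays, and `docs[...]` would be replaced by reads from the tokenized dataset.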

I noticed that you provide batch_viewer.py to check the unshuffled data, but it seems that this data still differs from the actual training data that is fed into the model during training.

Thanks

Mar 14 '24 17:03 ziqi-zhang

Hi, this doesn't answer your question regarding the deduped data index files. I just wanted to mention that the index files for the non-deduped Pile are available as part of the PolyPythia seeds (in the folder 'seed0', which is actually seed 1234).

link: https://huggingface.co/datasets/EleutherAI/pile-preshuffled-seeds/tree/main/seed0

May 13 '25 23:05 efittschen