
Provide the shuffled index_mapping npy files for ease of reproducing training data

Open • ziqi-zhang opened this issue 1 year ago • 1 comment

Hi,

I was wondering whether you could provide the index_mapping files that are generated by GPT2Dataset. From the construction of GPT2Dataset here, I can see that there are three npy index files:

    doc_idx_filename = _filename + "_doc_idx.npy"
    sample_idx_filename = _filename + "_sample_idx.npy"
    shuffle_idx_filename = _filename + "_shuffle_idx.npy"

Could you provide a copy of these files so that I don't need to regenerate them?

I am asking because I want to study the influence of the original training data by chunk. I have prepared the pythia-dedup dataset, but I failed to build the environment. After reading the GPT2Dataset code, I found that with these index files I can reproduce the original Pythia training data.
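
To make the request concrete, here is a minimal sketch of how I understand the three index arrays combine to yield one training sample. It follows the Megatron-style lookup that GPT2Dataset appears to use, but it is my reading of the code, not the actual Pythia implementation. The function name `get_sample`, the in-memory `documents` list, and the toy index values are all hypothetical; the real arrays would come from loading the `.npy` files, and the real tokens from the tokenized dataset.

```python
def get_sample(idx, documents, doc_idx, sample_idx, shuffle_idx):
    """Return the tokens of training sample `idx` (seq_length + 1 tokens).

    documents   : list of token lists (hypothetical stand-in for the tokenized dataset)
    doc_idx     : order in which documents are concatenated
    sample_idx  : rows of (position in doc_idx, token offset) marking sample boundaries
    shuffle_idx : maps training order -> row of sample_idx
    """
    s = shuffle_idx[idx]
    doc_f, off_f = sample_idx[s]      # where this sample starts
    doc_l, off_l = sample_idx[s + 1]  # where the next sample starts (inclusive end here)
    if doc_f == doc_l:
        # The whole sample lies inside a single document.
        return list(documents[doc_idx[doc_f]][off_f:off_l + 1])
    # Otherwise: tail of the first document, any whole documents in
    # between, then the head of the last document.
    tokens = list(documents[doc_idx[doc_f]][off_f:])
    for pos in range(doc_f + 1, doc_l):
        tokens.extend(documents[doc_idx[pos]])
    tokens.extend(documents[doc_idx[doc_l]][:off_l + 1])
    return tokens


# Toy data (assumed values, seq_length = 4, so each sample has 5 tokens).
documents = [[0, 1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12, 13]]
doc_idx = [2, 0, 1]                                # concatenation order of documents
sample_idx = [(0, 0), (0, 4), (1, 2), (2, 1)]      # sample boundaries in that stream
shuffle_idx = [2, 0, 1]                            # order in which samples are served

first_sample = get_sample(0, documents, doc_idx, sample_idx, shuffle_idx)
```

Consecutive samples overlap by one token because each one carries seq_length + 1 tokens (inputs plus the shifted labels). With all three arrays in hand, iterating `idx` in order should replay the shuffled sample sequence without rebuilding the indices.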

I noticed that you provide batch_viewer.py to check the unshuffled data, but it seems that this data still differs from the actual training data that is fed into the model during training.

Thanks

ziqi-zhang avatar Mar 14 '24 17:03 ziqi-zhang

Hi, this doesn't answer your question regarding the deduped data index files. I just wanted to mention that the index files for the non-deduped Pile are available as part of the PolyPythia seeds (in the folder 'seed0', which is actually seed 1234).

link: https://huggingface.co/datasets/EleutherAI/pile-preshuffled-seeds/tree/main/seed0

efittschen avatar May 13 '25 23:05 efittschen