Discussion about dataset preparation speed
🎯 Description
With 1,250 bin files, dataset preparation takes around 6.5 minutes on 100 cores, 1TB RAM, and 8 GPUs.
However, if I simply iterate through these files, read the document sizes, and write them to another file to simulate basic indexing, it takes only about 13 seconds with 12 processes (to match the scenario above: 100 cores / 8 GPUs ≈ 12 processes per GPU).
```python
from fast_llm.data.dataset.gpt.memmap import GPTMemmapDataset
import numpy as np
import time
import yaml
import multiprocessing as mp


def count_tokens_save_doc_sizes(file, out_file):
    ds = GPTMemmapDataset("ds", file)
    sizes = ds.get_document_sizes().astype(np.int64)
    sizes.tofile(out_file)
    return np.sum(sizes)


if __name__ == "__main__":
    with open("/mnt/datasets/test/denis/fineweb_the_stack_3b.yaml", "rt") as f:
        data = yaml.safe_load(f)
    files = [el["path"] for el in data["datasets"][0]["datasets"]] + [
        el["path"] for el in data["datasets"][1]["datasets"]
    ]

    t0 = time.time()
    with mp.Pool(processes=12) as pool:
        ttl = sum(
            pool.starmap(
                count_tokens_save_doc_sizes,
                [(file, f"/mnt/datasets/test/denis/{i}.np") for i, file in enumerate(files)],
            )
        )
    print(f"time {time.time() - t0}")
    print(len(files), ttl)
```
So, I wonder—can we further increase the dataset preparation speed?
We'd have to find out where the time is being spent; I suspect tokenization. It's a Hugging Face component, so we don't have much control over it, but it seems we are calling `tokenizer.encode` in a loop rather than batch-encoding, so there may be some potential gain there.
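As a minimal sketch of the pattern change being suggested: Hugging Face fast tokenizers accept a list of texts in a single call and can parallelize internally, whereas `tokenizer.encode` handles one text at a time. The stand-in `toy_encode`/`toy_encode_batch` functions below are hypothetical placeholders for the real tokenizer calls; only the loop-vs-batch structure is the point.

```python
# Stand-in for tokenizer.encode(text): encodes one document at a time.
def toy_encode(text):
    return text.split()

# Stand-in for tokenizer(texts, ...): one call over the whole batch,
# which a real fast tokenizer can parallelize internally.
def toy_encode_batch(texts):
    return [t.split() for t in texts]

docs = ["a b c", "d e", "f"]

# Current pattern: one encode call per document.
per_doc = [toy_encode(d) for d in docs]

# Batched pattern: a single call over all documents.
batched = toy_encode_batch(docs)

assert per_doc == batched  # same token ids, fewer Python-level calls
```

With a real fast tokenizer the batched call would look like `tokenizer(docs)["input_ids"]`; whether it helps in practice depends on how much of the 6.5 minutes is actually spent in tokenization.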
Sorry, @jlamypoirier, I meant data preparation during training, not data preparation using the prepare command.
When I start a training experiment, it takes about 6.5 minutes to process ~1.2k files. I assume this is due to sampling and writing the sampling cache, correct?