
Discussion about dataset preparation speed

Open bigximik opened this issue 11 months ago • 2 comments

🎯 Description

With 1,250 bin files, dataset preparation takes around 6.5 minutes on 100 cores, 1TB RAM, and 8 GPUs.

However, if I simply iterate through these files, read the document sizes, and write them to another file to simulate basic indexing, it takes only about 13 seconds with 12 processes (roughly matching the scenario above: 100 cores / 8 GPUs ≈ 12 processes per GPU).

from fast_llm.data.dataset.gpt.memmap import GPTMemmapDataset
import numpy as np
import time

import yaml
import multiprocessing as mp


def count_tokens_save_doc_sizes(file, out_file):
    # Read the per-document sizes from the memory-mapped dataset,
    # dump them to a flat binary file, and return the total token count.
    ds = GPTMemmapDataset("ds", file)
    sizes = ds.get_document_sizes().astype(np.int64)
    sizes.tofile(out_file)
    return np.sum(sizes)


if __name__ == "__main__":
    with open("/mnt/datasets/test/denis/fineweb_the_stack_3b.yaml", "rt") as f:
        data = yaml.safe_load(f)

    files = [el["path"] for el in data["datasets"][0]["datasets"]] + [
        el["path"] for el in data["datasets"][1]["datasets"]
    ]

    t0 = time.time()

    # Index all files in parallel, one worker task per file.
    with mp.Pool(processes=12) as pool:
        ttl = sum(
            pool.starmap(
                count_tokens_save_doc_sizes,
                [(file, f"/mnt/datasets/test/denis/{i}.np") for i, file in enumerate(files)],
            )
        )

    print(f"time {time.time()-t0}")

    print(len(files), ttl)

So, I wonder—can we further increase the dataset preparation speed?
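If the startup cost is dominated by repeatedly re-reading per-file size information, one option would be to persist the sizes next to each `.bin` file and memory-map them on subsequent startups. This is only a minimal sketch, not Fast-LLM's actual API; `get_document_sizes` here is a stand-in callable:

```python
import numpy as np
from pathlib import Path


def load_or_build_sizes(bin_path: str, get_document_sizes) -> np.ndarray:
    # Sidecar cache file, e.g. "data.bin" -> "data.sizes.npy".
    cache = Path(bin_path).with_suffix(".sizes.npy")
    if cache.exists():
        # Cheap path: memory-map the cached sizes, no full scan.
        return np.load(cache, mmap_mode="r")
    # Slow path: compute the sizes once and cache them for next time.
    sizes = np.asarray(get_document_sizes(), dtype=np.int64)
    np.save(cache, sizes)
    return sizes
```

On the second startup the sidecar is hit and the expensive scan is skipped entirely, which is essentially what the 13-second indexing experiment above simulates.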

bigximik avatar Mar 25 '25 15:03 bigximik

We'd have to find out where the time is being spent; I suspect tokenization. It's a Hugging Face thing, so we don't have much control over it, but it seems we are using tokenizer.encode in a loop rather than batch encoding, so maybe there is some potential gain there.
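As a rough illustration of the batch-encoding idea (the `gpt2` checkpoint and sample texts below are placeholders, not from Fast-LLM):

```python
# Compare per-document tokenizer.encode() in a loop with a single
# batched call; the fast (Rust-backed) tokenizer can parallelize
# the batched call internally across documents.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
texts = ["first document", "second document", "third document"]

# Current pattern: one encode() call per document.
loop_ids = [tokenizer.encode(t) for t in texts]

# Batched pattern: one call over the whole list.
batch_ids = tokenizer(texts)["input_ids"]

assert loop_ids == batch_ids  # same tokens, fewer Python-level calls
```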

jlamypoirier avatar Mar 26 '25 23:03 jlamypoirier

Sorry, @jlamypoirier, I meant data preparation during training, not data preparation using the prepare command.

When I start a training experiment, it takes about 6.5 minutes to process the ~1.2k files. I assume this is due to sampling and writing the sampling cache; is that correct?
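A quick profile of the startup path would settle where those 6.5 minutes actually go. A minimal sketch using the standard library profiler, with a placeholder in place of the real sampling/caching step:

```python
import cProfile
import pstats


def sample_and_cache():
    # Placeholder for the actual training-startup sampling/caching work.
    return sum(i * i for i in range(1_000_000))


with cProfile.Profile() as prof:
    sample_and_cache()

# Print the 20 most expensive calls by cumulative time.
pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
```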

bigximik avatar Mar 31 '25 06:03 bigximik