Excessive RAM Usage After Dataset Concatenation concatenate_datasets

Open sam-hey opened this issue 1 year ago • 3 comments

Describe the bug

When loading a dataset from disk, concatenating it, and starting the training process, the RAM usage progressively increases until the kernel terminates the process due to excessive memory consumption.

https://github.com/huggingface/datasets/issues/2276

Steps to reproduce the bug

from datasets import DatasetDict, concatenate_datasets

dataset = DatasetDict.load_from_disk("data")

...
...

combined_dataset = concatenate_datasets(
    [dataset[split] for split in dataset]
)

# Start SentenceTransformer training

Expected behavior

I would not expect RAM utilization to increase after concatenation. Removing the concatenation step resolves the issue.
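To compare runs with and without the concatenation step, peak memory can be sampled around each stage. The following is a minimal sketch using only the standard library (Unix only); the helper names `peak_rss_mib` and `run_and_report` are hypothetical and not part of `datasets`.

```python
import resource
import sys
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def run_and_report(label: str, step: Callable[[], T]) -> Tuple[T, float]:
    """Run `step`, print how much the peak RSS grew, and return (result, growth)."""
    before = peak_rss_mib()
    result = step()
    grown = peak_rss_mib() - before
    print(f"{label}: peak RSS grew by {grown:.1f} MiB")
    return result, grown
```

For example, the concatenation could be wrapped as `run_and_report("concatenate", lambda: concatenate_datasets([dataset[split] for split in dataset]))`, and the training call wrapped the same way. Because this tracks the peak rather than the current RSS, it only shows growth, never memory being released.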

Environment info

sentence-transformers==3.1.1 datasets==3.2.0

python3.10

sam-hey avatar Jan 16 '25 16:01 sam-hey

Image

Image

Adding an image from memray: https://gist.github.com/sam-hey/00c958f13fb0f7b54d17197fe353002f

sam-hey avatar Jan 17 '25 07:01 sam-hey

I'm having the same issue where concatenation seems to use a huge amount of RAM.

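import logging

from datasets import Dataset, concatenate_datasets
from tqdm import tqdm

# Load all chunks and concatenate them into a final dataset.
# chunk_files is not shown in the original snippet; it is defined elsewhere.
chunk_datasets = [
    Dataset.load_from_disk(file, keep_in_memory=False)
    for file in tqdm(chunk_files, desc="Loading chunk datasets")
]
logging.info("Concatenating chunk datasets...")
final_dataset = concatenate_datasets(chunk_datasets)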

This is a real issue for me, as the final dataset is a few terabytes in size. I'm using datasets version 3.1.0 and also tested with version 3.4.1.

nepfaff avatar Mar 26 '25 14:03 nepfaff

I had a brief look; the problem seems to come from memory_map and the stream not being closed.

https://github.com/huggingface/datasets/blob/5f8d2ad9a1b0bccfd962d998987228addfd5be9f/src/datasets/table.py#L48-L50

I did not have time to test it yet: https://github.com/sam-hey/datasets/tree/fix/concatenate_datasets

I will probably have a better look in a couple of days.

sam-hey avatar Mar 27 '25 17:03 sam-hey