Excessive RAM usage after dataset concatenation with concatenate_datasets
Describe the bug
When a dataset is loaded from disk, concatenated, and training is started, RAM usage progressively increases until the kernel terminates the process because of excessive memory consumption.
Related: https://github.com/huggingface/datasets/issues/2276
Steps to reproduce the bug
from datasets import DatasetDict, concatenate_datasets
dataset = DatasetDict.load_from_disk("data")
...
...
combined_dataset = concatenate_datasets(
    [dataset[split] for split in dataset]
)
# start SentenceTransformer training
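For reference (not part of the original report), a minimal sketch of how the growth can be observed: print the resident set size while iterating over combined_dataset, which stands in here for the SentenceTransformer training loop. The use of psutil is an assumption for illustration.

import os
import psutil

process = psutil.Process(os.getpid())

# Sketch: iterate over the concatenated dataset and print resident memory.
# In the actual setup this iteration happens inside SentenceTransformer training.
for i, example in enumerate(combined_dataset):
    if i % 100_000 == 0:
        print(f"step {i}: RSS = {process.memory_info().rss / 1024**2:.0f} MiB")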
Expected behavior
I would not expect RAM utilization to increase after concatenation. Removing the concatenation step resolves the issue.
Environment info
sentence-transformers==3.1.1
datasets==3.2.0
Python 3.10
Adding an image from memray: https://gist.github.com/sam-hey/00c958f13fb0f7b54d17197fe353002f
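For anyone who wants to reproduce a profile like the one in the gist, memray can be attached from the command line (memray run script.py, then memray flamegraph on the output) or via its Python API. Below is a sketch of the latter wrapped around the snippet above; the output file name is made up.

import memray

# Sketch: record allocations around concatenation and the start of training,
# then inspect the result with: memray flamegraph concat_profile.bin
with memray.Tracker("concat_profile.bin"):
    combined_dataset = concatenate_datasets(
        [dataset[split] for split in dataset]
    )
    # start SentenceTransformer training here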
I'm having the same issue where concatenation seems to use a huge amount of RAM.
import logging
from datasets import Dataset, concatenate_datasets
from tqdm import tqdm

# Load all chunks and concatenate them into a final dataset.
# chunk_files: paths to dataset chunks previously saved to disk.
chunk_datasets = [
    Dataset.load_from_disk(file, keep_in_memory=False)
    for file in tqdm(chunk_files, desc="Loading chunk datasets")
]
logging.info("Concatenating chunk datasets...")
final_dataset = concatenate_datasets(chunk_datasets)
This is a real issue for me, as the final dataset is a few terabytes in size. I'm using datasets version 3.1.0; also tested with version 3.4.1.
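A possible (untested) workaround, at the cost of rewriting the data once: materialize the concatenated dataset with save_to_disk and reload it, so training reads from a single freshly written copy instead of a table backed by every chunk's memory map. The path below is a placeholder, and at a few terabytes the extra disk usage and rewrite time may make this impractical.

# Untested workaround sketch: write the concatenated dataset back to disk
# and reload it, dropping references to the per-chunk tables.
final_dataset = concatenate_datasets(chunk_datasets)
final_dataset.save_to_disk("final_dataset")

del chunk_datasets, final_dataset
final_dataset = Dataset.load_from_disk("final_dataset", keep_in_memory=False)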
I had a short look; the error seems to come from memory_map and the stream not being closed.
https://github.com/huggingface/datasets/blob/5f8d2ad9a1b0bccfd962d998987228addfd5be9f/src/datasets/table.py#L48-L50
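One way to check the unclosed-stream hypothesis without digging through the profiler output: compare the process's file-descriptor and memory-map counts before and after loading and concatenating the chunks. Sketch below, assuming Linux and psutil; chunk_files is the same list as above.

import psutil
from datasets import Dataset, concatenate_datasets

proc = psutil.Process()
print("open fds before:", proc.num_fds())

chunk_datasets = [Dataset.load_from_disk(f, keep_in_memory=False) for f in chunk_files]
final_dataset = concatenate_datasets(chunk_datasets)

# Rough diagnostic only: watch whether the descriptor/mapping counts scale
# with the number of chunk files.
print("open fds after:", proc.num_fds())
print("memory maps:", len(proc.memory_maps()))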
I did not have time to test it yet: https://github.com/sam-hey/datasets/tree/fix/concatenate_datasets
I will probably have a better look in a couple of days.