fix: loading of datasets from disk (#7373)
Fixes dataset loading from disk by ensuring that memory maps and streams are properly closed.
For more details, see https://github.com/huggingface/datasets/issues/7373.
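For reference, a minimal sketch of the close-the-source pattern (not the actual diff in this PR; `data.arrow` is a placeholder path), using pyarrow's context manager so the memory map is closed once the table has been read:

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Sketch, assuming a dataset stored as an Arrow IPC file: open it through a
# memory map and close the map as soon as the table has been read, instead of
# leaving the NativeFile open for the lifetime of the process. The read is
# zero-copy; the table's buffers keep the mapped region alive after close().
with pa.memory_map("data.arrow", "r") as source:  # placeholder path
    table = ipc.open_file(source).read_all()
```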
@nepfaff Could you confirm if this fixes the issue for you? I checked with Memray, and everything looked good on my end.
Install: `pip install git+https://github.com/sam-hey/datasets.git@fix/concatenate_datasets`
Will aim to get to this soon. I don't have a rapid testing pipeline set up, but need to wait for some AWS nodes to become free.
I've now set up a small experiment:
```python
import logging
import os

import psutil
from datasets import Dataset, concatenate_datasets

logging.basicConfig(level=logging.INFO)
N = 5  # placeholder; specific to my setup
dataset_path = "/path/to/dataset"  # placeholder; specific to my setup

# Log initial RAM usage
process = psutil.Process(os.getpid())
initial_ram = process.memory_info().rss / (1024 * 1024)  # convert to MB
logging.info(f"Initial RAM usage: {initial_ram:.2f} MB")

# Load N memory-mapped copies of the dataset and concatenate them
chunk_datasets = [
    Dataset.load_from_disk(dataset_path, keep_in_memory=False) for _ in range(N)
]
combined_dataset = concatenate_datasets(chunk_datasets)

# Log final RAM usage
final_ram = process.memory_info().rss / (1024 * 1024)  # convert to MB
ram_diff = final_ram - initial_ram
logging.info(f"Final RAM usage: {final_ram:.2f} MB")
logging.info(f"RAM usage increase: {ram_diff:.2f} MB")
```
The RAM usage is linearly correlated with N on `datasets` master!
For my test dataset (roughly 5.2 GB of RSS growth per loaded copy):
- N=5 => RAM usage increase: 26302.91 MB
- N=10 => RAM usage increase: 52315.18 MB
- N=20 => RAM usage increase: 104510.65 MB
- N=40 => RAM usage increase: 209166.30 MB
Unfortunately, your patch doesn't seem to change this:

```console
$ pip install git+https://github.com/sam-hey/datasets.git@fix/concatenate_datasets
$ pip list | grep datasets
datasets    3.5.1.dev0
```

This gives exactly the same RAM statistics.
Edit: These results are a bit flawed: the memory increase seems to come entirely from `Dataset.load_from_disk(dataset_path, keep_in_memory=False)` (which I don't think should happen either?) and not from `concatenate_datasets`. This differs from my large-scale setup, which runs out of memory during `concatenate_datasets`, but I haven't been able to replicate that here...
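To separate the two phases, a sketch along these lines (assuming the same `N` and `dataset_path` as in the experiment above) logs RSS after loading and again after concatenation:

```python
import os

import psutil
from datasets import Dataset, concatenate_datasets

process = psutil.Process(os.getpid())

def rss_mb() -> float:
    """Resident set size of this process, in MB."""
    return process.memory_info().rss / 2**20

start = rss_mb()

# Phase 1: load N memory-mapped copies of the dataset
chunk_datasets = [
    Dataset.load_from_disk(dataset_path, keep_in_memory=False) for _ in range(N)
]
after_load = rss_mb()

# Phase 2: concatenate them
combined_dataset = concatenate_datasets(chunk_datasets)
after_concat = rss_mb()

print(f"load_from_disk:       +{after_load - start:.2f} MB")
print(f"concatenate_datasets: +{after_concat - after_load:.2f} MB")
```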
Thanks a lot, @nepfaff, for taking a look at this! It seems that `concatenate_datasets()` is fixed with this PR. I can also confirm that loading a large number of files shows significant memory usage. However, as I understand it, this is expected rather than a bug, since the memory consumption stems from `pa.memory_map()`, which returns a memory-mapped file: the mapped pages count toward RSS even though the OS can reclaim them under memory pressure.
This behavior might be related to this bug: https://github.com/apache/arrow/issues/34423
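One quick way to check whether the growth is mapped pages rather than heap allocations (a sketch; `dataset_path` as above) is to compare the process RSS with Arrow's own allocator counter:

```python
import os

import psutil
import pyarrow as pa
from datasets import Dataset

process = psutil.Process(os.getpid())
rss_before = process.memory_info().rss
heap_before = pa.total_allocated_bytes()

ds = Dataset.load_from_disk(dataset_path, keep_in_memory=False)

# For a purely memory-mapped load, RSS grows with the file size while
# Arrow's allocator counter stays near zero.
print(f"RSS increase:        {(process.memory_info().rss - rss_before) / 2**20:.2f} MB")
print(f"Arrow heap increase: {(pa.total_allocated_bytes() - heap_before) / 2**20:.2f} MB")
```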
Great! Have you tested that it also fixes the memory issue in your case, @iamollas?
Happy to know that it works for you, @sam-hey! Looking forward to merging this.