
fix: loading of datasets from Disk (#7373)

Open sam-hey opened this issue 10 months ago • 6 comments

Fixes dataset loading from disk by ensuring that memory maps and streams are properly closed.

For more details, see https://github.com/huggingface/datasets/issues/7373.
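For illustration, a minimal sketch of the pattern the fix description refers to (not the actual `datasets` internals; the helper name and structure are hypothetical): read the Arrow table from a memory-mapped file and close the handle afterwards, instead of keeping it open for the lifetime of the dataset object.

    import pyarrow as pa

    def _read_table_from_disk(filename: str) -> pa.Table:
        # Hypothetical helper (not the real datasets code): read the table,
        # then close the memory map instead of leaving it open.
        memory_mapped_file = pa.memory_map(filename)
        try:
            reader = pa.ipc.open_stream(memory_mapped_file)
            table = reader.read_all()
        finally:
            # Assumption: closing releases the open file handle while the
            # buffers already read from the map remain usable.
            memory_mapped_file.close()
        return table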

sam-hey avatar Mar 29 '25 16:03 sam-hey

@nepfaff Could you confirm if this fixes the issue for you? I checked Memray, and everything looked good on my end.

Install: pip install git+https://github.com/sam-hey/datasets.git@fix/concatenate_datasets

sam-hey avatar Mar 31 '25 11:03 sam-hey

Will aim to get to this soon. I don't have a rapid testing pipeline set up and need to wait for some AWS nodes to become free.

nepfaff avatar Apr 02 '25 02:04 nepfaff

I now set up a small experiment:

    import logging
    import os

    import psutil
    from datasets import Dataset, concatenate_datasets

    logging.basicConfig(level=logging.INFO)

    N = 5  # number of copies of the dataset to load
    dataset_path = "/path/to/dataset"  # placeholder: dataset saved with save_to_disk

    # Log initial RAM usage
    process = psutil.Process(os.getpid())
    initial_ram = process.memory_info().rss / (1024 * 1024)  # Convert to MB
    logging.info(f"Initial RAM usage: {initial_ram:.2f} MB")

    # Load the same on-disk dataset N times and concatenate the copies
    chunk_datasets = [
        Dataset.load_from_disk(dataset_path, keep_in_memory=False) for _ in range(N)
    ]
    combined_dataset = concatenate_datasets(chunk_datasets)

    # Log final RAM usage
    final_ram = process.memory_info().rss / (1024 * 1024)  # Convert to MB
    ram_diff = final_ram - initial_ram
    logging.info(f"Final RAM usage: {final_ram:.2f} MB")
    logging.info(f"RAM usage increase: {ram_diff:.2f} MB")

The RAM usage is linearly correlated with N on datasets master!

For my test dataset:

  • N=5 => RAM usage increase: 26302.91 MB
  • N=10 => RAM usage increase: 52315.18 MB
  • N=20 => RAM usage increase: 104510.65 MB
  • N=40 => RAM usage increase: 209166.30 MB

Unfortunately, your patch doesn't seem to change this:

pip install git+https://github.com/sam-hey/datasets.git@fix/concatenate_datasets
pip list | grep datasets
datasets                 3.5.1.dev0

Gives exactly the same RAM statistics.

Edit: The results are a bit flawed: the memory increase seems to come entirely from Dataset.load_from_disk(dataset_path, keep_in_memory=False) (which I don't think should happen either?) and not from concatenate_datasets. This differs from my large-scale setup, which runs out of memory during concatenate_datasets, but I can't seem to replicate that here...
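One way to narrow this down (a hedged sketch building on the snippet above; `dataset_path` and `N` are placeholders) is to log RSS separately after the loading step and after the concatenation step:

    import logging
    import os

    import psutil
    from datasets import Dataset, concatenate_datasets

    logging.basicConfig(level=logging.INFO)

    def log_rss(label: str) -> float:
        """Log and return the current resident set size in MB."""
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 * 1024)
        logging.info(f"{label}: {rss_mb:.2f} MB")
        return rss_mb

    dataset_path = "/path/to/dataset"  # placeholder: dataset saved with save_to_disk
    N = 5

    before_load = log_rss("RSS before load_from_disk")
    chunk_datasets = [
        Dataset.load_from_disk(dataset_path, keep_in_memory=False) for _ in range(N)
    ]
    after_load = log_rss("RSS after load_from_disk")

    combined_dataset = concatenate_datasets(chunk_datasets)
    after_concat = log_rss("RSS after concatenate_datasets")

    logging.info(f"Increase during load:   {after_load - before_load:.2f} MB")
    logging.info(f"Increase during concat: {after_concat - after_load:.2f} MB")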

nepfaff avatar Apr 03 '25 03:04 nepfaff

Thanks a lot, @nepfaff, for taking a look at this! It seems that concatenate_datasets() is fixed with this PR. I can also confirm that loading a large number of files requires significant memory. However, as I understand it, this is either expected behavior or a separate bug, since the memory consumption stems from pa.memory_map(), which returns a memory-mapped file.

This behavior might be related to this bug: https://github.com/apache/arrow/issues/34423

[Screenshot attached: 2025-04-03 at 16:01:11]
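If the growth really comes from memory-mapped pages rather than heap allocations, it should not show up in Arrow's own allocator statistics. A quick, hedged way to check this (the path is a placeholder):

    import os

    import psutil
    import pyarrow as pa
    from datasets import Dataset

    dataset_path = "/path/to/dataset"  # placeholder

    rss_before = psutil.Process(os.getpid()).memory_info().rss
    ds = Dataset.load_from_disk(dataset_path, keep_in_memory=False)
    rss_after = psutil.Process(os.getpid()).memory_info().rss

    # RSS counts memory-mapped pages that have been touched, while Arrow's
    # default memory pool only tracks heap allocations it made itself.
    print(f"RSS increase:           {(rss_after - rss_before) / 1024 ** 2:.2f} MB")
    print(f"Arrow heap allocations: {pa.total_allocated_bytes() / 1024 ** 2:.2f} MB")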

sam-hey avatar Apr 03 '25 13:04 sam-hey

Great! Have you tested that it also fixes the memory issue in your case @iamollas?

Happy to know that it works for you @sam-hey ! Looking forward to merging this

lhoestq avatar Apr 24 '25 16:04 lhoestq
