
Unexpected keyword argument 'image_size' when running benchmark

Open noctarius opened this issue 1 year ago • 5 comments

Hey folks!

I saw that someone asked the same question yesterday on the mailing list, but nobody has answered, so I thought I'd bring it here since I'm running into the same issue.

When I try to run the benchmark, the process stops when trying to read the training data and complains about a "TypeError: Profile.update() got an unexpected keyword argument 'image_size'".

I generated the data with:

./benchmark.sh datagen --hosts <IP> --workload unet3d --accelerator-type a100 --num-parallel 8 --param dataset.num_files_train=3500 --param dataset.data_folder=unet3d_data

And running the benchmark fails with:

HYDRA_FULL_ERROR=1 ./benchmark.sh run --hosts <IP> --workload unet3d --accelerator-type a100 --num-accelerators 1 --results-dir resultsdir --param dataset.num_files_train=3500 --param dataset.data_folder=unet3d_data
[INFO] 2024-08-13T16:49:38.250049 Running DLIO with 1 process(es) [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py:100]
[INFO] Total amount of data each host will consume is 477.86366008222103 GB; each host has [30.648590087890625] GB memory [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:121]
[INFO] 2024-08-13T16:49:43.983959 Max steps per epoch: 500 = 1 * 3500 / 7 / 1 (samples per file * num files / batch size / comm size) [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py:321]
[INFO] 2024-08-13T16:49:43.998867 Starting epoch 1: 500 steps expected [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:192]
[INFO] 2024-08-13T16:49:44.009325 Starting block 1 [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:264]
Error executing job with overrides: ['workload=unet3d_a100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=3500', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 402, in <module>
    main()
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 397, in main
    benchmark.run()
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 343, in run
    steps = self._train(epoch)
            ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 263, in _train
    for batch in dlp.iter(loader.next()):
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/data_loader/torch_data_loader.py", line 174, in next
    for batch in self._dataset:
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
    data.reraise()
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
    raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/data_loader/torch_data_loader.py", line 84, in __getitem__
    return self.reader.read_index(image_idx, step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/reader/npz_reader.py", line 57, in read_index
    return super().read_index(image_idx, step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/reader/reader_handler.py", line 114, in read_index
    self.get_sample(filename, sample_index)
  File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/reader/npz_reader.py", line 48, in get_sample
    dlp.update(image_size=image.nbytes)
TypeError: Profile.update() got an unexpected keyword argument 'image_size'

Running the benchmark in a venv with Python 3.12.3, Ubuntu 24.04 LTS, on kernel 6.8.0-1009-aws #9-Ubuntu SMP Fri May 17 14:39:23 UTC 2024.

Anyone have an idea? It feels like the data format is wrong, but I'm not sure.

noctarius avatar Aug 13 '24 16:08 noctarius

Tried tag v1.0.1 and it still happens. Also tried updating all OS packages, but still nothing. The parameter "image_size" doesn't exist in Profile.update(...), at least not in the referenced dlio_benchmark commits 🤔

noctarius avatar Aug 14 '24 11:08 noctarius

diff --git a/dlio_benchmark/utils/utility.py b/dlio_benchmark/utils/utility.py
index 8872f2e..267ba19 100644
--- a/dlio_benchmark/utils/utility.py
+++ b/dlio_benchmark/utils/utility.py
@@ -49,7 +49,7 @@ except:
             return
         def __exit__(self, type, value, traceback):
             return
-        def update(self, *, epoch=0, step=0, size=0, default=None):
+        def update(self, *, epoch=0, step=0, size=0, default=None, image_size=0):
             return
     class dftracer(object):
         def __init__(self,):

That fixes the issue. Not sure whether "image_size" is supposed to be "size" or the other way around, but just adding the keyword (since this run isn't using a profiler) is the easiest fix.

noctarius avatar Aug 14 '24 11:08 noctarius

You can run pip install -r requirements.txt to fix the issue.

zhenghh04 avatar Aug 16 '24 15:08 zhenghh04

This is a current bug in DLIO when dftracer is not installed.

So when switching over to v1.0.1, please make sure to run pip install -r requirements.txt

zhenghh04 avatar Aug 16 '24 15:08 zhenghh04

For unet3d runs, should v1.0 or v1.0.1 be used before running pip install -r requirements.txt?

boni-weka avatar Aug 16 '24 15:08 boni-weka

This is moot (no longer relevant), so it is being closed.

FileSystemGuy avatar Jun 17 '25 21:06 FileSystemGuy