Unexpected keyword argument 'image_size' when running benchmark
Hey folks!
I saw that someone asked the same question yesterday on the mailing list, but nobody has answered, so I thought I'd bring it here since I'm running into the same issue.
When I try to run the benchmark, the process stops when trying to read the training data and complains about a "TypeError: Profile.update() got an unexpected keyword argument 'image_size'".
I generated the data with:
./benchmark.sh datagen --hosts <IP> --workload unet3d --accelerator-type a100 --num-parallel 8 --param dataset.num_files_train=3500 --param dataset.data_folder=unet3d_data
Running the benchmark then fails with:
HYDRA_FULL_ERROR=1 ./benchmark.sh run --hosts <IP> --workload unet3d --accelerator-type a100 --num-accelerators 1 --results-dir resultsdir --param dataset.num_files_train=3500 --param dataset.data_folder=unet3d_data
[INFO] 2024-08-13T16:49:38.250049 Running DLIO with 1 process(es) [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py:100]
[INFO] Total amount of data each host will consume is 477.86366008222103 GB; each host has [30.648590087890625] GB memory [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:121]
[INFO] 2024-08-13T16:49:43.983959 Max steps per epoch: 500 = 1 * 3500 / 7 / 1 (samples per file * num files / batch size / comm size) [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py:321]
[INFO] 2024-08-13T16:49:43.998867 Starting epoch 1: 500 steps expected [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:192]
[INFO] 2024-08-13T16:49:44.009325 Starting block 1 [/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/utils/statscounter.py:264]
Error executing job with overrides: ['workload=unet3d_a100', '++workload.workflow.generate_data=False', '++workload.workflow.train=True', '++workload.dataset.num_files_train=3500', '++workload.dataset.data_folder=unet3d_data', '++workload.workflow.profiling=False', '++workload.profiling.profiler=none']
Traceback (most recent call last):
File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 402, in <module>
main()
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
^^^^^^
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
^^^^^^^^^^
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
^^^^^^^^^^^^^^^^
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 397, in main
benchmark.run()
File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 343, in run
steps = self._train(epoch)
^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/main.py", line 263, in _train
for batch in dlp.iter(loader.next()):
File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/data_loader/torch_data_loader.py", line 174, in next
for batch in self._dataset:
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
return self._process_data(data)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
data.reraise()
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/_utils.py", line 706, in reraise
raise exception
TypeError: Caught TypeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
data = fetcher.fetch(index) # type: ignore[possibly-undefined]
^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/benchmark/venv/lib/python3.12/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
~~~~~~~~~~~~^^^^^
File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/data_loader/torch_data_loader.py", line 84, in __getitem__
return self.reader.read_index(image_idx, step)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/reader/npz_reader.py", line 57, in read_index
return super().read_index(image_idx, step)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/reader/reader_handler.py", line 114, in read_index
self.get_sample(filename, sample_index)
File "/home/ubuntu/benchmark/storage/dlio_benchmark/dlio_benchmark/reader/npz_reader.py", line 48, in get_sample
dlp.update(image_size=image.nbytes)
TypeError: Profile.update() got an unexpected keyword argument 'image_size'
Running the benchmark in a venv with Python 3.12.3, Ubuntu 24.04 LTS, on kernel 6.8.0-1009-aws #9-Ubuntu SMP Fri May 17 14:39:23 UTC 2024.
Does anyone have an idea? It feels like the data format is wrong, but I'm not sure.
I tried tag v1.0.1 and it still happens. I also tried updating all OS packages, but that changed nothing. The parameter "image_size" doesn't exist in Profile.update(...), at least not in the referenced dlio_benchmark commits 🤔
diff --git a/dlio_benchmark/utils/utility.py b/dlio_benchmark/utils/utility.py
index 8872f2e..267ba19 100644
--- a/dlio_benchmark/utils/utility.py
+++ b/dlio_benchmark/utils/utility.py
@@ -49,7 +49,7 @@ except:
return
def __exit__(self, type, value, traceback):
return
- def update(self, *, epoch=0, step=0, size=0, default=None):
+ def update(self, *, epoch=0, step=0, size=0, default=None, image_size=0):
return
class dftracer(object):
def __init__(self,):
The patch above fixes the issue. I'm not sure whether "image_size" is supposed to be "size" or the other way around, but just adding the parameter (since the test isn't using a profiler) is the easiest fix.
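For anyone who wants to avoid patching the signature every time a new keyword shows up, here is a minimal sketch of a no-op stand-in whose update() swallows arbitrary keyword arguments. The class name NoOpProfile is hypothetical and this is not the actual DLIO fallback class; it only mirrors the two calls visible in the traceback (dlp.update(...) and dlp.iter(...)).

```python
class NoOpProfile:
    """Hypothetical no-op profiler stub (a sketch, not DLIO's real class)."""

    def __init__(self, *args, **kwargs):
        # Ignore whatever configuration the caller passes in.
        pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Returning None lets any exception propagate unchanged.
        return None

    def update(self, **kwargs):
        # Accept and discard any metadata: epoch, step, size, image_size, ...
        return None

    def iter(self, iterable):
        # Pass items through untouched so loops like
        # `for batch in dlp.iter(loader.next())` keep working.
        for item in iterable:
            yield item


dlp = NoOpProfile()
dlp.update(image_size=1024)          # no TypeError
dlp.update(epoch=1, step=2, size=3)  # existing keywords still work
print(list(dlp.iter([1, 2, 3])))
```

The trade-off is that a misspelled keyword would be silently ignored, which is acceptable for a stub that does nothing anyway.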
You can run pip install -r requirements.txt to fix the issue.
This is currently a bug in DLIO that shows up when dftracer is not installed.
So when switching over to 1.0.1, please make sure to run pip install -r requirements.txt.
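A quick way to check whether the tracer is visible inside the active venv is sketched below. The module name "dftracer" is an assumption taken from the class name in the diff above; adjust it if the package installs under a different import name. The helper is_importable is hypothetical and only handles top-level module names.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "dftracer" is an assumed import name, not confirmed by this thread.
    name = "dftracer"
    if is_importable(name):
        print(f"{name} is importable")
    else:
        print(f"{name} not found; try: pip install -r requirements.txt")
```

Run it with the same Python interpreter that the benchmark venv uses, otherwise the result says nothing about the benchmark environment.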
For unet3d runs, should 1.0 or 1.0.1 be used before running pip install -r requirements.txt?
This is moot (no longer relevant) so is being closed.