sajikvr
I used this workaround, which seems to work: `export PMIX_MCA_gds=hash`
@FileSystemGuy I thought the reason for adding the custom Accelerator distribution was to allow heterogeneous hosts, which gives more flexibility for testing. Isn't that the case?
@zhenghh04 Could not get the patch installed.
```
(myenv) nutanix@clientvm1:~$ pip install --upgrade \
    git+https://github.com/argonne-lcf/dlio_benchmark.git@bugfix/inhomogeneous_setup#egg=dlio_benchmark
Collecting dlio_benchmark
Cloning https://github.com/argonne-lcf/dlio_benchmark.git (to revision bugfix/inhomogeneous_setup) to /tmp/pip-install-ll2qfvrp/dlio-benchmark_96241176733f44e697bbc0241ba7fc41
Running command git clone --filter=blob:none --quiet...
```
Tried again and it looks like it is working. I was able to use 5 clients to host 19 GPUs: `mlpstorage training run --hosts 10.57.205.80:4,10.57.205.82:4,10.57.205.84:4,10.57.205.85:4,10.57.205.86:3 --model unet3d --data-dir /mnt/data --params reader.read_threads=10 dataset.num_files_train=70015 dataset.num_subfolders_train=200 checkpoint.checkpoint_folder=/mnt/data reader.odirect=true...`
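For anyone double-checking an uneven layout like this, here is a minimal sketch that tallies the per-host counts in the `--hosts` value from the command. It assumes each entry has the form `host:slots`, with `slots` being the number of accelerators placed on that host, which matches the 5 clients / 19 GPUs figure. `parse_hosts` is a hypothetical helper written for illustration; it is not part of mlpstorage or DLIO.

```python
def parse_hosts(spec: str) -> dict[str, int]:
    """Split a comma-separated 'host:slots' string into {host: slots}.

    Assumption: every entry carries an explicit ':slots' suffix, as in
    the command above. Entries without one are rejected, not guessed.
    """
    hosts: dict[str, int] = {}
    for entry in spec.split(","):
        host, sep, slots = entry.strip().partition(":")
        if not sep or not slots:
            raise ValueError(f"expected 'host:slots', got {entry!r}")
        hosts[host] = int(slots)
    return hosts


# The --hosts value from the run above.
hosts_spec = (
    "10.57.205.80:4,10.57.205.82:4,10.57.205.84:4,"
    "10.57.205.85:4,10.57.205.86:3"
)

layout = parse_hosts(hosts_spec)
total_accelerators = sum(layout.values())

print(len(layout), "hosts,", total_accelerators, "accelerators")
# prints: 5 hosts, 19 accelerators
```

The last host carries 3 slots while the others carry 4, which is the inhomogeneous case the patch is meant to handle.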