Error when running TensorFlow on multiple nodes
Hi, I got an error when I ran TensorFlow on multiple nodes.
This is the command I ran:
python launch_benchmark.py \
  --verbose \
  --model-name=resnet50v1_5 \
  --precision=fp32 \
  --mode=training \
  --framework tensorflow \
  --noinstall \
  --checkpoint=/home/mount_dir/hys/modles/checkpoints \
  --data-location=/home/mount_dir/wj/ImageNet/data/tf_images \
  --mpi_hostnames='c1,head' \
  --mpi_num_processes=4 2>&1
This is the error encountered:
SOCKET_ID: -1
MODEL_NAME: resnet50v1_5
MODE: training
PRECISION: fp32
BATCH_SIZE: -1
NUM_CORES: -1
BENCHMARK_ONLY: True
ACCURACY_ONLY: False
OUTPUT_RESULTS: False
DISABLE_TCMALLOC: True
TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD: 2147483648
NOINSTALL: True
OUTPUT_DIR: /home/mount_dir/hys/models/benchmarks/common/tensorflow/logs
MPI_NUM_PROCESSES: 4
MPI_NUM_PEOCESSES_PER_SOCKET: 1
MPI_HOSTNAMES: c1,head
NUMA_CORES_PER_INSTANCE: None
PYTHON_EXE: /opt/intel/oneapi/tensorflow/2.2.0/bin/python
PYTHONPATH:
DRY_RUN:
/bin/sh: numactl: command not found
[mpiexec@head] match_arg (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:91): unrecognized argument x
[mpiexec@head] Similar arguments:
[mpiexec@head] demux
[mpiexec@head] s
[mpiexec@head] n
[mpiexec@head] enable-x
[mpiexec@head] f
[mpiexec@head] HYD_arg_parse_array (../../../../../src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
[mpiexec@head] mpiexec_get_parameters (../../../../../src/pm/i_hydra/mpiexec/mpiexec_params.c:1350): error parsing input array
[mpiexec@head] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1755): error parsing parameters
num_inter_threads: 1
num_intra_threads: 26
Received these standard args: Namespace(accuracy_only=False, backbone_model=None, batch_size=64, benchmark_dir='/home/mount_dir/hys/models/benchmarks', benchmark_only=True, checkpoint='/home/mount_dir/hys/modles/checkpoints', data_location='/home/mount_dir/wj/ImageNet/data/tf_images', data_num_inter_threads=None, data_num_intra_threads=None, disable_tcmalloc=True, epochsbtwevals=1, experimental_gelu=False, framework='tensorflow', input_graph=None, intelai_models='/home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5', mode='training', model_args=[], model_name='resnet50v1_5', model_source_dir=None, mpi=None, mpi_hostnames=None, num_cores=-1, num_instances=1, num_inter_threads=1, num_intra_threads=26, num_mpi=1, num_train_steps=1, numa_cores_per_instance=None, optimized_softmax=True, output_dir='/home/mount_dir/hys/models/benchmarks/common/tensorflow/logs', output_results=False, precision='fp32', socket_id=-1, steps=112590, tcmalloc_large_alloc_report_threshold=2147483648, tf_serving_version='master', trainepochs=72, use_case='image_recognition', verbose=True)
Received these custom args: []
Current directory: /home/mount_dir/hys/models/benchmarks
Running: mpirun -x LD_LIBRARY_PATH -x PYTHONPATH --allow-run-as-root -n 4 -H c1:2,head:2 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 --bind-to none --map-by slot /opt/intel/oneapi/tensorflow/2.2.0/bin/python /home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5/training/mlperf_resnet/imagenet_main.py 2 --batch_size=64 --max_train_steps=112590 --train_epochs=72 --epochs_between_evals=1 --inter_op_parallelism_threads 1 --intra_op_parallelism_threads 26 --version 1 --resnet_size 50 --data_dir=/home/mount_dir/wj/ImageNet/data/tf_images --model_dir=/home/mount_dir/hys/modles/checkpoints
PYTHONPATH: :/home/mount_dir/hys/models/benchmarks/../models/common/tensorflow:/home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5:/home/mount_dir/hys/models/benchmarks:/home/mount_dir/hys/models/benchmarks
RUNCMD: /opt/intel/oneapi/tensorflow/2.2.0/bin/python common/tensorflow/run_tf_benchmark.py --framework=tensorflow --use-case=image_recognition --model-name=resnet50v1_5 --precision=fp32 --mode=training --benchmark-dir=/home/mount_dir/hys/models/benchmarks --intelai-models=/home/mount_dir/hys/models/benchmarks/../models/image_recognition/tensorflow/resnet50v1_5 --num-cores=-1 --batch-size=-1 --socket-id=-1 --output-dir=/home/mount_dir/hys/models/benchmarks/common/tensorflow/logs --num-train-steps=1 --benchmark-only --verbose --checkpoint=/home/mount_dir/hys/modles/checkpoints --data-location=/home/mount_dir/wj/ImageNet/data/tf_images --disable-tcmalloc=True
Log file location: /home/mount_dir/hys/models/benchmarks/common/tensorflow/logs/benchmark_resnet50v1_5_training_fp32_20210514_163202.log
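A possible way to narrow this down, offered as a sketch rather than a confirmed diagnosis: the log shows two separate complaints. First, `numactl` is not found on the PATH. Second, the generated `mpirun` command uses `-x`, `--allow-run-as-root` and `-mca`, which are Open MPI options, while the `[mpiexec@head]` messages appear to come from a Hydra-based launcher that rejects `-x`. The Python snippet below checks both on a node. The function names `classify_mpi` and `check_environment` are hypothetical, and the version-string matching is a heuristic that assumes `mpirun --version` prints an identifying banner.

```python
import shutil
import subprocess


def classify_mpi(version_text):
    """Guess the MPI flavor from `mpirun --version` output (heuristic only)."""
    text = version_text.lower()
    if "open mpi" in text or "open-mpi" in text:
        return "openmpi"
    if "intel(r) mpi" in text:
        return "intelmpi"
    if "hydra" in text or "mpich" in text:
        return "mpich"
    return "unknown"


def check_environment():
    """Report where numactl and mpirun resolve, and which MPI mpirun looks like."""
    report = {
        "numactl": shutil.which("numactl"),  # None means not on PATH
        "mpirun": shutil.which("mpirun"),
    }
    if report["mpirun"] is None:
        report["mpi_flavor"] = "missing"
        return report
    try:
        proc = subprocess.run(
            [report["mpirun"], "--version"],
            capture_output=True,
            text=True,
            timeout=30,
        )
        # Some launchers print the banner on stderr, so inspect both streams.
        report["mpi_flavor"] = classify_mpi(proc.stdout + proc.stderr)
    except (OSError, subprocess.SubprocessError):
        report["mpi_flavor"] = "unknown"
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Running this on each host in `--mpi_hostnames` (here `c1` and `head`) would show whether `numactl` is installed and whether the first `mpirun` on the PATH is Open MPI. If it reports something other than `openmpi`, the Open MPI-style flags in the generated command would likely be rejected in the way the log shows.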
@houyushan do you still need assistance on this issue?