GRPC error on worker node if you sequentially submit multiple training commands
2020-01-12 19:54:43.426092: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-01-12 19:54:43.432883: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
2020-01-12 19:54:43.433201: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a60590 executing computations on platform Host. Devices:
2020-01-12 19:54:43.433221: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/elasticdl/python/worker/main.py", line 39, in <module>
Can you give more detailed repro steps? How do you sequentially submit multiple training commands?
Let's take this snippet from scripts/client_test.sh as an example:
elasticdl train \
  --image_base=elasticdl:ci \
  --model_zoo=model_zoo \
  --model_def=deepfm_functional_api.deepfm_functional_api.custom_model \
  --training_data=/data/frappe/train \
  --validation_data=/data/frappe/test \
  --num_epochs=1 \
  --master_resource_request="cpu=0.2,memory=1024Mi" \
  --master_resource_limit="cpu=1,memory=2048Mi" \
  --worker_resource_request="cpu=0.4,memory=2048Mi" \
  --worker_resource_limit="cpu=1,memory=3072Mi" \
  --ps_resource_request="cpu=0.2,memory=1024Mi" \
  --ps_resource_limit="cpu=1,memory=2048Mi" \
  --minibatch_size=64 \
  --num_minibatches_per_task=2 \
  --num_workers=$WORKER_NUM \
  --num_ps_pods=$PS_NUM \
  --checkpoint_steps=500 \
  --evaluation_steps=500 \
  --tensorboard_log_dir=/tmp/tensorboard-log \
  --grads_to_wait=1 \
  --use_async=True \
  --job_name=test-train \
  --log_level=INFO \
  --image_pull_policy=Never \
  --output=/saved_model/model_output \
  --volume="host_path=${PWD},mount_path=/saved_model"
I can paste this command into a shell script (say, test.sh), change the job_name, and run 'nohup test.sh &' repeatedly. This submits multiple training jobs one after another.
Can you still repro this issue with the current ElasticDL version?