Unable to reach local dispatcher
I have two machines: one with a GPU at IP address 10.42.0.1, and a remote CPU worker at IP address 192.168.1.136.
The remote machine is running:
import tensorflow as tf
d_config = tf.data.experimental.service.DispatcherConfig(port=5000)
dispatcher = tf.data.experimental.service.DispatchServer(d_config)
w_port = 5001
w_config = tf.data.experimental.service.WorkerConfig(
    dispatcher_address=dispatcher.target.split("://")[1],
    worker_address="192.168.1.136" + ":" + str(w_port),
    port=w_port)
worker = tf.data.experimental.service.WorkerServer(w_config)
dispatcher.join()
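For comparison, here is a minimal sketch of the same startup where the dispatcher address is written out explicitly instead of being taken from dispatcher.target (which, as the remote log below shows, resolves to localhost:5000); the constant names are mine:

import tensorflow as tf

# Assumed reachable IP of this remote CPU machine (from the setup above).
REMOTE_IP = "192.168.1.136"
DISPATCHER_PORT = 5000
WORKER_PORT = 5001

# Start the dispatcher on this machine.
d_config = tf.data.experimental.service.DispatcherConfig(port=DISPATCHER_PORT)
dispatcher = tf.data.experimental.service.DispatchServer(d_config)

# Register the worker under the externally reachable address, so the dispatcher
# and any remote client all see a single routable host:port string.
w_config = tf.data.experimental.service.WorkerConfig(
    dispatcher_address=f"{REMOTE_IP}:{DISPATCHER_PORT}",
    worker_address=f"{REMOTE_IP}:{WORKER_PORT}",
    port=WORKER_PORT)
worker = tf.data.experimental.service.WorkerServer(w_config)

dispatcher.join()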
The local machine then runs:
python eval_app_runner.py ctc_asr_app.py /home/haolan/FastFlow/examples/ ff /home/haolan/FastFlow/examples/default_config.yaml --gpu_type=single
with default_config.yaml being:
dispatcher_addr: 192.168.1.136
dispatcher_port: 5000
num_profile_steps: 10
num_initial_steps: 5
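As an aside, a throwaway way to check basic TCP reachability of the configured dispatcher from the GPU machine (this snippet is mine, not part of FastFlow):

import socket

# Address and port taken from default_config.yaml above.
addr, port = "192.168.1.136", 5000
with socket.create_connection((addr, port), timeout=5):
    print(f"dispatcher at {addr}:{port} accepts TCP connections")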
However, at a certain point it fails to reach the local dispatcher (which, from the address list, appears to be itself), with the following error message:
2023-12-21 18:12:30.910036: I tensorflow/core/data/service/grpc_util.cc:68] Failed to check service version: UNAVAILABLE: Failed to get dispatcher version from dispatcher running at 10.42.0.1 172.17.37.106 10.12.146.252 172.17.0.1 192.168.1.125 100.104.160.22 172.22.2.2 fd97:8600:8edd:0:215d:6f4d:96a3:ab0c fd97:8600:8edd:0:e105:6cf9:5b33:c950 fd97:8600:8edd:0:d376:e947:bac6:dc12 fd97:8600:8edd:0:c705:b135:faa7:b989 fd97:8600:8edd:0:6805:9ee2:f28a:c366 fd97:8600:8edd:0:3430:95e4:3bf:8a50 fd97:8600:8edd:0:e173:3808:e1a7:630b fd97:8600:8edd::15b fd97:8600:8edd:0:58c2:e087:eb4a:5b7 fd7a:115c:a1e0:ab12:4843:cd96:6268:a016:5000: DNS resolution failed. Will retry in 158ms.
I'm not sure what causes this. Is it because the list of IP addresses wasn't reduced to a single one? If so, where should I look to fix it?
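Purely as a guess at where such a space-separated blob could come from (hypothetical, not taken from the FastFlow code): "hostname -I" prints every address of the host, and if that whole string ends up as the dispatcher host, gRPC has nothing it can resolve:

import subprocess

# Hypothetical illustration: "hostname -I" lists every address of the machine.
all_addrs = subprocess.check_output(["hostname", "-I"], text=True).strip()
print(all_addrs)            # e.g. "10.42.0.1 172.17.37.106 ... fd7a:..."
# Passing that whole string as a hostname fails DNS resolution; a single
# routable address would have to be picked instead.
single_addr = all_addrs.split()[0]
print(single_addr)          # e.g. "10.42.0.1"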
Full log of the local machine:
$ python eval_app_runner.py ctc_asr_app.py /home/haolan/FastFlow/examples/ ff /home/haolan/FastFlow/examples/default_config.yaml --gpu_type=single
Args: Namespace(app_file_path='ctc_asr_app.py', batch=1, data_prefix='/home/haolan/FastFlow/examples/', epochs=2, gpu_type=<GPUType.SINGLE: 'single'>, num_local_workers=1, offloading_type=<OffloadingType.FASTFLOW: 'ff'>, parallel=-1, yaml_path='/home/haolan/FastFlow/examples/default_config.yaml')
2023-12-21 18:10:42.553745: I tensorflow/core/data/service/dispatcher_impl.cc:192] Running with fault_tolerant_mode=False. The dispatcher will not be able to recover its state on restart.
2023-12-21 18:10:42.553759: I tensorflow/core/data/service/server_lib.cc:64] Started tf.data DispatchServer running at 0.0.0.0:5000
Launch local worker
2023-12-21 18:10:42.566467: I tensorflow/core/data/service/worker_impl.cc:150] Worker registered with dispatcher running at 10.42.0.1:5000
2023-12-21 18:10:42.566504: I tensorflow/core/data/service/server_lib.cc:64] Started tf.data WorkerServer running at 0.0.0.0:5001
Launch local worker
2023-12-21 18:10:42.572939: I tensorflow/core/data/service/worker_impl.cc:150] Worker registered with dispatcher running at 192.168.1.136:5000
2023-12-21 18:10:42.572975: I tensorflow/core/data/service/server_lib.cc:64] Started tf.data WorkerServer running at 0.0.0.0:5501
2023-12-21 18:10:42.609152: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-21 18:10:42.624489: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-21 18:10:43.134878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 43635 MB memory: -> device: 0, name: NVIDIA RTX A6000, pci bus id: 0000:01:00.0, compute capability: 8.6
2023-12-21 18:10:43.135153: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:939] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-12-21 18:10:43.135224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 43646 MB memory: -> device: 1, name: NVIDIA RTX A6000, pci bus id: 0000:02:00.0, compute capability: 8.6
The vocabulary is: ['', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'", '?', '!', ' '] (size =31)
Size of the training set: 11790
Size of the validation set: 1310
[build_model] input_spectrogram: KerasTensor(type_spec=TensorSpec(shape=(None, None, 193), dtype=tf.float32, name='DeepSpeech-2input'), name='DeepSpeech-2input', description="created by layer 'DeepSpeech-2input'")
() {'optimizer': <keras.optimizer_v2.adam.Adam object at 0x7f6da6d05580>, 'loss': <function CTCLoss at 0x7f6d383d6820>}
[build_model] input_spectrogram: KerasTensor(type_spec=TensorSpec(shape=(None, None, 193), dtype=tf.float32, name='DeepSpeech-2-copyinput'), name='DeepSpeech-2-copyinput', description="created by layer 'DeepSpeech-2-copyinput'")
()
{'optimizer': <keras.optimizer_v2.adam.Adam object at 0x7f6da6d05580>, 'loss': <function CTCLoss at 0x7f6d383d6820>}
() {'optimizer': <keras.optimizer_v2.adam.Adam object at 0x7f6da6d05580>, 'loss': <function CTCLoss at 0x7f6d383d6820>}
<WeakKeyDictionary at 0x7f6cf4e97fd0>
0. Dummy training
2023-12-21 18:10:50.192211: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8907
2023-12-21 18:10:51.967789: I tensorflow/stream_executor/cuda/cuda_blas.cc:1774] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
1/1 [==============================] - 11s 11s/step - loss: 1829.3364
Measure ProfileMetrics.LTHP
10/10 [==============================] - 47s 5s/step - loss: 526.4906
Measure ProfileMetrics.GTHP
A builder instance for a PrefechDataset is being created.
prefetch is being applied.
10/10 [==============================] - 4s 396ms/step - loss: 334.6863
Does this app have a cpu bottleneck? Yes
Measure ProfileMetrics.RTHP
A builder instance for a PrefechDataset is being created.
A builder instance for a PaddedBatchDataset is being created.
padded batch is being applied.
prefetch is being applied.
2023-12-21 18:11:48.122148: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:300] New iterator created 1 for job 0
2023-12-21 18:11:48.122181: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:325] Connecting to 192.168.1.136:5000 in FastFlowOffloadingFetch op
2023-12-21 18:11:48.225582: I tensorflow/core/data/service/worker_impl.cc:257] Received request to process task 4001
2023-12-21 18:11:48.226247: I tensorflow/core/data/service/worker_impl.cc:270] Began processing for task 4001 with processing mode sharding_policy: DYNAMIC
2023-12-21 18:11:48.238079: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:594] Starting FastFlowOp task thread manager
10/10 [==============================] - 21s 2s/step - loss: 314.9906
2023-12-21 18:12:08.963013: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:550] Cancel threads iterator 1 for job 3000
2023-12-21 18:12:08.963148: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:608] Task thread manager finished
2023-12-21 18:12:08.963159: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:609] Finished.. task size 2 finished_tasks: 0 num_local_request: 0 num_remote_request: 328 outstanding: 0 results: 0
2023-12-21 18:12:08.963249: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:304] Destroying data service dataset iterator 1 for job id 3000
2023-12-21 18:12:08.963259: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:550] Cancel threads iterator 1 for job 3000
Measure ProfileMetrics.RTHP_BATCH
A builder instance for a PrefechDataset is being created.
prefetch is being applied.
2023-12-21 18:12:09.114261: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:300] New iterator created 1 for job 0
2023-12-21 18:12:09.114274: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:325] Connecting to 192.168.1.136:5000 in FastFlowOffloadingFetch op
2023-12-21 18:12:09.192091: I tensorflow/core/data/service/worker_impl.cc:257] Received request to process task 4003
2023-12-21 18:12:09.192651: I tensorflow/core/data/service/worker_impl.cc:270] Began processing for task 4003 with processing mode sharding_policy: DYNAMIC
2023-12-21 18:12:09.205300: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:594] Starting FastFlowOp task thread manager
10/10 [==============================] - 21s 2s/step - loss: 309.4245
2023-12-21 18:12:30.330142: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:550] Cancel threads iterator 1 for job 3001
2023-12-21 18:12:30.330282: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:608] Task thread manager finished
2023-12-21 18:12:30.330294: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:609] Finished.. task size 2 finished_tasks: 0 num_local_request: 0 num_remote_request: 11 outstanding: 0 results: 0
2023-12-21 18:12:30.330481: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:304] Destroying data service dataset iterator 1 for job id 3001
2023-12-21 18:12:30.330494: I tensorflow/core/kernels/data/experimental/fastflow_offloading_fetch_op.cc:550] Cancel threads iterator 1 for job 3001
Measure ProfileMetrics.RTHP_MID
A builder instance for a PrefechDataset is being created.
A builder instance for a PaddedBatchDataset is being created.
A builder instance for a ParallelMapDataset is being created.
2023-12-21 18:12:30.910036: I tensorflow/core/data/service/grpc_util.cc:68] Failed to check service version: UNAVAILABLE: Failed to get dispatcher version from dispatcher running at 10.42.0.1 172.17.37.106 10.12.146.252 172.17.0.1 192.168.1.125 100.104.160.22 172.22.2.2 fd97:8600:8edd:0:215d:6f4d:96a3:ab0c fd97:8600:8edd:0:e105:6cf9:5b33:c950 fd97:8600:8edd:0:d376:e947:bac6:dc12 fd97:8600:8edd:0:c705:b135:faa7:b989 fd97:8600:8edd:0:6805:9ee2:f28a:c366 fd97:8600:8edd:0:3430:95e4:3bf:8a50 fd97:8600:8edd:0:e173:3808:e1a7:630b fd97:8600:8edd::15b fd97:8600:8edd:0:58c2:e087:eb4a:5b7 fd7a:115c:a1e0:ab12:4843:cd96:6268:a016:5000: DNS resolution failed. Will retry in 158ms.
2023-12-21 18:12:31.068586: I tensorflow/core/data/service/grpc_util.cc:68] Failed to check service version: UNAVAILABLE: Failed to get dispatcher version from dispatcher running at 10.42.0.1 172.17.37.106 10.12.146.252 172.17.0.1 192.168.1.125 100.104.160.22 172.22.2.2 fd97:8600:8edd:0:215d:6f4d:96a3:ab0c fd97:8600:8edd:0:e105:6cf9:5b33:c950 fd97:8600:8edd:0:d376:e947:bac6:dc12 fd97:8600:8edd:0:c705:b135:faa7:b989 fd97:8600:8edd:0:6805:9ee2:f28a:c366 fd97:8600:8edd:0:3430:95e4:3bf:8a50 fd97:8600:8edd:0:e173:3808:e1a7:630b fd97:8600:8edd::15b fd97:8600:8edd:0:58c2:e087:eb4a:5b7 fd7a:115c:a1e0:ab12:4843:cd96:6268:a016:5000: DNS resolution failed. Will retry in 230ms.
Full log of the remote machine:
2023-12-21 20:33:32.445696: I tensorflow/core/data/service/dispatcher_impl.cc:192] Running with fault_tolerant_mode=False. The dispatcher will not be able to recover its state on restart.
2023-12-21 20:33:32.445716: I tensorflow/core/data/service/server_lib.cc:64] Started tf.data DispatchServer running at 0.0.0.0:5000
['grpc', 'localhost:5000']
2023-12-21 20:33:32.447133: I tensorflow/core/data/service/worker_impl.cc:150] Worker registered with dispatcher running at localhost:5000
2023-12-21 20:33:32.447200: I tensorflow/core/data/service/server_lib.cc:64] Started tf.data WorkerServer running at 0.0.0.0:5001
2023-12-21 20:34:49.156332: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-21 20:34:49.199302: I tensorflow/core/data/service/dispatcher_impl.cc:822] Started assigning task 4000 to worker 192.168.1.136:5001
2023-12-21 20:34:49.207790: I tensorflow/core/data/service/worker_impl.cc:257] Received request to process task 4000
2023-12-21 20:34:49.209143: I tensorflow/core/data/service/worker_impl.cc:270] Began processing for task 4000 with processing mode sharding_policy: DYNAMIC
2023-12-21 20:34:49.209399: I tensorflow/core/data/service/dispatcher_impl.cc:849] Finished assigning task 4000 to worker 192.168.1.136:5001
2023-12-21 20:34:49.209630: I tensorflow/core/data/service/dispatcher_impl.cc:822] Started assigning task 4001 to worker 10.42.0.1:5501
2023-12-21 20:34:49.239816: I tensorflow/core/data/service/dispatcher_impl.cc:849] Finished assigning task 4001 to worker 10.42.0.1:5501
2023-12-21 20:35:10.186330: I tensorflow/core/data/service/dispatcher_impl.cc:822] Started assigning task 4002 to worker 192.168.1.136:5001
2023-12-21 20:35:10.191330: I tensorflow/core/data/service/worker_impl.cc:257] Received request to process task 4002
2023-12-21 20:35:10.193713: I tensorflow/core/data/service/worker_impl.cc:270] Began processing for task 4002 with processing mode sharding_policy: DYNAMIC
2023-12-21 20:35:10.194258: I tensorflow/core/data/service/dispatcher_impl.cc:849] Finished assigning task 4002 to worker 192.168.1.136:5001
2023-12-21 20:35:10.194532: I tensorflow/core/data/service/dispatcher_impl.cc:822] Started assigning task 4003 to worker 10.42.0.1:5501
2023-12-21 20:35:10.221530: I tensorflow/core/data/service/dispatcher_impl.cc:849] Finished assigning task 4003 to worker 10.42.0.1:5501