[Solved] Bug: socket.timeout: timed out. Server failed to respond to requests
Hello,
I have been experimenting with Triton Inference Server. I found that the server sometimes fails to respond to requests. The client keeps raising socket.timeout: timed out, even though I catch InferenceServerException and resend the request up to 10 times.
What can I do to fix this?
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cwman/work/repos/jumper/services/triton/pc_det/dispatcher/1/model.py", line 61, in infer_mp
    processor.run_mp_triton(thid, numworkers)
  File "/home/cwman/work/repos/jumper/jumper/processor/pc_det_processor/core.py", line 558, in run_mp_triton
    err_code = self.process_single(scan_stamp)
  File "/home/cwman/work/repos/jumper/jumper/processor/pc_det_processor/core.py", line 518, in process_single
    scan_data, boxes_lidar, label_boxes, predicted_scores = self.process_single_core(scan_stamp)
  File "/home/cwman/work/repos/jumper/jumper/processor/pc_det_processor/core.py", line 410, in process_single_core
    boxes_lidar, label_boxes, predicted_scores = self._model.model_infer(scan_data)
  File "/home/cwman/work/repos/jumper/services/triton/pc_det/dispatcher/1/model.py", line 47, in model_infer
    outs = super().request([pts[None], tta])['OUTPUT']
  File "/home/cwman/work/repos/jumper/services/triton/client_base.py", line 60, in request
    response = self.infer(self.model_name,
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 1414, in infer
    response = self._post(request_uri=request_uri,
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 309, in _post
    response = self._client_stub.post(request_uri=request_uri,
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 272, in post
    return self.request(METHOD_POST, request_uri, body=body, headers=headers)
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 253, in request
    response = HTTPSocketPoolResponse(sock, self._connection_pool,
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/response.py", line 298, in __init__
    super(HTTPSocketPoolResponse, self).__init__(sock, **kw)
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/response.py", line 170, in __init__
    self._read_headers()
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/response.py", line 190, in _read_headers
    data = self._sock.recv(self.block_size)
  File "/usr/local/lib/python3.8/dist-packages/gevent/_socketcommon.py", line 663, in recv
    self._wait(self._read_event)
  File "src/gevent/_hub_primitives.py", line 317, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 322, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 313, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 314, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 55, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_waiter.py", line 154, in gevent._gevent_c_waiter.Waiter.get
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
socket.timeout: timed out
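One detail the traceback shows: the error surfaces as socket.timeout raised directly by gevent's socket layer, which is not a subclass of tritonclient's InferenceServerException, so a retry loop that only catches the latter never fires. A minimal retry sketch that also catches the timeout (infer_fn is a hypothetical stand-in for the real client call):

```python
import socket


def infer_with_retry(infer_fn, max_retries=10):
    """Retry an inference call, also catching socket.timeout.

    socket.timeout is an OSError subclass raised by the gevent socket
    layer, so a handler that only catches InferenceServerException
    never sees it. `infer_fn` is a zero-argument placeholder for the
    real tritonclient call.
    """
    last_exc = None
    for _ in range(max_retries):
        try:
            return infer_fn()
        except socket.timeout as exc:  # not an InferenceServerException
            last_exc = exc
    raise last_exc


# Usage with a stand-in that times out twice, then succeeds:
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise socket.timeout("timed out")
    return "OUTPUT"

print(infer_with_retry(flaky))  # -> OUTPUT
```

Whether retrying actually helps depends on why the server stops responding; this only shows how to catch the exception the traceback ends in.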
Can you reproduce the failure with the gRPC client? cc @jbkyang-nvi
I implemented only the HTTP client. I can provide the config.pbtxt for the model:
name: "main"
backend: "python"
max_batch_size: 1
input [
  {
    name: "POINTS"
    data_type: TYPE_FP32
    dims: [ -1, 4 ]
  } #,
  # {
  #   name: "TTA"
  #   data_type: TYPE_BOOL
  #   dims: [ 1 ]
  # }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ -1, 9 ]
  }
]
# Specify GPU instance.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
Here is how I send a request:
import os
from typing import List, Union

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm
from tritonclient.utils import np_to_triton_dtype, triton_to_np_dtype


class BaseHttpClient(httpclient.InferenceServerClient):

    def __init__(self, url, model_name,
                 verbose=False,
                 concurrency=1,
                 connection_timeout=600.0,
                 network_timeout=600.0,
                 max_greenlets=None,
                 ssl=False,
                 ssl_options=None,
                 ssl_context_factory=None,
                 insecure=False):
        super().__init__(
            url,
            verbose=verbose,
            concurrency=concurrency,
            connection_timeout=connection_timeout,
            network_timeout=network_timeout,
            max_greenlets=max_greenlets,
            ssl=ssl,
            ssl_options=ssl_options,
            ssl_context_factory=ssl_context_factory,
            insecure=insecure)
        self.model_name = model_name
        self.model_meta = self.get_model_metadata(self.model_name)

    def request(self, inp_data: List[np.ndarray]):
        """request inputs to model server.

        Args:
            inp_data (List[np.ndarray]): list of batch inputs.
        """
        inputs = []
        for inp, inp_meta in zip(inp_data, self.model_meta['inputs']):
            infer_inp = httpclient.InferInput(
                inp_meta['name'], inp.shape, inp_meta['datatype'])
            infer_inp.set_data_from_numpy(inp.astype(
                triton_to_np_dtype(inp_meta['datatype'])))
            inputs.append(infer_inp)
        outputs = [httpclient.InferRequestedOutput(out_meta['name'])
                   for out_meta in self.model_meta['outputs']]
        response = self.infer(self.model_name,
                              inputs,
                              request_id=str(1),
                              outputs=outputs)
        result = response.get_response()
        return {res['name']: response.as_numpy(res['name'])
                for res in result['outputs']}

    def shm_request(self, inp_data: List[np.ndarray]):
        """request inputs by shm to model server.

        Args:
            inp_data (List[np.ndarray]): list of batch inputs.
        """
        inputs = []
        shm_inp_handles = []
        for ninp, (inp, inp_meta) in enumerate(
                zip(inp_data, self.model_meta['inputs'])):
            shm_name = "{}_inputdata_{}".format(self.model_name, ninp)
            shm_key = "{}_input_{}".format(self.model_name, ninp)
            input_byte_size = inp.size * inp.itemsize
            self.unregister_system_shared_memory()
            self.unregister_cuda_shared_memory()
            # Create Input in Shared Memory and store shared memory handles
            shm_handle = shm.create_shared_memory_region(
                shm_name, shm_key, input_byte_size)
            # Put input data values into shared memory
            shm.set_shared_memory_region(shm_handle, [inp])
            # Register Input shared memory with Triton Server
            self.register_system_shared_memory(
                shm_name, shm_key, input_byte_size)
            # Set the parameters to use data from shared memory
            infer_inp = httpclient.InferInput(
                inp_meta['name'], inp.shape, inp_meta['datatype'])
            infer_inp.set_shared_memory(shm_name, input_byte_size)
            inputs.append(infer_inp)
            shm_inp_handles.append(shm_handle)
        # TODO shm outputs
        outputs = [httpclient.InferRequestedOutput(out_meta['name'])
                   for out_meta in self.model_meta['outputs']]
        response = self.infer(self.model_name,
                              inputs,
                              request_id=str(1),
                              outputs=outputs)
        result = response.get_response()
        return {res['name']: response.as_numpy(res['name'])
                for res in result['outputs']}
The inference call is made like this:
outs = client.request([pts[None], tta])['OUTPUT']
This is solved by catching the timeout exception in the client infer stage, and instantiating a new client to redo the infer. The hanging client seems unable to proceed by itself.
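A minimal sketch of this workaround; make_client and run_infer are hypothetical placeholders for the real client factory and inference call, not tritonclient API:

```python
import socket


def infer_with_fresh_client(make_client, run_infer, max_retries=3):
    """On a timeout, discard the (possibly stuck) client and build a new one.

    make_client: zero-argument factory, e.g.
        lambda: BaseHttpClient('localhost:8000', 'main')
    run_infer: callable taking a client and performing one inference.
    Both names are illustrative placeholders.
    """
    client = make_client()
    last_exc = None
    for _ in range(max_retries):
        try:
            return run_infer(client)
        except socket.timeout as exc:
            last_exc = exc
            client = make_client()  # the hanging client cannot recover on its own
    raise last_exc
```

Note that later comments in this thread report that recreating the client did not fix the problem for everyone, so treat this as a mitigation attempt rather than a root-cause fix.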
This is solved by catching the timeout exception in the client infer stage, and instantiating a new client to redo the infer. The hanging client seems unable to proceed by itself.
What do you mean by instantiating a new client? I tried to catch the exception and create a new client with triton_client = httpclient.InferenceServerClient(url=FLAGS.url, verbose=FLAGS.verbose, concurrency=concurrency), but it still gets the error. Am I doing something wrong?
This is solved by catching the timeout exception in the client infer stage, and instantiating a new client to redo the infer. The hanging client seems unable to proceed by itself.
After extensive tests, reinstantiating a client does not solve the problem.
Oh! Have you found a way to solve this permanently?
Hi! any updates?
Hi @NLCharles @LittleKai29 and @GabrieleFerrario can you provide the model, the Triton version and the server log while this happens?
Does this happen on all models? Or a specific model you use?
Hi @jbkyang-nvi, in my case the client initially works correctly.
this is the log of the client:
2022-08-03 09:55:31,169 main.py INFO: Reading Video Stream
2022-08-03 09:55:31,259 main.py INFO: FPS Video: 59
2022-08-03 09:55:39,574 client_triton.py INFO: Client Triton HTTP
2022-08-03 09:55:39,662 main.py INFO: Start Loop
2022-08-03 09:55:40,038 client_triton.py INFO: Inference Triton! --> first request to triton server
2022-08-03 09:55:40,705 client_triton.py INFO: Inference time triton: 0.6665s --> time required for triton_client.infer() in the first iteration
2022-08-03 10:02:29,848 main.py INFO: Time preprocess video: 152.38513016700745s
2022-08-03 10:04:03,252 violence_detection_model.py INFO: Inference time second model: 32.424s
2022-08-03 10:04:03,254 main.py INFO: Result second model: (0.73)
2022-08-03 10:04:06,073 client_triton.py INFO: Inference Triton! --> second request to triton server
Traceback (most recent call last):
File "main.py", line 397, in
but after the Triton inference I use another model (an ONNX model not served by Triton) that requires more time (152s + 32s in this case for the first execution), and on the second Triton client request I get the error in the client. This is the log of the server:
2022-08-03 10:28:57.712186: W tensorflow/core/common_runtime/bfc_allocator.cc:311] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
Before the previous log, the Triton server prints this log many times:
2022-08-03 10:28:57.016308: W tensorflow/core/common_runtime/bfc_allocator.cc:425] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.59GiB (rounded to 2785755136). If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
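The TF warning itself suggests a mitigation: switching TensorFlow's GPU allocator via an environment variable set when the server process is launched. A sketch, assuming the server is started directly (the model-repository path is illustrative):

```shell
# Switch TensorFlow's GPU allocator to the async CUDA allocator,
# as suggested by the bfc_allocator warning. /models is illustrative.
TF_GPU_ALLOCATOR=cuda_malloc_async tritonserver --model-repository=/models
```

This addresses fragmentation, not a genuine shortage of GPU memory, so it may only help when the device is near but not over its limit.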
For the Triton server I followed this guide: https://developer.ridgerun.com/wiki/index.php/Tritonserver_support_for_NVIDIA_Jetson_Platforms and the version should be 2.16.
The model that I use in the model repository is FasterRCNN-InceptionV2; see this link: https://developer.nvidia.com/blog/deploying-models-from-tensorflow-model-zoo-using-deepstream-and-triton-inference-server/
I don't think it is a memory problem, because I tried only the first part, which uses the Triton client to do inference, and it works correctly. Maybe it is because so much time passes before the second inference that the client goes into a timeout? Or something else related to timing, because in the first iteration everything works correctly.
Thanks for the support!
@GabrieleFerrario the server should not hang if the GPU OOMs; however, this line in the server log:
2022-08-03 10:28:57.016308: W tensorflow/core/common_runtime/bfc_allocator.cc:425] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.59GiB (rounded to 2785755136).
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
indicates there is a backend issue. Can you:
- Try running the first request multiple times, to see whether this is related to the second request being too large or to the ONNX model not deallocating memory in time for the second TF request.
- Try running this script on a different machine and see if the issue persists?
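For the first suggestion, a small harness that replays the same request repeatedly and records per-request latencies can help separate "the second request is too large" from "something degrades over time" (run_infer is a hypothetical placeholder for the actual Triton client call):

```python
import time


def repeat_requests(run_infer, n=20):
    """Replay one request n times and return per-request latencies.

    run_infer: zero-argument callable performing a single inference,
    e.g. lambda: client.request([...]) -- a placeholder, not real API.
    """
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        run_infer()
        timings.append(time.perf_counter() - start)
    return timings
```

If the first request always succeeds and later identical requests hang, the problem is likely resource exhaustion over time rather than the shape of any single request.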
Hi, @jbkyang-nvi
could it be that the ONNX model does not deallocate the memory in time for the second TF request? How can I free it? I tried running this script on a different machine and it works correctly. Thanks!
@GabrieleFerrario can you try to isolate the issue by running just your script, without the ONNX model, while providing the same request inputs? With respect to making ONNX release the GPU memory, you can search the ONNX GitHub to find the right answer for that.
Hi @jbkyang-nvi, ok I will try and let you know as soon as possible. Thanks for the support!
Hi, @GabrieleFerrario I have run into the same issue. Could you share how to solve this problem?
Hi @zhanghaoie, in my case the problem was that my NVIDIA device was saturated with processes that had to run in parallel, and I had to fix the handling of those processes. So, as we said above, you have to manage the resources of your device.
Closing this issue since it seems to be resolved
After reviewing the previous discussion, it seemed that no solution had been provided. However, upon further investigation, I discovered that the network_timeout parameter was what I was looking for; adjusting it solved my issue.
Example:
triton_client = httpclient.InferenceServerClient(
    url='0.0.0.0:8000',
    verbose=True,
    network_timeout=600.0,    # this works when inference takes too long (CPU inference)
    connection_timeout=600.0  # this alone won't work
)