[Solved] Bug: socket.timeout: timed out. Server failed to respond to requests
Hello,
I have been experimenting with Triton Inference Server. I found that the server sometimes fails to respond to requests. The client keeps raising socket.timeout: timed out, even though I catch InferenceServerException and resend the request up to 10 times.
What can I do to fix this?
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/cwman/work/repos/jumper/services/triton/pc_det/dispatcher/1/model.py", line 61, in infer_mp
    processor.run_mp_triton(thid, numworkers)
  File "/home/cwman/work/repos/jumper/jumper/processor/pc_det_processor/core.py", line 558, in run_mp_triton
    err_code = self.process_single(scan_stamp)
  File "/home/cwman/work/repos/jumper/jumper/processor/pc_det_processor/core.py", line 518, in process_single
    scan_data, boxes_lidar, label_boxes, predicted_scores = self.process_single_core(scan_stamp)
  File "/home/cwman/work/repos/jumper/jumper/processor/pc_det_processor/core.py", line 410, in process_single_core
    boxes_lidar, label_boxes, predicted_scores = self._model.model_infer(scan_data)
  File "/home/cwman/work/repos/jumper/services/triton/pc_det/dispatcher/1/model.py", line 47, in model_infer
    outs = super().request([pts[None], tta])['OUTPUT']
  File "/home/cwman/work/repos/jumper/services/triton/client_base.py", line 60, in request
    response = self.infer(self.model_name,
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 1414, in infer
    response = self._post(request_uri=request_uri,
  File "/usr/local/lib/python3.8/dist-packages/tritonclient/http/__init__.py", line 309, in _post
    response = self._client_stub.post(request_uri=request_uri,
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 272, in post
    return self.request(METHOD_POST, request_uri, body=body, headers=headers)
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/client.py", line 253, in request
    response = HTTPSocketPoolResponse(sock, self._connection_pool,
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/response.py", line 298, in __init__
    super(HTTPSocketPoolResponse, self).__init__(sock, **kw)
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/response.py", line 170, in __init__
    self._read_headers()
  File "/usr/local/lib/python3.8/dist-packages/geventhttpclient/response.py", line 190, in _read_headers
    data = self._sock.recv(self.block_size)
  File "/usr/local/lib/python3.8/dist-packages/gevent/_socketcommon.py", line 663, in recv
    self._wait(self._read_event)
  File "src/gevent/_hub_primitives.py", line 317, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 322, in gevent._gevent_c_hub_primitives.wait_on_socket
  File "src/gevent/_hub_primitives.py", line 313, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 314, in gevent._gevent_c_hub_primitives._primitive_wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 46, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_hub_primitives.py", line 55, in gevent._gevent_c_hub_primitives.WaitOperationsGreenlet.wait
  File "src/gevent/_waiter.py", line 154, in gevent._gevent_c_waiter.Waiter.get
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
socket.timeout: timed out
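One detail the traceback shows: the error surfaces as socket.timeout raised directly by gevent's socket layer, which is not a subclass of tritonclient's InferenceServerException, so a retry loop that only catches the latter never fires. A minimal retry sketch that also catches the timeout (infer_fn is a hypothetical stand-in for the real client call):

```python
import socket


def infer_with_retry(infer_fn, max_retries=10):
    """Retry an inference call, also catching socket.timeout.

    socket.timeout is an OSError subclass raised by the gevent socket
    layer, so a handler that only catches InferenceServerException
    never sees it. `infer_fn` is a zero-argument placeholder for the
    real tritonclient call.
    """
    last_exc = None
    for _ in range(max_retries):
        try:
            return infer_fn()
        except socket.timeout as exc:  # not an InferenceServerException
            last_exc = exc
    raise last_exc


# Usage with a stand-in that times out twice, then succeeds:
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise socket.timeout("timed out")
    return "OUTPUT"

print(infer_with_retry(flaky))  # -> OUTPUT
```

Whether retrying actually helps depends on why the server stops responding; this only shows how to catch the exception the traceback ends in.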
Can you reproduce the failure with the gRPC client? cc @jbkyang-nvi
I implemented only the HTTP client. I can provide the config.pbtxt for the model:
name: "main"
backend: "python"
max_batch_size: 1
input [
  {
    name: "POINTS"
    data_type: TYPE_FP32
    dims: [ -1, 4 ]
  } #,
  # {
  #   name: "TTA"
  #   data_type: TYPE_BOOL
  #   dims: [ 1 ]
  # }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ -1, 9 ]
  }
]
# Specify GPU instance.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
Here is how I send a request:
import os
from typing import List, Union

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm
from tritonclient.utils import np_to_triton_dtype, triton_to_np_dtype


class BaseHttpClient(httpclient.InferenceServerClient):

    def __init__(self, url, model_name,
                 verbose=False,
                 concurrency=1,
                 connection_timeout=600.0,
                 network_timeout=600.0,
                 max_greenlets=None,
                 ssl=False,
                 ssl_options=None,
                 ssl_context_factory=None,
                 insecure=False):
        super().__init__(
            url,
            verbose=verbose,
            concurrency=concurrency,
            connection_timeout=connection_timeout,
            network_timeout=network_timeout,
            max_greenlets=max_greenlets,
            ssl=ssl,
            ssl_options=ssl_options,
            ssl_context_factory=ssl_context_factory,
            insecure=insecure)
        self.model_name = model_name
        self.model_meta = self.get_model_metadata(self.model_name)

    def request(self, inp_data: List[np.ndarray]):
        """request inputs to model server.

        Args:
            inp_data (List[np.ndarray]): list of batch inputs.
        """
        inputs = []
        for inp, inp_meta in zip(inp_data, self.model_meta['inputs']):
            infer_inp = httpclient.InferInput(
                inp_meta['name'], inp.shape, inp_meta['datatype'])
            infer_inp.set_data_from_numpy(inp.astype(
                triton_to_np_dtype(inp_meta['datatype'])))
            inputs.append(infer_inp)
        outputs = [httpclient.InferRequestedOutput(out_meta['name'])
                   for out_meta in self.model_meta['outputs']]
        response = self.infer(self.model_name,
                              inputs,
                              request_id=str(1),
                              outputs=outputs)
        result = response.get_response()
        return {res['name']: response.as_numpy(res['name'])
                for res in result['outputs']}

    def shm_request(self, inp_data: List[np.ndarray]):
        """request inputs by shm to model server.

        Args:
            inp_data (List[np.ndarray]): list of batch inputs.
        """
        inputs = []
        shm_inp_handles = []
        for ninp, (inp, inp_meta) in enumerate(
                zip(inp_data, self.model_meta['inputs'])):
            shm_name = "{}_inputdata_{}".format(self.model_name, ninp)
            shm_key = "{}_input_{}".format(self.model_name, ninp)
            input_byte_size = inp.size * inp.itemsize
            self.unregister_system_shared_memory()
            self.unregister_cuda_shared_memory()
            # Create Input in Shared Memory and store shared memory handles
            shm_handle = shm.create_shared_memory_region(
                shm_name, shm_key, input_byte_size)
            # Put input data values into shared memory
            shm.set_shared_memory_region(shm_handle, [inp])
            # Register Input shared memory with Triton Server
            self.register_system_shared_memory(
                shm_name, shm_key, input_byte_size)
            # Set the parameters to use data from shared memory
            infer_inp = httpclient.InferInput(
                inp_meta['name'], inp.shape, inp_meta['datatype'])
            infer_inp.set_shared_memory(shm_name, input_byte_size)
            inputs.append(infer_inp)
            shm_inp_handles.append(shm_handle)
        # TODO shm outputs
        outputs = [httpclient.InferRequestedOutput(out_meta['name'])
                   for out_meta in self.model_meta['outputs']]
        response = self.infer(self.model_name,
                              inputs,
                              request_id=str(1),
                              outputs=outputs)
        result = response.get_response()
        return {res['name']: response.as_numpy(res['name'])
                for res in result['outputs']}
The inference call is made like this:
outs = client.request([pts[None], tta])['OUTPUT']
This is solved by catching the timeout exception in the client infer stage, and instantiating a new client to redo the infer. The hanging client seems unable to proceed by itself.
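A minimal sketch of this workaround; make_client and run_infer are hypothetical placeholders for the real client factory and inference call, not tritonclient API:

```python
import socket


def infer_with_fresh_client(make_client, run_infer, max_retries=3):
    """On a timeout, discard the (possibly stuck) client and build a new one.

    make_client: zero-argument factory, e.g.
        lambda: BaseHttpClient('localhost:8000', 'main')
    run_infer: callable taking a client and performing one inference.
    Both names are illustrative placeholders.
    """
    client = make_client()
    last_exc = None
    for _ in range(max_retries):
        try:
            return run_infer(client)
        except socket.timeout as exc:
            last_exc = exc
            client = make_client()  # the hanging client cannot recover on its own
    raise last_exc
```

Note that later comments in this thread report that recreating the client did not fix the problem for everyone, so treat this as a mitigation attempt rather than a root-cause fix.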
This is solved by catching the timeout exception in the client infer stage, and instantiating a new client to redo the infer. The hanging client seems unable to proceed by itself.
What do you mean by instantiating a new client? I tried to catch the exception and create a new client with triton_client = httpclient.InferenceServerClient(url=FLAGS.url, verbose=FLAGS.verbose, concurrency=concurrency), but it still gets the error. Am I doing something wrong?
This is solved by catching the timeout exception in the client infer stage, and instantiating a new client to redo the infer. The hanging client seems unable to proceed by itself.
After extensive tests, reinstantiating a client does not solve the problem.
Oh! Have you found a way to solve this permanently?
Hi! any updates?
Hi @NLCharles @LittleKai29 and @GabrieleFerrario can you provide the model, the Triton version and the server log while this happens?
Does this happen on all models? Or a specific model you use?
Hi @jbkyang-nvi, in my case the client initially works correctly.
this is the log of the client:
2022-08-03 09:55:31,169 main.py INFO: Reading Video Stream
2022-08-03 09:55:31,259 main.py INFO: FPS Video: 59
2022-08-03 09:55:39,574 client_triton.py INFO: Client Triton HTTP
2022-08-03 09:55:39,662 main.py INFO: Start Loop
2022-08-03 09:55:40,038 client_triton.py INFO: Inference Triton! --> first request to triton server
2022-08-03 09:55:40,705 client_triton.py INFO: Inference time triton: 0.6665s --> time required for triton_client.infer() in the first iteration
2022-08-03 10:02:29,848 main.py INFO: Time preprocess video: 152.38513016700745s
2022-08-03 10:04:03,252 violence_detection_model.py INFO: Inference time second model: 32.424s
2022-08-03 10:04:03,254 main.py INFO: Result second model: (0.73)
2022-08-03 10:04:06,073 client_triton.py INFO: Inference Triton! --> second request to triton server
Traceback (most recent call last):
File "main.py", line 397, in
but after the Triton inference I use another model (an ONNX model not served by Triton) that requires more time (152s + 32s in this case for the first execution), and on the second Triton client request I get the error in the client. This is the log of the server:
2022-08-03 10:28:57.712186: W tensorflow/core/common_runtime/bfc_allocator.cc:311] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
Before the previous log, the Triton server prints this log many times:
2022-08-03 10:28:57.016308: W tensorflow/core/common_runtime/bfc_allocator.cc:425] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.59GiB (rounded to 2785755136). If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
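The TF warning itself suggests a mitigation: switching TensorFlow's GPU allocator via an environment variable set when the server process is launched. A sketch, assuming the server is started directly (the model-repository path is illustrative):

```shell
# Switch TensorFlow's GPU allocator to the async CUDA allocator,
# as suggested by the bfc_allocator warning. /models is illustrative.
TF_GPU_ALLOCATOR=cuda_malloc_async tritonserver --model-repository=/models
```

This addresses fragmentation, not a genuine shortage of GPU memory, so it may only help when the device is near but not over its limit.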
For the Triton server I followed this guide: https://developer.ridgerun.com/wiki/index.php/Tritonserver_support_for_NVIDIA_Jetson_Platforms and the version should be 2.16.
The model that I use in the model repository is FasterRCNN-InceptionV2; see this link: https://developer.nvidia.com/blog/deploying-models-from-tensorflow-model-zoo-using-deepstream-and-triton-inference-server/
I don't think it is a memory problem, because I tried only the first part, which uses the Triton client to do inference, and it works correctly. Maybe it is because so much time passes before the second inference that the client goes into a timeout? Or something else related to timing, because in the first iteration everything works correctly.
Thanks for the support!
@GabrieleFerrario the server should not hang if the GPU OOMs; however, this line in the server log:
2022-08-03 10:28:57.016308: W tensorflow/core/common_runtime/bfc_allocator.cc:425] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.59GiB (rounded to 2785755136).
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
indicates there is a backend issue. Can you:
- Try running the first request multiple times, to see whether this is related to the second request being too large or to the ONNX model not deallocating memory in time for the second TF request.
- Try running this script on a different machine and see if the issue persists?
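For the first suggestion, a small harness that replays the same request repeatedly and records per-request latencies can help separate "the second request is too large" from "something degrades over time" (run_infer is a hypothetical placeholder for the actual Triton client call):

```python
import time


def repeat_requests(run_infer, n=20):
    """Replay one request n times and return per-request latencies.

    run_infer: zero-argument callable performing a single inference,
    e.g. lambda: client.request([...]) -- a placeholder, not real API.
    """
    timings = []
    for _ in range(n):
        start = time.perf_counter()
        run_infer()
        timings.append(time.perf_counter() - start)
    return timings
```

If the first request always succeeds and later identical requests hang, the problem is likely resource exhaustion over time rather than the shape of any single request.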
Hi, @jbkyang-nvi
could it be that the ONNX model does not deallocate the memory in time for the second TF request? How can I free it? I tried running this script on a different machine and it works correctly. Thanks!
@GabrieleFerrario can you try to isolate the issue by running just your script, without the ONNX model, while providing the same request inputs? With respect to making ONNX release the GPU memory, you can search the ONNX GitHub to find the right answer for that.
Hi @jbkyang-nvi, ok I will try and let you know as soon as possible. Thanks for the support!
Hi, @GabrieleFerrario I have run into the same issue. Could you share how to solve this problem?
Hi @zhanghaoie, in my case the problem was that my NVIDIA device was saturated with processes that had to run in parallel, and I had to fix the handling of those processes. So, as we said above, you have to manage the resources of your device.
Closing this issue since it seems to be resolved
After reviewing the previous discussion, it seemed that no solution had been provided. However, upon further investigation, I discovered that the network_timeout parameter was what I was looking for; adjusting it solved my issue.
Example:
triton_client = httpclient.InferenceServerClient(
    url='0.0.0.0:8000',
    verbose=True,
    network_timeout=600.0,    # this works when inference takes too long (CPU inference)
    connection_timeout=600.0  # this alone won't work
)