Volta MPS Server error "Receive command failed, assuming client exit. Client process disconnected"

Open amybachir opened this issue 4 years ago • 1 comments

I'm using aws-virtual-gpu-device-plugin, a solution built on top of the Multi-Process Service (MPS) that exposes an arbitrary number of virtual GPUs on GPU nodes in a Kubernetes cluster.

Occasionally, and seemingly at random, the MPS server container (nvidia/mps) starts logging the following error:

Size of /dev/shm: 16594362368 bytes
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE: 10
Available GPUs:	- 0, 00000000:00:1E.0, Tesla T4, Exclusive_Process
Starting NVIDIA MPS control daemon...
[2021-12-19 16:22:21.847 Control     1] Starting control daemon using socket /tmp/nvidia-mps/control
[2021-12-19 16:22:21.847 Control     1] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
[2021-12-19 16:26:44.071 Control     1] Accepting connection...
[2021-12-19 16:26:44.071 Control     1] User did not send valid credentials
[2021-12-19 16:26:44.071 Control     1] Accepting connection...
[2021-12-19 16:26:44.071 Control     1] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2021-12-19 16:26:44.071 Control     1] Starting new server 17 for user 1000
[2021-12-19 16:26:44.075 Other    17] Start
[2021-12-19 16:26:44.076 Other    17] MPS Server connecting to control daemon on socket: /tmp/nvidia-mps/control
[2021-12-19 16:26:44.076 Control     1] Accepting connection...
[2021-12-19 16:26:44.114 Other    17] Volta MPS: Creating server context on device 0
[2021-12-19 16:26:44.183 Control     1] NEW SERVER 17: Ready
[2021-12-19 16:26:44.183 Other    17] Active Threads Percentage set to 10.0
[2021-12-19 16:26:44.183 Other    17] MPS Server is started
[2021-12-19 16:26:44.183 Other    17] Volta MPS Server: Received new client request
[2021-12-19 16:26:44.183 Other    17] MPS Server: worker created
[2021-12-19 16:26:44.183 Other    17] Volta MPS: Creating worker thread
[2021-12-19 16:26:55.862 Control     1] Accepting connection...
[2021-12-19 16:26:55.862 Control     1] NEW CLIENT 0 from user 1000: Server already exists
[2021-12-19 16:26:55.863 Other    17] Volta MPS Server: Received new client request
[2021-12-19 16:26:55.863 Other    17] MPS Server: worker created
[2021-12-19 16:26:55.863 Other    17] Volta MPS: Creating worker thread
[2021-12-19 16:26:55.863 Other    17] Volta MPS: Device Tesla T4 (uuid 0xc45844d0-0x8f56d7fe-0xc10cdf37-0xed721b20) is associated
[2022-01-04 09:54:19.995 Other    17] Receive command failed, assuming client exit
[2022-01-04 09:54:19.995 Other    17] Volta MPS: Client disconnected. Number of active client contexts is 0.
[2022-01-04 09:54:20.011 Other    17] Receive command failed, assuming client exit
[2022-01-04 09:54:20.012 Other    17] Volta MPS: Client process disconnected
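
To see how long a client had been attached before the failure, the log above can be scanned programmatically. The following is a minimal sketch, not part of any NVIDIA tooling: the line pattern is inferred only from the log lines shown in this issue, and the names `LINE_RE`, `parse_mps_log`, and `find_disconnects` are hypothetical helpers introduced for illustration.

```python
import re
from datetime import datetime

# Assumed line shape, inferred from the log in this issue:
# [2022-01-04 09:54:19.995 Other    17] Receive command failed, assuming client exit
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) (\w+)\s+(\d+)\] (.*)$"
)


def parse_mps_log(text):
    """Return (timestamp, component, ident, message) for each recognized line."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            events.append((stamp, match.group(2), int(match.group(3)), match.group(4)))
    return events


def find_disconnects(events):
    """Keep only events whose message reports a client exit or disconnect."""
    return [
        event for event in events
        if "assuming client exit" in event[3] or "disconnected" in event[3].lower()
    ]


# Three lines copied from the log in this issue.
SAMPLE = """\
[2021-12-19 16:26:55.863 Other    17] Volta MPS: Device Tesla T4 (uuid 0xc45844d0-0x8f56d7fe-0xc10cdf37-0xed721b20) is associated
[2022-01-04 09:54:19.995 Other    17] Receive command failed, assuming client exit
[2022-01-04 09:54:19.995 Other    17] Volta MPS: Client disconnected. Number of active client contexts is 0.
"""

events = parse_mps_log(SAMPLE)
disconnects = find_disconnects(events)
# Time between the last successful device association and the first failure.
gap = disconnects[0][0] - events[0][0]
print(f"{len(disconnects)} disconnect events; first one after {gap}")
```

For this log, the gap shows the client had been attached for more than two weeks before the server reported the disconnect, which may help correlate the failure with other events on the node.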

The client application has the following error in its log at the same time:

RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

The issue is consistently fixed when I restart both the MPS server and the client app, but I haven't been able to identify the root cause or find a permanent solution.

In the client app, I'm using nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04. The MPS server container is nvidia/mps:latest.

Jan 24 '22 16:01 amybachir

Were there other GPU processes running on your GPU device before you got the error? @amybachir

May 19 '22 05:05 austingg
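
One way to answer the question above is to list the compute processes currently holding the GPU. The sketch below assumes `nvidia-smi` is on the PATH and is run on the GPU node itself; from inside a container the process list may be incomplete because of PID namespace isolation. `parse_compute_apps` and `list_compute_apps` are hypothetical helper names, not part of any NVIDIA library.

```python
import csv
import subprocess

# nvidia-smi query for compute processes, one CSV row per process.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Turn rows like '1234, python, 1024 MiB' into dictionaries."""
    rows = csv.reader(csv_text.splitlines(), skipinitialspace=True)
    return [
        {"pid": int(row[0]), "name": row[1], "used_memory": row[2]}
        for row in rows
        if len(row) == 3
    ]


def list_compute_apps():
    """Run nvidia-smi and return the parsed list of compute processes."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)


if __name__ == "__main__":
    try:
        for app in list_compute_apps():
            print(app["pid"], app["name"], app["used_memory"])
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # Assumption: no usable nvidia-smi here, so only report the problem.
        print(f"could not query nvidia-smi: {exc}")
```

Since the GPU in the log is in Exclusive_Process mode, a process listed here that is neither the MPS server nor one of its clients would be a candidate explanation for the "all CUDA-capable devices are busy or unavailable" error.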