Volta MPS Server error "Receive command failed, assuming client exit. Client process disconnected"
I'm using aws-virtual-gpu-device-plugin, a solution built on top of Multi-Process Service (MPS) that exposes an arbitrary number of virtual GPUs on GPU nodes in a Kubernetes cluster.
Intermittently, and with no pattern I can identify, the MPS server container nvidia/mps starts logging the following error:
Size of /dev/shm: 16594362368 bytes
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE: 10
Available GPUs: - 0, 00000000:00:1E.0, Tesla T4, Exclusive_Process
Starting NVIDIA MPS control daemon...
[2021-12-19 16:22:21.847 Control 1] Starting control daemon using socket /tmp/nvidia-mps/control
[2021-12-19 16:22:21.847 Control 1] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
[2021-12-19 16:26:44.071 Control 1] Accepting connection...
[2021-12-19 16:26:44.071 Control 1] User did not send valid credentials
[2021-12-19 16:26:44.071 Control 1] Accepting connection...
[2021-12-19 16:26:44.071 Control 1] NEW CLIENT 0 from user 1000: Server is not ready, push client to pending list
[2021-12-19 16:26:44.071 Control 1] Starting new server 17 for user 1000
[2021-12-19 16:26:44.075 Other 17] Start
[2021-12-19 16:26:44.076 Other 17] MPS Server connecting to control daemon on socket: /tmp/nvidia-mps/control
[2021-12-19 16:26:44.076 Control 1] Accepting connection...
[2021-12-19 16:26:44.114 Other 17] Volta MPS: Creating server context on device 0
[2021-12-19 16:26:44.183 Control 1] NEW SERVER 17: Ready
[2021-12-19 16:26:44.183 Other 17] Active Threads Percentage set to 10.0
[2021-12-19 16:26:44.183 Other 17] MPS Server is started
[2021-12-19 16:26:44.183 Other 17] Volta MPS Server: Received new client request
[2021-12-19 16:26:44.183 Other 17] MPS Server: worker created
[2021-12-19 16:26:44.183 Other 17] Volta MPS: Creating worker thread
[2021-12-19 16:26:55.862 Control 1] Accepting connection...
[2021-12-19 16:26:55.862 Control 1] NEW CLIENT 0 from user 1000: Server already exists
[2021-12-19 16:26:55.863 Other 17] Volta MPS Server: Received new client request
[2021-12-19 16:26:55.863 Other 17] MPS Server: worker created
[2021-12-19 16:26:55.863 Other 17] Volta MPS: Creating worker thread
[2021-12-19 16:26:55.863 Other 17] Volta MPS: Device Tesla T4 (uuid 0xc45844d0-0x8f56d7fe-0xc10cdf37-0xed721b20) is associated
[2022-01-04 09:54:19.995 Other 17] Receive command failed, assuming client exit
[2022-01-04 09:54:19.995 Other 17] Volta MPS: Client disconnected. Number of active client contexts is 0.
[2022-01-04 09:54:20.011 Other 17] Receive command failed, assuming client exit
[2022-01-04 09:54:20.012 Other 17] Volta MPS: Client process disconnected
The client application has the following error in its log at the same time:
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
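To line up the client error with the server-side disconnects, something like the following could pull the relevant timestamps out of the MPS log. This is only a sketch: the regular expression is inferred from the log lines pasted above, and the names `LOG_LINE`, `find_disconnects`, and `ident` are my own, not part of any NVIDIA tooling.

```python
import re
from datetime import datetime

# Pattern inferred from the pasted log, e.g.
# [2022-01-04 09:54:19.995 Other 17] Receive command failed, assuming client exit
# "ident" is the number after Control/Other (assumed to identify the process).
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<source>\w+)\s+(?P<ident>\d+)\] (?P<msg>.*)$"
)


def find_disconnects(log_text, needle="Receive command failed"):
    """Return (timestamp, ident, message) for each log line containing `needle`."""
    events = []
    for line in log_text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m and needle in m.group("msg"):
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, int(m.group("ident")), m.group("msg")))
    return events


# Sample lines copied from the log in the question.
sample = """\
[2021-12-19 16:26:55.863 Other 17] Volta MPS: Creating worker thread
[2022-01-04 09:54:19.995 Other 17] Receive command failed, assuming client exit
[2022-01-04 09:54:19.995 Other 17] Volta MPS: Client disconnected. Number of active client contexts is 0.
[2022-01-04 09:54:20.011 Other 17] Receive command failed, assuming client exit
"""

for ts, ident, msg in find_disconnects(sample):
    print(ts.isoformat(), ident, msg)
```

Comparing these timestamps with the time of the `RuntimeError` in the client log should show whether the client fails before or after the server reports the disconnect.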
The issue is consistently fixed when I restart both the MPS server and the client app, but I haven't been able to identify the root cause or find a permanent solution.
In the client app, I'm using nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu18.04
The MPS server container is nvidia/mps:latest
Was there another GPU process running on your GPU device before you got the error? @amybachir
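One way to check this on the node is to ask `nvidia-smi` for the current compute processes. The snippet below is a rough sketch, assuming `nvidia-smi` is on the PATH of wherever you run it; the helper names `parse_compute_apps` and `list_compute_apps` are made up for illustration. The GPU is in Exclusive_Process mode according to the log, so a process holding the device outside of MPS might explain the "busy or unavailable" error, but that is a guess until you see the process list.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into a list of dicts."""
    apps = []
    for line in csv_text.splitlines():
        line = line.strip()
        if line.count(",") < 2:
            continue  # skip blank or malformed rows
        # pid is the first field and memory the last; the name is in between.
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        if not pid.strip().isdigit():
            continue
        apps.append(
            {"pid": int(pid), "name": name.strip(), "used_memory": used.strip()}
        )
    return apps


def list_compute_apps():
    """Run nvidia-smi and return the compute processes it reports."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,process_name,used_memory",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_compute_apps(result.stdout)
```

Calling `list_compute_apps()` on the GPU node just before or right after the failure would show whether anything other than the MPS server and its clients is attached to the device.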