mps server error Failed to start : invalid argument

Open aphrodite1028 opened this issue 1 year ago • 0 comments

1. Quick Debug Information

OS/Version: CentOS Linux 8
Kernel Version: 4.18.0
Container Runtime Type/Version (Docker): 20.10.24
K8s Version: v1.21.6
GPU: Tesla P4
GPU Driver Version: 535.129.03

2. Issue or feature description

I deployed k8s-device-plugin version 0.15.0 in Kubernetes. Running matrixMul in a container fails with the following error:

[Matrix Multiply Using CUDA] - Starting...
CUDA error at ../../common/inc/helper_cuda.h:708 code=30(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"

dmesg -T shows a message like this:

Cannot map memory with base addr 0x2019c00000 and size of 0x200 pages
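For anyone triaging the same kernel message, here is a minimal sketch that decodes the hexadecimal fields. The 4 KiB page size is an assumption (the usual x86_64 default); the message itself does not state which page size the driver means, and the `decode` helper is hypothetical.

```python
import re

# Line copied from the dmesg output above.
DMESG_LINE = "Cannot map memory with base addr 0x2019c00000 and size of 0x200 pages"

# Assumption: 4 KiB pages. The message does not say which page size applies.
PAGE_SIZE = 4096


def decode(line, page_size=PAGE_SIZE):
    """Return (base_address, page_count, size_in_bytes), or None if no match."""
    m = re.search(
        r"base addr (0x[0-9a-fA-F]+) and size of (0x[0-9a-fA-F]+) pages", line
    )
    if m is None:
        return None
    base = int(m.group(1), 16)
    pages = int(m.group(2), 16)
    return base, pages, pages * page_size


base, pages, size_bytes = decode(DMESG_LINE)
print(f"base={base:#x} pages={pages} size={size_bytes // 1024} KiB")
```

Under that assumption, the failed mapping covers 512 pages, or 2 MiB.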

The mps-control-daemon log shows:

[2024-04-29 11:22:10.421 Control    73] Starting new server 95 for user 0
[2024-04-29 11:22:10.425 Control    73] Accepting connection...
[2024-04-29 11:22:10.441 Control    73] Server encountered a fatal exception. Shutting down
[2024-04-29 11:22:10.446 Control    73] Server 95 exited with status 1
[2024-04-29 11:22:10.447 Control    73] Starting new server 98 for user 0
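The log above suggests a restart loop: server 95 exits with status 1 and server 98 is started right away. A minimal sketch for summarizing such a log follows. The regular expressions are derived only from the lines shown here, so other MPS versions may format these messages differently, and the `summarize` helper is hypothetical.

```python
import re

# Sample lines copied from the control daemon log above.
LOG = """\
[2024-04-29 11:22:10.421 Control    73] Starting new server 95 for user 0
[2024-04-29 11:22:10.425 Control    73] Accepting connection...
[2024-04-29 11:22:10.441 Control    73] Server encountered a fatal exception. Shutting down
[2024-04-29 11:22:10.446 Control    73] Server 95 exited with status 1
[2024-04-29 11:22:10.447 Control    73] Starting new server 98 for user 0
"""

# Patterns inferred from the sample lines only (an assumption, not a spec).
START_RE = re.compile(r"Starting new server (\d+) for user (\d+)")
EXIT_RE = re.compile(r"Server (\d+) exited with status (-?\d+)")


def summarize(log_text):
    """Return (server ids started in order, {server id: exit status})."""
    started = []
    exits = {}
    for line in log_text.splitlines():
        m = START_RE.search(line)
        if m:
            started.append(int(m.group(1)))
            continue
        m = EXIT_RE.search(line)
        if m:
            exits[int(m.group(1))] = int(m.group(2))
    return started, exits


started, exits = summarize(LOG)
failed = [sid for sid, status in exits.items() if status != 0]
print(f"started={started} failed={failed}")
```

Running this over a longer log gives a quick count of how many server processes were spawned and how many exited with a non-zero status.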

The nvidia-cuda-mps-server log shows:

Other 425] Startup
Other 425] Connecting to control daemon on socket: /mps/nvidia.com/gpu.shared/pipe/control
Other 425] Initializing server process
Legacy Server 425] Failed to start : invalid argument

Output of rpm -qa | grep nvidia:

libnvidia-container-tools-1.14.3-1.x86_64
libnvidia-container1-1.14.3-1.x86_64
nvidia-container-runtime-3.14.0-1.noarch
pcp-pmda-nvidia-gpu-5.0.2-5.el8.x86_64
nvidia-container-toolkit-1.14.3-1.x86_64
nvidia-container-toolkit-base-1.14.3-1.x86_64

Apr 29 '24 11:04 aphrodite1028