k8s-device-plugin
MPS server error: "Failed to start : invalid argument"
1. Quick Debug Information
- OS/Version: CentOS Linux 8
- Kernel Version: 4.18.0
- Container Runtime Type/Version: Docker 20.10.24
- K8s Version: v1.21.6
- GPU: Tesla P4
- GPU Driver Version: 535.129.03
2. Issue or feature description
I deployed k8s-device-plugin version 0.15.0 in Kubernetes. Running matrixMul in a container fails with the following error:
[Matrix Multiply Using CUDA] - Starting...
CUDA error at ../../common/inc/helper_cuda.h:708 code=30(cudaErrorUnknown) "cudaGetDeviceCount(&device_count)"
I also found the following message in the output of dmesg -T:
Cannot map memory with base addr 0x2019c00000 and size of 0x200 pages
The mps-control-daemon log shows the server being restarted repeatedly:
[2024-04-29 11:22:10.421 Control 73] Starting new server 95 for user 0
[2024-04-29 11:22:10.425 Control 73] Accepting connection...
[2024-04-29 11:22:10.441 Control 73] Server encountered a fatal exception. Shutting down
[2024-04-29 11:22:10.446 Control 73] Server 95 exited with status 1
[2024-04-29 11:22:10.447 Control 73] Starting new server 98 for user 0
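The restart pattern in the log above can be checked mechanically. The following is only a minimal sketch, assuming the control log keeps the "[date time Control pid] message" layout shown here; the function name summarize_control_log and the SAMPLE string are made up for illustration and are not part of any NVIDIA tool.

```python
import re

# Assumption: every control-daemon line looks like
# "[<date> <time> Control <pid>] <message>", as in the excerpt above.
LINE_RE = re.compile(r"^\[(\S+ \S+) Control (\d+)\] (.*)$")
START_RE = re.compile(r"Starting new server (\d+) for user (\d+)")
EXIT_RE = re.compile(r"Server (\d+) exited with status (\d+)")


def summarize_control_log(text):
    """Collect server starts and non-zero exits from a control log (hypothetical helper)."""
    starts, failed = [], []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        msg = m.group(3)
        started = START_RE.match(msg)
        exited = EXIT_RE.match(msg)
        if started:
            starts.append(int(started.group(1)))
        elif exited and int(exited.group(2)) != 0:
            failed.append(int(exited.group(1)))
    # A failed server followed by another start suggests a crash loop.
    return {
        "starts": starts,
        "failed": failed,
        "restart_loop": len(starts) > 1 and bool(failed),
    }


# The log lines from this report, used as sample input.
SAMPLE = """\
[2024-04-29 11:22:10.421 Control 73] Starting new server 95 for user 0
[2024-04-29 11:22:10.425 Control 73] Accepting connection...
[2024-04-29 11:22:10.441 Control 73] Server encountered a fatal exception. Shutting down
[2024-04-29 11:22:10.446 Control 73] Server 95 exited with status 1
[2024-04-29 11:22:10.447 Control 73] Starting new server 98 for user 0
"""

if __name__ == "__main__":
    print(summarize_control_log(SAMPLE))
```

On the excerpt above this reports server 95 as failed and a second start (server 98), which matches the loop described in this report.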
The nvidia-cuda-mps-server log shows:
Other 425] Startup
Other 425] Connecting to control daemon on socket: /mps/nvidia.com/gpu.shared/pipe/control
Other 425] Initializing server process
Legacy Server 425] Failed to start : invalid argument
Output of rpm -qa | grep nvidia:
libnvidia-container-tools-1.14.3-1.x86_64
libnvidia-container1-1.14.3-1.x86_64
nvidia-container-runtime-3.14.0-1.noarch
pcp-pmda-nvidia-gpu-5.0.2-5.el8.x86_64
nvidia-container-toolkit-1.14.3-1.x86_64
nvidia-container-toolkit-base-1.14.3-1.x86_64