Prakash Chandra comments

Results: 15 comments of Prakash Chandra

Workloads keep in hang state except cuda-sample:vectoradd under MPS mode

Hi Team, why are my pods getting a multiprocessor count of 4 instead of 40? The issue was observed after upgrading our Amazon EKS cluster to version 1.26. We are utilizing NVIDIA Tesla...

Workloads keep in hang state except cuda-sample:vectoradd under MPS mode

@elezar I reduced the count to 2: **sharing: mps: resources: - name: nvidia.com/gpu replicas: 2**. Then I get a count of 20. ![image](https://github.com/NVIDIA/k8s-device-plugin/assets/95615399/ccb813c5-0a82-4c41-8e25-e305b64d0fc1) But I want SM to be...

Workloads keep in hang state except cuda-sample:vectoradd under MPS mode

@elezar I want to run multiple pods (approx. 8) on 1 GPU, so I am using MPS for that purpose. I understand your answer, that will provide me with 40 SM...

Workloads keep in hang state except cuda-sample:vectoradd under MPS mode

@klueska I want full memory and compute access across all pods. I have a g4dn.2xlarge instance with the following config: ![image](https://github.com/NVIDIA/k8s-device-plugin/assets/95615399/71d8e68e-6191-4147-847f-93e3dc7db46c) I want my 8 workloads to access the memory as...

MPS use error: Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!

@elezar Could you please help here? I am not able to configure the MPS sharing option. Here is the output of `kubectl logs nvidia-device-plugin-daemonset-4p742 -c mps-control-daemon-ctr -n kube-system`: I0517 04:07:02.596152 1...

MPS use error: Failed to allocate device vector A (error code all CUDA-capable devices are busy or unavailable)!

@channel Could anyone please give some advice here?

Errors in nv-hostengine log

I am also facing the same issue, where the logs show errors. When I change the tag of the DCGM image to latest (nvcr.io/nvidia/k8s/dcgm-exporter:latest), I see the...

cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string

@elezar I am using version `0.15.0`. I need to set replicas to 1 so that I can have full resource access of the GPU node. My config looks like this...

cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string

@klueska I provisioned an Optimised EKS GPU node (g4dn.2xlarge) with 1 GPU, configured as follows: ![image](https://github.com/NVIDIA/k8s-device-plugin/assets/95615399/453d945e-6d9a-407a-920b-f314b9c4bb06) In order to have my workloads/pods scheduled on it, I created the daemonset...

cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string

@elezar @klueska Although things didn't work from the Helm configuration, I was able to figure out the solution. I tweaked the value of `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` to 100 so that my full GPU...