
The GPU is not registered when ECC is off

Open • tingweiwu opened this issue 6 years ago • 1 comment


1. Issue or feature description

The GPU is not registered with the device plugin when ECC is off.

2. Steps to reproduce the issue

ECC is off on the 5th card for an unknown reason (image attached in the original issue).

With the card in this state, the node reports an nvidia.com/gpu Capacity of 8 in k8s but an Allocatable of only 7:

Capacity:
 cpu:                72
 ephemeral-storage:  1146196168Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             525490100Ki
 nvidia.com/gpu:     8
 pods:               110
 rdma/hca:           1k
Allocatable:
 cpu:                71500m
 ephemeral-storage:  1050965677560
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             508028852Ki
 nvidia.com/gpu:     7
 pods:               110
 rdma/hca:           1k
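
The gap between Capacity and Allocatable can be checked programmatically. The following is a minimal sketch, assuming the text comes from `kubectl describe node` in the layout shown above. The helper name `gpu_counts` and the trimmed `SAMPLE` text are illustrative and not part of any Kubernetes tooling.

```python
def gpu_counts(describe_text, resource="nvidia.com/gpu"):
    """Return (capacity, allocatable) for one resource.

    Assumes the indented "Capacity:" / "Allocatable:" layout printed by
    `kubectl describe node`; gpu_counts is a hypothetical helper name.
    """
    counts = {}
    section = None
    for line in describe_text.splitlines():
        stripped = line.strip()
        if stripped in ("Capacity:", "Allocatable:"):
            section = stripped.rstrip(":")
            continue
        if line and not line[0].isspace():
            # Any other unindented line starts a different section.
            section = None
            continue
        if section and stripped.startswith(resource + ":"):
            counts[section] = int(stripped.split(":", 1)[1].strip())
    return counts.get("Capacity"), counts.get("Allocatable")


# Trimmed copy of the node output from this report.
SAMPLE = """Capacity:
 cpu:                72
 nvidia.com/gpu:     8
Allocatable:
 cpu:                71500m
 nvidia.com/gpu:     7
"""

capacity, allocatable = gpu_counts(SAMPLE)
print(f"GPUs missing from Allocatable: {capacity - allocatable}")
```

A non-zero difference only shows that a device is not being offered to pods. It does not say which card is affected or why.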

I ran this simple demo with ECC off, and the GPU card is still usable.

>>> import os
>>> os.environ["CUDA_VISIBLE_DEVICES"] = "5"
>>> import tensorflow as tf
>>> tf.test.gpu_device_name()

What are the considerations for not registering this card when ECC is off?
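
To confirm which cards report ECC as disabled, the per-GPU ECC mode can be read from `nvidia-smi`. The sketch below assumes the `ecc.mode.current` query field and the `csv,noheader` output format are available in the installed driver. The helper names and the `SAMPLE` text are hypothetical, and the sample is not output captured from the affected node.

```python
import subprocess


def parse_ecc_modes(csv_text):
    """Map GPU index -> current ECC mode from 'index, mode' CSV lines."""
    modes = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def query_ecc_modes():
    """Run nvidia-smi on the host and parse its per-GPU ECC mode.

    Assumes nvidia-smi is on PATH and supports ecc.mode.current.
    """
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,ecc.mode.current",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return parse_ecc_modes(result.stdout)


def gpus_with_ecc_off(modes):
    """Return the sorted indices whose ECC mode is reported as Disabled."""
    return sorted(i for i, mode in modes.items() if mode == "Disabled")


# Hypothetical output for an 8-GPU node with ECC disabled on index 5.
SAMPLE = "\n".join(
    f"{i}, {'Disabled' if i == 5 else 'Enabled'}" for i in range(8)
)

print(gpus_with_ecc_off(parse_ecc_modes(SAMPLE)))
```

On a real node, `gpus_with_ecc_off(query_ecc_modes())` would replace the sample. Comparing that list with the device plugin logs may show whether the unregistered card is the one with ECC off.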

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • [ ] The output of nvidia-smi -a on your host
  • [ ] Your docker configuration file (e.g: /etc/docker/daemon.json)
  • [ ] The k8s-device-plugin container logs
  • [ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • [ ] Docker version from docker version
  • [ ] Docker command, image and tag used
  • [ ] Kernel version from uname -a
  • [ ] Any relevant kernel output lines from dmesg
  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • [ ] NVIDIA container library version from nvidia-container-cli -V
  • [ ] NVIDIA container library logs (see troubleshooting)

tingweiwu, Jul 19 '19 09:07

@tingweiwu Sorry for the late response. Are you still facing this issue with the latest driver?

nvjmayo, Jul 20 '20 18:07

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot], Feb 29 '24 04:02

This issue was automatically closed due to inactivity.

github-actions[bot], Mar 31 '24 04:03