openfl
CudaDeviceMonitor fails when a nonexistent GPU device index is configured
Steps to reproduce
- Add a nonexistent GPU index to envoy_config.yaml:
  params:
    cuda_devices: [0,99]  # <= add the nonexistent index here
- run director
- run envoy
- run jupyterlab
- run experiment
- check the logs on the envoy
Expected behavior
No exceptions on the envoy.
Actual behavior
[14:13:44] ERROR Failed to get cuda device info: Invalid Argument. Check your cuda device monitor plugin. envoy.py:165
Traceback (most recent call last):
File "/home/dmitry/code/openfl/openfl/component/envoy/envoy.py", line 149, in _get_cuda_device_info
memory_total = self.cuda_device_monitor.get_device_memory_total(device_id)
File "/home/dmitry/code/openfl/openfl/plugins/processing_units_monitor/pynvml_monitor.py", line 29, in get_device_memory_total
handle = pynvml.nvmlDeviceGetHandleByIndex(index)
File "/home/dmitry/code/openfl/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 1655, in nvmlDeviceGetHandleByIndex
_nvmlCheckReturn(ret)
File "/home/dmitry/code/openfl/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.nvml.NVMLError_InvalidArgument: Invalid Argument
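
One possible direction for a fix, as a rough sketch only and not the actual OpenFL code: check the configured indices against the number of GPUs on the host before querying them, and warn about the ones that do not exist instead of letting NVML raise. The helper name `split_cuda_devices` is hypothetical. On a real host the device count would presumably come from `pynvml.nvmlDeviceGetCount()` after `pynvml.nvmlInit()`; it is passed in as a plain integer here so the sketch runs without a GPU.

```python
import logging
from typing import Iterable, List, Tuple

logger = logging.getLogger(__name__)


def split_cuda_devices(
    configured: Iterable[int], device_count: int
) -> Tuple[List[int], List[int]]:
    """Split configured indices into (valid, invalid) for a host with device_count GPUs.

    Hypothetical helper, not part of OpenFL.
    """
    valid: List[int] = []
    invalid: List[int] = []
    for index in configured:
        # NVML device indices run from 0 to device_count - 1.
        if 0 <= index < device_count:
            valid.append(index)
        else:
            invalid.append(index)
    if invalid:
        logger.warning(
            "Ignoring CUDA device indices not present on this host: %s", invalid
        )
    return valid, invalid


# Assumption: on a real envoy, device_count would be obtained with
#   pynvml.nvmlInit(); device_count = pynvml.nvmlDeviceGetCount()
valid, invalid = split_cuda_devices([0, 99], device_count=1)
print(valid, invalid)  # prints: [0] [99]
```

With a check like this, the envoy could report only the valid devices to the director and log a single warning for index 99, rather than logging a traceback.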