
CudaDeviceMonitor fails when a nonexistent GPU device index is configured

Open dmitryagapov opened this issue 4 years ago • 0 comments

Steps to reproduce

  1. Add a nonexistent GPU index to envoy_config.yaml:

         params:
           cuda_devices: [0,99]  # <= add the nonexistent index here

  2. Run the director
  3. Run the envoy
  4. Run JupyterLab
  5. Run an experiment
  6. Check the logs on the envoy

Expected behavior

No exceptions on the envoy

Actual behavior

[14:13:44] ERROR    Failed to get cuda device info: Invalid Argument. Check your cuda device monitor plugin.    envoy.py:165
Traceback (most recent call last):
  File "/home/dmitry/code/openfl/openfl/component/envoy/envoy.py", line 149, in _get_cuda_device_info
    memory_total = self.cuda_device_monitor.get_device_memory_total(device_id)
  File "/home/dmitry/code/openfl/openfl/plugins/processing_units_monitor/pynvml_monitor.py", line 29, in get_device_memory_total
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
  File "/home/dmitry/code/openfl/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 1655, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/home/dmitry/code/openfl/venv/lib/python3.8/site-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_InvalidArgument: Invalid Argument
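
One possible way to avoid the exception is to check the configured indices against the number of GPUs NVML reports before querying any device. The sketch below is only an illustration, not the openfl implementation: `split_device_indices` and `visible_device_count` are hypothetical helper names, and the `device_count=1` in the usage line is an assumed value for a single-GPU host. The only pynvml calls used are `nvmlInit`, `nvmlDeviceGetCount`, and `nvmlShutdown`.

```python
from typing import Iterable, List, Tuple


def split_device_indices(
    configured: Iterable[int], device_count: int
) -> Tuple[List[int], List[int]]:
    """Partition configured indices into (valid, invalid).

    An index is valid when 0 <= index < device_count.
    Hypothetical helper, not part of openfl.
    """
    valid: List[int] = []
    invalid: List[int] = []
    for index in configured:
        if 0 <= index < device_count:
            valid.append(index)
        else:
            invalid.append(index)
    return valid, invalid


def visible_device_count() -> int:
    """Ask NVML how many GPUs are visible (needs pynvml and an NVIDIA driver).

    Hypothetical helper, not part of openfl.
    """
    import pynvml  # imported lazily so the helper above works without a GPU

    pynvml.nvmlInit()
    try:
        return pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()


# Usage with the config from this issue, assuming a host with one GPU.
valid, invalid = split_device_indices([0, 99], device_count=1)
print(valid, invalid)
```

With a check like this, the envoy could log a warning for the invalid indices (here `99`) and query only the valid ones, instead of passing them to `nvmlDeviceGetHandleByIndex`.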

dmitryagapov, Mar 17 '22 11:03