Linux device detection fails when simple-framebuffer device exists
Describe the issue
When trying to import onnxruntime I get this error:
2025-12-09 22:30:45.792214368 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card0/device/vendor"
As a result of this error, it seems CUDA is not offered as an execution provider:
/usr/local/lib/python3.12/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:123: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'
Checking card0 shows that it is not the NVIDIA GPU but a simple-framebuffer device, which has no vendor or device file:
bash-4.4$ ls -l /sys/class/drm/card0/device
lrwxrwxrwx. 1 root root 0 Dec 9 21:25 /sys/class/drm/card0/device -> ../../../simple-framebuffer.0
bash-4.4$ cat /sys/class/drm/card0/device/device
cat: /sys/class/drm/card0/device/device: No such file or directory
bash-4.4$ cat /sys/class/drm/card0/device/vendor
cat: /sys/class/drm/card0/device/vendor: No such file or directory
The NVIDIA GPU is card1 instead:
bash-4.4$ cat /sys/class/drm/card1/device/vendor
0x10de
bash-4.4$ cat /sys/class/drm/card1/device/device
0x2237
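The manual checks above can be scripted. This is only a minimal diagnostic sketch, not part of ONNX Runtime: the names `scan_drm_cards` and `read_sysfs_id` are made up for illustration, and it assumes the usual `/sys/class/drm/cardN/device/{vendor,device}` layout shown in the shell output.

```python
import re
from pathlib import Path

# Only match cardN entries; connectors such as card0-HDMI-A-1 and render nodes are ignored.
CARD_NAME = re.compile(r"^card\d+$")


def read_sysfs_id(path):
    """Return the stripped file contents, or None if the file is missing or unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def scan_drm_cards(root="/sys/class/drm"):
    """Map each cardN entry under root to (vendor, device); None marks a missing file."""
    root = Path(root)
    if not root.is_dir():
        return {}
    cards = {}
    for entry in sorted(root.iterdir()):
        if CARD_NAME.match(entry.name):
            cards[entry.name] = (
                read_sysfs_id(entry / "device" / "vendor"),
                read_sysfs_id(entry / "device" / "device"),
            )
    return cards
```

Going by the output above, `scan_drm_cards()` on this machine should map card0 to `(None, None)` and card1 to `('0x10de', '0x2237')`. A card with missing files is reported rather than stopping the scan.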
Unfortunately, due to the way the device detection is written, if either of those files is missing, GetGpuDeviceFromSysfs returns an error: https://github.com/microsoft/onnxruntime/blob/d1abad00eb2a173ab927f63da1891159d3682750/onnxruntime/core/platform/linux/device_discovery.cc#L124
When that happens, GetGpuDevices aborts and never examines the remaining GPUs: https://github.com/microsoft/onnxruntime/blob/d1abad00eb2a173ab927f63da1891159d3682750/onnxruntime/core/platform/linux/device_discovery.cc#L158
To fix this, the code should keep scanning the remaining GPUs even if an earlier one returned an error. I'm not a C++ developer, but a possible fix could be something like:
for (const auto& gpu_sysfs_path_info : gpu_sysfs_path_infos) {
  OrtHardwareDevice gpu_device{};
  // Skip devices that cannot be read instead of aborting the whole scan.
  Status status = GetGpuDeviceFromSysfs(gpu_sysfs_path_info, gpu_device);
  if (status.IsOK()) {
    gpu_devices.emplace_back(std::move(gpu_device));
  }
}
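To make the difference in behaviour concrete, here is a small Python model of the two loops. It is a sketch of the control flow described in this issue, not the actual ONNX Runtime code: `FAKE_SYSFS`, `fake_read_device`, `discover_fail_fast` and `discover_skip_failures` are all hypothetical names, and the fake data mirrors the card0/card1 situation above.

```python
# Hypothetical stand-in for sysfs: card0 (simple-framebuffer) has no PCI IDs at all.
FAKE_SYSFS = {"card1": {"vendor": "0x10de", "device": "0x2237"}}


def fake_read_device(card):
    """Return the IDs for a card, or raise like a failed file open would."""
    if card not in FAKE_SYSFS:
        raise FileNotFoundError(f"Failed to open file: {card}/device/vendor")
    return FAKE_SYSFS[card]


def discover_fail_fast(cards, read_device):
    """Model of the behaviour described in this issue: the first error aborts the scan."""
    devices = []
    for card in cards:
        devices.append(read_device(card))  # an exception here discards every device
    return devices


def discover_skip_failures(cards, read_device):
    """Model of the proposed behaviour: record the failure and keep scanning."""
    devices, skipped = [], []
    for card in cards:
        try:
            devices.append(read_device(card))
        except OSError as exc:
            skipped.append((card, str(exc)))
    return devices, skipped
```

With `["card0", "card1"]` as input, `discover_fail_fast` raises on card0 and never reaches card1, while `discover_skip_failures` returns the card1 device plus a record of the card0 failure. Keeping the skipped entries means the failure can still be logged as a warning.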
To reproduce
Import onnxruntime on a g5.4xlarge instance.
Urgency
No response
Platform
Linux
OS Version
6.1.158-178.288.amzn2023.x86_64
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu = 1.23.2
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12