Linux device detection fails when simple-framebuffer device exists

Open azhao12345 opened this issue 2 months ago • 0 comments

Describe the issue

When I import onnxruntime, I get this warning:

2025-12-09 22:30:45.792214368 [W:onnxruntime:Default, device_discovery.cc:164 DiscoverDevicesForPlatform] GPU device discovery failed: device_discovery.cc:89 ReadFileContents Failed to open file: "/sys/class/drm/card0/device/vendor"

As a result, it seems CUDA is not offered as an execution provider:

/usr/local/lib/python3.12/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:123: UserWarning: Specified provider 'CUDAExecutionProvider' is not in available provider names.Available providers: 'AzureExecutionProvider, CPUExecutionProvider'

Checking card0 shows that it is not the NVIDIA GPU but a simple-framebuffer device, which has no vendor or device file:

bash-4.4$ ls -l /sys/class/drm/card0/device
lrwxrwxrwx. 1 root root 0 Dec  9 21:25 /sys/class/drm/card0/device -> ../../../simple-framebuffer.0
bash-4.4$ cat /sys/class/drm/card0/device/device
cat: /sys/class/drm/card0/device/device: No such file or directory
bash-4.4$ cat /sys/class/drm/card0/device/vendor
cat: /sys/class/drm/card0/device/vendor: No such file or directory

The NVIDIA GPU is exposed as card1 instead:

bash-4.4$ cat /sys/class/drm/card1/device/vendor
0x10de
bash-4.4$ cat /sys/class/drm/card1/device/device
0x2237
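To run the same check across every card at once, a small script along these lines may help. It is only a diagnostic sketch, not part of onnxruntime: the helper names `read_sysfs_id` and `list_drm_cards` are made up for illustration, and it assumes the usual `/sys/class/drm/cardN/device/{vendor,device}` layout shown above.

```python
import re
from pathlib import Path


def read_sysfs_id(path):
    """Return the stripped contents of a sysfs file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def list_drm_cards(drm_root="/sys/class/drm"):
    """Map each cardN entry to its (vendor, device) IDs; unreadable files become None."""
    cards = {}
    root = Path(drm_root)
    if not root.is_dir():
        return cards
    for entry in sorted(root.iterdir()):
        # Keep only cardN entries; skip connectors such as card0-HDMI-A-1 and renderD* nodes.
        if re.fullmatch(r"card\d+", entry.name):
            cards[entry.name] = (
                read_sysfs_id(entry / "device" / "vendor"),
                read_sysfs_id(entry / "device" / "device"),
            )
    return cards


if __name__ == "__main__":
    for name, (vendor, device) in list_drm_cards().items():
        print(f"{name}: vendor={vendor} device={device}")
```

A card that reports `None` for both IDs, like card0 here, is the kind of entry that trips up discovery.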

Unfortunately, because of the way device detection is written, GetGpuDeviceFromSysfs returns an error status if either of those files is missing: https://github.com/microsoft/onnxruntime/blob/d1abad00eb2a173ab927f63da1891159d3682750/onnxruntime/core/platform/linux/device_discovery.cc#L124

When that happens, GetGpuDevices aborts entirely and never examines the remaining GPUs: https://github.com/microsoft/onnxruntime/blob/d1abad00eb2a173ab927f63da1891159d3682750/onnxruntime/core/platform/linux/device_discovery.cc#L158

To fix this, the code should keep scanning the remaining GPUs even if an earlier one produced an error. I'm not a C++ developer, but a possible fix could be something like:

  for (const auto& gpu_sysfs_path_info : gpu_sysfs_path_infos) {
    OrtHardwareDevice gpu_device{};
    // Skip entries that cannot be read (e.g. simple-framebuffer) instead of aborting discovery.
    Status status = GetGpuDeviceFromSysfs(gpu_sysfs_path_info, gpu_device);
    if (status.IsOK()) {
      gpu_devices.emplace_back(std::move(gpu_device));
    }
  }
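The difference between the two behaviors can be illustrated with a simplified Python model. This is not the onnxruntime implementation: the function names and the `cards` list are hypothetical, and an unreadable card is represented simply as `None`.

```python
def discover_abort_on_error(cards):
    """Model of the reported behavior: the first unreadable card ends discovery."""
    gpus = []
    for name, ids in cards:
        if ids is None:
            raise RuntimeError(f"Failed to read vendor/device for {name}")
        gpus.append((name, ids))
    return gpus


def discover_skip_on_error(cards):
    """Model of the proposed behavior: unreadable cards are skipped, the rest are kept."""
    return [(name, ids) for name, ids in cards if ids is not None]


# Hypothetical input mirroring the machine in this report:
# card0 is the simple-framebuffer (unreadable), card1 is the NVIDIA GPU.
cards = [("card0", None), ("card1", ("0x10de", "0x2237"))]

if __name__ == "__main__":
    print(discover_skip_on_error(cards))
    try:
        discover_abort_on_error(cards)
    except RuntimeError as err:
        print(f"discovery aborted: {err}")
```

With the skip variant, card1 is still reported even though card0 comes first and cannot be read. With the abort variant, card1 is never reached.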

To reproduce

Import onnxruntime on a g5.4xlarge instance.

Urgency

No response

Platform

Linux

OS Version

6.1.158-178.288.amzn2023.x86_64

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu = 1.23.2

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12

Dec 09 '25 22:12 azhao12345