RmCheckForExternalGpu does not always identify an eGPU correctly

Open artlav opened this issue 7 months ago • 0 comments

NVIDIA Open GPU Kernel Modules Version

575.64

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

[x] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

Linux frx 6.15.4-arch2-1 #1 SMP PREEMPT_DYNAMIC Fri, 27 Jun 2025 16:35:07 +0000 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

[x] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 5070 Ti

Describe the bug

I have a 5070 Ti in an ADTLINK UT4G dock, connected to a Framework 16 laptop.

Trying to safely remove it (i.e. "echo 0000:09:00.0 >/sys/bus/pci/drivers/nvidia/unbind") produces a "NVRM: Attempting to remove device with non-zero usage count!" error. Just unplugging it produces a flurry of assertion errors. In both cases the driver then locks up, requiring a hard power cycle to use the laptop again.

Looking at the code, the only way nv_pci_remove (in nvidia/nv-pci.c) can print that error is if nv->is_external_gpu is not set. The flag is set based on the output of RmCheckForExternalGpu, in src/nvidia/arch/nvalloc/unix/src/osinit.c. Clearly, it does not detect the GPU as an eGPU on my hardware.

I am not familiar enough with the details to figure out why, but as an experiment I patched RmCheckForExternalGpu to always return NV_TRUE, and that solved the issue. (As in, the kernel no longer locks up. Full hotplug is still broken for unrelated reasons, see #842 for example.)
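
To make the reasoning above concrete, here is a minimal standalone model of the decision as I understand it. This is not the driver's code: struct gpu_model, enum remove_path and classify_remove are hypothetical names invented for this sketch, and the only things taken from the driver are the is_external_gpu flag, the usage count and the error text.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the per-device state. Only the two values the
 * issue talks about are modelled. */
struct gpu_model {
    bool is_external_gpu;  /* what the eGPU detection reported at init time */
    long usage_count;      /* clients still holding the device open */
};

/* Hypothetical labels for the outcomes discussed in the issue. */
enum remove_path {
    REMOVE_CLEAN,          /* no clients left, nothing to complain about */
    REMOVE_EGPU_SURPRISE,  /* clients present, device flagged as eGPU */
    REMOVE_USAGE_ERROR     /* clients present, device not flagged: error */
};

/* Models only the check described above, not the real nv_pci_remove. */
static enum remove_path classify_remove(const struct gpu_model *gpu)
{
    if (gpu->usage_count == 0)
        return REMOVE_CLEAN;

    /* An eGPU dropping off the bus with active clients is tolerated. */
    if (gpu->is_external_gpu)
        return REMOVE_EGPU_SURPRISE;

    /* Assumed to correspond to the path hit on the affected hardware. */
    fprintf(stderr,
            "NVRM: Attempting to remove device with non-zero usage count!\n");
    return REMOVE_USAGE_ERROR;
}
```

With is_external_gpu false and a non-zero usage count, the model lands in REMOVE_USAGE_ERROR, which matches what I see. Forcing the flag to true, as the experimental patch effectively does, moves the same device to REMOVE_EGPU_SURPRISE.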

To Reproduce

Plug in an eGPU that the driver fails to identify as external.

Try to remove it or unplug it.

The module is now locked up with 100% CPU usage on one core.

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

Jul 01 '25 12:07 artlav