open-gpu-kernel-modules

Add support for Thunderbolt hotplug

Open bdandy opened this issue 9 months ago • 1 comments

NVIDIA Open GPU Kernel Modules Version

NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 570.144 Release Build (notroot@4edfd97358ec) Fri Apr 25 18:29:08 UTC 2025

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

CachyOS latest

Kernel Release

Linux cachyos 6.14.4-2-cachyos #1 SMP PREEMPT_DYNAMIC Fri, 25 Apr 2025 18:15:23 +0000 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 3060 (UUID: GPU-0fded5d6-1fad-c667-3302-31e117ab858a)

Describe the bug

I'm using a dGPU in a Thunderbolt 3 enclosure. The main issue is that during sleep/suspend, or when the device is reconnected, the kernel module goes into the "fallen off the bus" state and does not reconnect. The device cannot be used at all until reboot.

Hotplug works on Windows btw.

To Reproduce

Disconnect the dGPU from Thunderbolt and connect it again.

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

bdandy avatar Apr 29 '25 16:04 bdandy

Can confirm that this doesn't happen with the proprietary driver. I've got a 3070 in an "Intel Tamales Module 2" type eGPU case, and I can hot plug and unplug it without any issues.

With the open driver, unplugging gives "NVRM: Xid (PCI:0000:09:00): 79, GPU has fallen off the bus.", followed by a bunch of

[   54.962032] NVRM: _issueRpcAndWait: rpcSendMessage failed with status 0x0000000f for fn 10!
[   54.962034] NVRM: rpcRmApiFree_GSP: GspRmFree failed: hClient=0xc1d00010; hObject=0xbeef502d; paramsStatus=0x00000000; status=0x0000000f
[   54.962036] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_client.c:843
[   54.962039] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:259
[   54.962042] NVRM: nvAssertFailedNoLog: Assertion failed: status == NV_OK @ rs_server.c:1375

in dmesg.
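For anyone comparing logs across setups, here is a minimal sketch that pulls the Xid events and assertion sites out of dmesg text. It assumes the message formats shown in this thread; `summarize_nvrm` is a hypothetical helper name, not part of any NVIDIA tool, and the patterns may need adjusting for other driver versions.

```python
import re

# Patterns are based only on the NVRM lines quoted in this thread (assumption:
# other driver versions may format these messages differently).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<addr>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<msg>.*)"
)
ASSERT_RE = re.compile(r"NVRM: .*Assertion failed: .* @ (?P<site>\S+:\d+)")


def summarize_nvrm(log_text):
    """Collect Xid events and assertion sites from dmesg-style text."""
    xids = []          # (pci_address, xid_number, message)
    assert_sites = []  # "file.c:line" strings, in order of appearance
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            xids.append((m.group("addr"), int(m.group("xid")), m.group("msg").strip()))
            continue
        m = ASSERT_RE.search(line)
        if m:
            assert_sites.append(m.group("site"))
    return {"xids": xids, "assert_sites": assert_sites}
```

Feeding it the saved output of `dmesg` should list each Xid with its PCI address, which makes it easier to tell a real unplug from a link upset under load.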

After that, replugging it does not work (it's detected, and nvidia-smi and nvtop work, but nothing runs on it). Trying to run, say, vkcube gives these errors:

[  119.470389] NVRM: nvAssertFailedNoLog: Assertion failed: _serverGetClientEntryByHandle(pServer, hClient2, 0, CLIENT_LIST_LOCK_UNLOCKED, ppClientEntry2nd) @ rs_server.c:3467
[  119.470401] NVRM: RmExportObject: pRmApi->DupObject(Dev, failed with error code 0x33 in RmExportObject

Also the system won't go into suspend and requires a reboot.

What makes this a more insidious problem than just hot plugging is that the link on these eGPU boxes can be a bit flaky under load. The proprietary driver would just ride out a link upset, with maybe a hiccup in game. Meanwhile, the open driver fails completely, as if the GPU had been unplugged.

I'm on a Framework 16 laptop with Arch Linux, for context: Linux frx 6.14.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 09 May 2025 17:36:18 +0000 x86_64 GNU/Linux

artlav avatar May 12 '25 05:05 artlav