6.12: drm_open_helper RIP
NVIDIA Open GPU Kernel Modules Version
ed4be649623435ebb04f5e93f859bf46d977daa4
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [ ] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
CachyOS (ArchLinux)
Kernel Release
6.12.0-rc1
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [ ] I am running on a stable kernel release.
Hardware: GPU
GPU 0: NVIDIA GeForce RTX 4070 SUPER (UUID: GPU-8c5baf85-cb1f-fe26-95d5-ff3fd51249bb)
Describe the bug
Since the 6.12.0-rc1 release, the kernel DRM helper has been crashing with the 560.35.03 drivers.
The following patches were pulled in to make the driver compatible with 6.12; they were extracted from the 550.120 release:
- drm_fbdev fixup for 6.11+: https://github.com/CachyOS/kernel-patches/blob/master/6.12/misc/nvidia/0004-6.11-Add-fix-for-fbdev.patch
- drm_outpull_pill for 6.12: https://github.com/CachyOS/kernel-patches/blob/master/6.12/misc/nvidia/0005-6.12-drm_outpull_pill-changed-check.patch
An additional patch is needed to keep the module compiling (required because of upstream commit https://github.com/torvalds/linux/commit/32f51ead3d7771cdec29f75e08d50a76d2c6253d ):
diff --git a/kernel-open/nvidia-uvm/uvm_hmm.c b/kernel-open/nvidia-uvm/uvm_hmm.c
index 93e64424..dc64184e 100644
--- a/kernel-open/nvidia-uvm/uvm_hmm.c
+++ b/kernel-open/nvidia-uvm/uvm_hmm.c
@@ -2694,7 +2694,7 @@ static NV_STATUS dmamap_src_sysmem_pages(uvm_va_block_t *va_block,
continue;
}
- if (PageSwapCache(src_page)) {
+ if (folio_test_swapcache(page_folio(src_page))) {
// TODO: Bug 4050579: Remove this when swap cached pages can be
// migrated.
status = NV_WARN_MISMATCHED_TARGET;
With these patches the DKMS compilation succeeds and the driver works fine with the 6.11.x kernels.
Booting into 6.12.0-rc1 results in the driver crashing at drm_open_helper, and no graphical interface is available anymore. The TTY works fine. The following is visible in the dmesg log:
[ 5.090174] Console: switching to colour frame buffer device 240x67
[ 5.090176] nvidia 0000:01:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
[ 5.096243] ------------[ cut here ]------------
[ 5.096244] WARNING: CPU: 0 PID: 453 at drivers/gpu/drm/drm_file.c:312 drm_open_helper+0x135/0x150
[ 5.096249] Modules linked in: nvidia_uvm(OE) nvidia_drm(OE) drm_ttm_helper btrfs ttm blake2b_generic nvidia_modeset(OE) libcrc32c crc32c_generic xor hid_generic raid6_pq nvme nvme_core crc32c_intel video sha256_ssse3 usbhid nvme_auth wmi nvidia(OE)
[ 5.096255] CPU: 0 UID: 0 PID: 453 Comm: plymouthd Tainted: G OE 6.12.0-rc1-1-cachyos-rc #1 12df37afa12b373ced2670803975698fbda2ce5d
[ 5.096257] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 5.096257] Hardware name: ASRock X670E Pro RS/X670E Pro RS, BIOS 3.08 09/18/2024
[ 5.096258] RIP: 0010:drm_open_helper+0x135/0x150
[ 5.096259] Code: 5d 41 5c c3 cc cc cc cc 48 89 df e8 c5 82 fe ff 85 c0 0f 84 7a ff ff ff 48 89 df 89 44 24 0c e8 c1 f9 ff ff 8b 44 24 0c eb d1 <0f> 0b b8 ea ff ff ff eb c8 b8 ea ff ff ff eb c1 b8 f0 ff ff ff eb
[ 5.096260] RSP: 0018:ffffa643409ffb20 EFLAGS: 00010246
[ 5.096261] RAX: ffffffffc15df380 RBX: ffff89f744740f28 RCX: 0000000000000000
[ 5.096262] RDX: ffff89f755ee0000 RSI: ffff89f744740f28 RDI: ffff89f74df1cd80
[ 5.096262] RBP: ffff89f74df1cd80 R08: 0000000000000006 R09: ffff89f740213cd0
[ 5.096263] R10: 00000000000000e2 R11: 0000000000000002 R12: ffff89f75735a000
[ 5.096263] R13: ffffffffc15df380 R14: 00000000ffffffed R15: ffffa643409ffe1c
[ 5.096264] FS: 00007f6b595ce480(0000) GS:ffff8a065ce00000(0000) knlGS:0000000000000000
[ 5.096264] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.096265] CR2: 000055da04c46558 CR3: 000000010d18c000 CR4: 0000000000f50ef0
[ 5.096265] PKRU: 55555554
[ 5.096266] Call Trace:
[ 5.096267] <TASK>
[ 5.096267] ? drm_open_helper+0x135/0x150
[ 5.096268] ? __warn.cold+0xad/0x116
[ 5.096270] ? drm_open_helper+0x135/0x150
[ 5.096272] ? report_bug+0x127/0x170
[ 5.096273] ? handle_bug+0x58/0x90
[ 5.096275] ? exc_invalid_op+0x1b/0x80
[ 5.096276] ? asm_exc_invalid_op+0x1a/0x20
[ 5.096279] ? drm_open_helper+0x135/0x150
[ 5.096279] drm_open+0x81/0x110
[ 5.096280] drm_stub_open+0xaf/0x100
[ 5.096282] chrdev_open+0xc5/0x260
[ 5.096285] ? __pfx_chrdev_open+0x10/0x10
[ 5.096286] do_dentry_open+0x14b/0x490
[ 5.096287] vfs_open+0x30/0xe0
[ 5.096289] path_openat+0x84d/0x1320
[ 5.096290] ? __alloc_pages_noprof+0x183/0x350
[ 5.096292] do_filp_open+0xd2/0x180
[ 5.096293] do_sys_openat2+0xca/0x100
[ 5.096294] __x64_sys_openat+0x55/0xa0
[ 5.096295] do_syscall_64+0x82/0x190
[ 5.096296] ? handle_mm_fault+0x1d9/0x2e0
[ 5.096297] ? do_user_addr_fault+0x38d/0x6c0
[ 5.096299] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 5.096300] RIP: 0033:0x7f6b59899ae5
[ 5.096301] Code: 75 53 89 f0 f7 d0 a9 00 00 41 00 74 48 80 3d d1 b5 0d 00 00 74 6c 45 89 e2 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 8f 00 00 00 48 8b 54 24 28 64 48 2b 14 25
[ 5.096302] RSP: 002b:00007fffbdc08760 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
[ 5.096303] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f6b59899ae5
[ 5.096303] RDX: 0000000000000002 RSI: 000055da04c42a40 RDI: 00000000ffffff9c
[ 5.096303] RBP: 000055da04c42a40 R08: 0000000000000000 R09: 0000000000000007
[ 5.096304] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[ 5.096304] R13: 00007f6b599a1a50 R14: 000000000000000b R15: 000055da04c43e30
[ 5.096305] </TASK>
[ 5.096305] ---[ end trace 0000000000000000 ]---
[ 5.173332] systemd-journald[355]: Received SIGTERM from PID 1 (systemd).
To Reproduce
- Compile the 6.12.0-rc1 kernel
- Apply the above-mentioned patches to 560.35.03
- Compile the module and boot into it
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
No response
What happens if you revert that kernel change made by upstream? Reverting it made the drivers compile without additional patches; that is what I did before: https://gitlab.manjaro.org/packages/core/linux612/-/blob/ec1f53f77fd3f92f7cd4eeed444a341d8ded3291/revert-nvidia-446d0f48.patch
Thanks! Tracked internally as NV bug 4888621.
This may be related to commit 641bb4394f40 ("fs: move FMODE_UNSIGNED_OFFSET to fop_flags"). At least for nvidia-470xx it's fixed by adding the .fop_flags = FOP_UNSIGNED_OFFSET line from this patch. For me the kernel didn't fully crash, though; it just failed to detect the adapters correctly.
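For illustration, here is a minimal sketch of what that kind of change looks like in a DRM driver's file_operations table (the field list and the preprocessor guard are assumptions for illustration only; the patch linked above is the authoritative change):

/*
 * Sketch only: kernel commit 641bb4394f40 moved FMODE_UNSIGNED_OFFSET into
 * struct file_operations as the FOP_UNSIGNED_OFFSET fop_flags bit, and the
 * WARN in drm_open_helper() seen in the trace above fires when a DRM
 * driver's fops does not set it. Fields below are illustrative, not the
 * actual NVIDIA code.
 */
static const struct file_operations nv_drm_fops = {
    .owner          = THIS_MODULE,
    .open           = drm_open,
    .release        = drm_release,
    .unlocked_ioctl = drm_ioctl,
    .mmap           = drm_gem_mmap,   /* illustrative; the real driver uses its own mmap */
    .poll           = drm_poll,
    .read           = drm_read,
    .llseek         = noop_llseek,
#if defined(FOP_UNSIGNED_OFFSET)
    /* 6.12+: replaces the old FMODE_UNSIGNED_OFFSET file mode bit. */
    .fop_flags      = FOP_UNSIGNED_OFFSET,
#endif
};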
@joanbm It seems this patch does work, and I was able to boot into the 6.12 kernel properly. One more patch was required for a successful DKMS compilation, due to upstream changes:
diff --git a/kernel-open/nvidia-uvm/uvm_hmm.c b/kernel-open/nvidia-uvm/uvm_hmm.c
index 93e64424..dc64184e 100644
--- a/kernel-open/nvidia-uvm/uvm_hmm.c
+++ b/kernel-open/nvidia-uvm/uvm_hmm.c
@@ -2694,7 +2694,7 @@ static NV_STATUS dmamap_src_sysmem_pages(uvm_va_block_t *va_block,
continue;
}
- if (PageSwapCache(src_page)) {
+ if (folio_test_swapcache(page_folio(src_page))) {
// TODO: Bug 4050579: Remove this when swap cached pages can be
// migrated.
status = NV_WARN_MISMATCHED_TARGET;
Commit: https://github.com/CachyOS/CachyOS-PKGBUILDS/commit/3352d048906d755e6b49d0eee5bb86766db99bd2
- https://forums.developer.nvidia.com/t/patch-for-565-57-01-linux-kernel-6-12/313260
- https://github.com/CachyOS/CachyOS-PKGBUILDS/issues/417
@Binary-Eater
The bug hit production for me, and the internal laptop display suddenly stopped working.
If I detach the computer from HDMI, then I have no screen.
This means that the bug is critical, as it renders the system unusable.
But more importantly, this exposes a big flaw in how critical bugs are handled and prevented, project-management-wise.
Critical bugs like these need to be visually separated from the rest, for example by using a tag. And while they are present, all efforts should concentrate on fixing them before coding anything else.
It should not take two months to add two lines of code, written by someone else, for a critical bug.
The bug hit production for me, and the internal laptop display suddenly stopped working.
If you are referring to that patch, it's already included (well, a slightly different version of it) in the production branch of the drivers, i.e. 550.135, released November 20. Beta drivers like 565.57.01 are, well, betas and may not see immediate fixes. Generally I would also avoid brand-new kernel branches for a while unless you are OK with being the tester; ideally use long-term-support branches.
@ionenwks Okay, this is what has happened:
I visited the downloads repo and opened latest.txt.
This document said "550.135", yet on the repo there were three major versions after that: 555, 560, 565.
Hence I assumed that maybe "latest.txt" was outdated or something. A small improvement would probably be to name that file more explicitly, like "stable.txt".
I also see that versions are explained on a different page, here. Maybe this info needs to be in the repo itself, in the form of text files, like "1-stable.txt", "2-new-features.txt", and "3-experimental.txt". Or as a reference to that page, or both.
But if you ask me, I wouldn't have multiple "flavors" of the driver. Instead I would adopt a "stable base with some beta on top" model: release often, and only modify code further once known bugs have been addressed or decided on, i.e. once the new small piece of beta code is no longer beta.
Next, I don't see this patch applied to the 550 branch code. Hence it seems I received the critical bug anyway.
Finally, using older code, like an older kernel, counterintuitively usually leads to less stability. When you wait too long to update, bugs accumulate and you have to deal with them all hitting at the same time. When you update more often, you can handle the bugs more "drop by drop".
@es20490446e For now only the 550 series seems to more or less support 6.12. Older NVIDIA drivers, 560 (which gets no updates anymore), and the latest 565 are not yet updated, but they are patched by the user community, especially around Arch-based distros. Adding patches might not be the problem here; the problem is verifying that those patches don't create regressions and that they are tested against a wide range of NVIDIA hardware. QA normally takes time, especially with out-of-tree kernel drivers.
It is also known that NVIDIA doesn't accept issues for unreleased kernels, such as RC kernels. Normally it is critical to look at least at the rc1 releases and provide patches to users for testing as soon as possible during the development cycle. When AMD switched to FOSS, in-tree drivers, support got better. Their kernel developers are also very active on Discord and other public social media channels, working with the user and developer community to fix issues quickly. Maybe NVIDIA might want to change its policies and adopt one or two things other companies do.
Latest example of needed code changes in their open drivers for 6.13 can be seen here: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/746
Also it seems that modesetting needs to be active with 6.12+ kernels. Not all drivers default to that setting yet.
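For reference, enabling modesetting for the NVIDIA DRM module is commonly done with a modprobe option; a minimal sketch follows (the config file name is only a convention, and the initramfs rebuild command depends on the distribution):

# /etc/modprobe.d/nvidia-drm-modeset.conf (file name is a convention, not required)
options nvidia-drm modeset=1
# Then rebuild the initramfs and reboot, e.g.:
#   Arch-based distros:  mkinitcpio -P
#   openSUSE (dracut):   dracut --regenerate-all --force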
@philmmanjaro Oki, thanks for the info 👍
OpenSUSE just rolled out kernel 6.12 and this bug is still in effect when using the drivers from the cuda repository maintained by nvidia. I'll update if I find out how/if it's intended to work
OpenSUSE just rolled out kernel 6.12 and this bug is still in effect when using the drivers from the cuda repository maintained by nvidia. I'll update if I find out how/if it's intended to work
@pallaswept I have the same issue. If you find a solution, please share it. Thank you!
Same issue for me on Tumbleweed with 6.12 kernel using the cuda 565 drivers.
It seems this also affects closed source drivers?
I zypper dup'ed yesterday to Tumbleweed 20241226 using Linux kernel 6.12 and now have a black screen too. I can boot into an older kernel, but then dmesg complains about incompatible versions; still a black screen.
[ 10.800842] [ T1343] NVRM: API mismatch: the client has the version 550.100, but
NVRM: this kernel module has the version 550.90.07. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
How would I downgrade "the client"? All packages I see regarding v550 are:
# rpm -qa | fgrep 550.
nvidia-video-G06-32bit-550.100-25.1.x86_64
nvidia-compute-G06-32bit-550.100-25.1.x86_64
nvidia-utils-G06-550.100-25.1.x86_64
nvidia-video-G06-550.100-25.1.x86_64
nvidia-driver-G06-kmp-default-550.100_k6.9.7_1-25.1.x86_64
nvidia-compute-G06-550.100-25.1.x86_64
nvidia-drivers-G06-550.100-25.1.x86_64
nvidia-gl-G06-32bit-550.100-25.1.x86_64
nvidia-gl-G06-550.100-25.1.x86_64
nvidia-compute-utils-G06-550.100-25.1.x86_64
I did a dracut --regenerate-all --force, although that was never needed before, but the nvidia.ko in all module dirs is still 550.90.07. Building 550.100 obviously failed, but why do I have nvidia-driver-G06-kmp-default-550.100_k6.9.7_1-25.1.x86_64 then?
I'm willing to change to the open-source driver once I find out that my 1650 SUPER is supported, but it seems it won't help?
this kernel module has the version 550.90.07
nvidia-driver-G06-kmp-default-550.100
The bad news is, your PC is schizophrenic 😆 The good news is, that's not this bug, and you're happy using 550 drivers which are supported, so you can probably fix it.
This should probably be discussed elsewhere (the opensuse forums or reddit or something). Maybe start by taking a look at zypper se -iv nvidia and make sure your packages are coming from the correct repository/vendor.
GL mate.
Thanks, let me summarize the current status before I go:
- found that, besides the 37489 build warnings of the nvidia module, I somehow missed the one error below; dracut also didn't bother to complain about it but silently used the old 550.90 module:
/usr/src/kernel-modules/nvidia-550.100-default/nvidia-drm/nvidia-drm-drv.c:207:6: error: ‘const struct drm_mode_config_funcs’ has no member named ‘output_poll_changed’
- ugly workaround to inject the fix (the member is not used) into the build until Tumbleweed releases a proper fix:
- in one terminal, start building the module, e.g. with
zypper in -f -y -l nvidia-driver-G06-kmp-default
- in another terminal, edit the file once it appears, e.g. with
sed -i 's@.output_poll_changed =@// .output_poll_changed =@' /usr/src/kernel-modules/nvidia-550.100-default/nvidia-drm/nvidia-drm-drv.c
Is there any workaround to be able to use the internal laptop monitor with full NVIDIA graphics? The NVIDIA drivers no longer display anything on my laptop screen.
@username227 would you mind sharing the logs collected by the nvidia-bug-report.sh script?
@username227 would you mind sharing the logs collected by the nvidia-bug-report.sh script?
Attached. Generated while running in hybrid mode since nvidia mode gives only black screen.
A few years ago I made a big mistake. I replaced my trusty and reliable high-end GPU with an NVIDIA card for the wrong reasons: to be compatible with some work software during lockdown.
Since then, regular troubles. Now, since the 6.12 kernels, no dual-screen setup anymore.
Today is a good day: the only NVIDIA product I've ever bought (and ever will buy) is going to be replaced and will end up in the trash!
Too many issues, too much trouble.
I'll update if I find out how/if it's intended to work
OpenSUSE users, the necessary changes have been applied, and we should be able to have a working 6.12+cuda system now.
There is one change: we need to switch to the kernel module package from openSUSE's own repository, rather than the NVIDIA cuda repository.
This should be fairly simple, just one command:
sudo zypper in nvidia-open-driver-G06-signed-cuda-kmp-default
It will generate an error message that this new package conflicts with the old one, and if you have the nvidia-open meta package installed, it will also conflict with that. Confirm that you wish to use the new package and remove the old ones.
@bmwGTR @TerohsLab
I have the cuda-drivers meta package from the cuda repos installed.
Also, given the naming convention of nvidia-open-driver-G06-signed-cuda-kmp-default, it's not updated for 6.12.10 yet.
@pallaswept
via https://bugzilla.suse.com/show_bug.cgi?id=1234914
For now, I'd suggest to not use cuda-drivers at all when using open source drivers.
So, I think it's best to let it remove that package, too.
I had the same concern with the versioning, but I can confirm it is definitely working. A snippet from my terminal just now for your reference:
> zypper search -i --details nvidia
Loading repository data...
Reading installed packages...
S | Name | Type | Version | Arch | Repository
---+------------------------------------------------+---------+-------------------------+--------+------------------
i+ | kernel-firmware-nvidia | package | 20250111-1.1 | noarch | repo-oss
i+ | kernel-firmware-nvidia-gspx-G06 | package | 550.135-1.1 | x86_64 | (System Packages)
i+ | kernel-firmware-nvidia-gspx-G06 | package | 550.127.05-1.1 | x86_64 | (System Packages)
i+ | kernel-firmware-nvidia-gspx-G06 | package | 550.120-2.1 | x86_64 | (System Packages)
i+ | kernel-firmware-nvidia-gspx-G06 | package | 565.57.01-1 | x86_64 | cuda
i+ | kernel-firmware-nvidia-gspx-G06 | package | 560.35.03-0 | x86_64 | cuda
i+ | kernel-firmware-nvidia-gspx-G06 | package | 550.142-1.1 | x86_64 | repo-oss
i | libnvidia-container-tools | package | 1.17.3-1 | x86_64 | cuda
i | libnvidia-container1 | package | 1.17.3-1 | x86_64 | cuda
i+ | nvidia-compute-G06 | package | 565.57.01-1 | x86_64 | cuda
i | nvidia-compute-G06-32bit | package | 565.57.01-1 | x86_64 | cuda
i | nvidia-compute-utils-G06 | package | 565.57.01-1 | x86_64 | cuda
i+ | nvidia-container-toolkit | package | 1.17.3-1 | x86_64 | cuda
i | nvidia-container-toolkit-base | package | 1.17.3-1 | x86_64 | cuda
i+ | nvidia-drivers-G06 | package | 565.57.01-1 | x86_64 | cuda
i+ | nvidia-gl-G06 | package | 565.57.01-1 | x86_64 | cuda
i | nvidia-gl-G06-32bit | package | 565.57.01-1 | x86_64 | cuda
i+ | nvidia-libXNVCtrl | package | 565.77-1.2 | x86_64 | repo-oss
i+ | nvidia-open-driver-G06-signed-cuda-kmp-default | package | 565.57.01_k6.12.9_1-1.3 | x86_64 | repo-oss
i | nvidia-utils-G06 | package | 565.57.01-1 | x86_64 | cuda
i+ | nvidia-video-G06 | package | 565.57.01-1 | x86_64 | cuda
i | nvidia-video-G06-32bit | package | 565.57.01-1 | x86_64 | cuda
> uname -r
6.12.10-1-default
> nvidia-smi
Wed Jan 22 23:02:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
Hope that helps. If there are still problems, we should probably take it to opensuse-specific forums, since the bug referenced here is solved for us now. I just wanted to post here to let you know.
Since driver 570 is available in cuda, this is fixed.
You only need the "nvidia-open" meta package from cuda now, which includes a patched KMP that works for the 6.12 and 6.13 kernels.
We can remove the one from SUSE now @pallaswept
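For openSUSE users, the switch described above might look roughly like this (a sketch using the package names mentioned in this thread; zypper will prompt about the conflict, and repository setup may differ on your system):

# Remove the distro-built KMP suggested earlier and install the cuda repo's
# nvidia-open meta package instead (confirm the conflict resolution when asked).
sudo zypper rm nvidia-open-driver-G06-signed-cuda-kmp-default
sudo zypper in nvidia-open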
Since driver 570 is available in cuda, this is fixed.
Well, this was fixed prior to that, in the patched 565.57 driver.
With the new driver version 570, the nvidia-open metapackage works for the open drivers, and cuda-drivers for the closed drivers. The situation I mentioned above holds true for the driver prior to that, for 6.12 as referenced in this thread, but isn't current any more with 6.13/570.
As for 570... you've missed a lot of this show. That driver was added, then removed, along with this 565.57 driver and other previous drivers (removed from the repo but not from the server; still, our systems broke), then the entire repository was taken down (just the repo, not the RPMs), then brought back up with that driver again. Right now the driver is there, but it's only been working for a few hours; I don't think I'd consider it quite stable yet.
Follow that situation here: https://github.com/NVIDIA/open-gpu-kernel-modules/issues/751
Basically, it's time to let this issue rest and we can move over to that other one for 6.13 and 570... Although apparently, the cuda repos aren't supported by this github repo, so perhaps it's best to post about it elsewhere... Maybe we find out where on that other thread... Just... Not this thread. The other people in this thread are probably tired of hearing about opensuse problems. I wanted to keep opensuse users in the loop but I don't want to spam everyone else.
570 will fix what? Will my built-in laptop screen work again?
@username227 The symptoms that @MartinHerren reports are the same as yours, and it also happened to me. So very likely this fix will solve your problem.
Personally I chose to stay with the latest stable release, and that fixed the problem for me.
@username227 The symptoms that @MartinHerren reports are the same as yours, and it also happened to me. So very likely this fix will solve your problem.
Personally I chose to stay with the latest stable release, and that fixed the problem for me.
I'm using Arch, and everything is still using 565.77, and that is still giving me the problem. What version are you using? How can I update to 570 now?
@username227 The symptoms that @MartinHerren reports are the same as yours, and it also happened to me. So very likely this fix will solve your problem.
Personally I chose to stay with the latest stable release, and that fixed the problem for me.
By stable, you mean 550.144.3? I have still been getting the blank screen on my internal laptop with those.
At the risk of stating the obvious, there is a reason why the 570 GeForce driver hasn't been released yet, and it's not because we're just sitting on it waiting for whatever. The version that shipped with CUDA hasn't gone through the full graphics QA suite.
The version that will eventually be released will have some graphics specific fixes that didn't make the cut for the CUDA release.
Everyone is, of course, free to experiment with the CUDA driver for desktop usage, but please note that it is not a supported use.
(Sorry for going further offtopic here. Just wanted to post somewhere so it can be linked to)