open-gpu-kernel-modules

Suspend sometimes causes a crash when using the open 555.52.04 drivers

Open urbenlegend opened this issue 1 year ago • 44 comments

NVIDIA Open GPU Kernel Modules Version

555.52.04

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [X] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

Linux arch-desktop 6.9.4-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 12 Jun 2024 20:17:17 +0000 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [X] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 3090

Describe the bug

Sometimes when I attempt to suspend-to-RAM, the machine fails to suspend and instead gets stuck on a black screen. I have to hard reset the machine to get it back. In the system logs, there is a crash call trace for the Nvidia driver: suspend_hang.txt

To Reproduce

It happens rarely and randomly. I don't know exactly what causes it. Most of the time it suspends fine, but sometimes it crashes.

Bug Incidence

Sometimes

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

I've uploaded the generated bug report, but I am not sure if it includes the crash since I had to reboot before I could run the nvidia-bug-report.sh command. Doing a quick search in the log indicates that the crash trace is not in it. That is why I uploaded a separate suspend_hang.txt which does include the crash logs from the previous boot.
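
For anyone else who has to hard reset before running nvidia-bug-report.sh: assuming systemd-journald with persistent storage, the kernel log of the previous boot is usually still available. Below is a minimal sketch of the "quick search" described above. The function name log_has_vmap_failure is made up for illustration, and the search string is the NVRM message that the patch and logs later in this thread refer to.

```sh
# Succeeds if the log text on stdin contains the vmap() allocation failure.
# -F treats the pattern as a fixed string, so the parentheses need no escaping.
log_has_vmap_failure() {
    grep -qF 'NVRM: failed to allocate vmap() page descriptor table'
}

# Example use (not run here):
#   journalctl -k -b -1 > suspend_hang.txt     # kernel messages, previous boot
#   log_has_vmap_failure < suspend_hang.txt && echo "crash trace present"
#   zcat nvidia-bug-report.log.gz | log_has_vmap_failure || echo "not in bug report"
```

If the journal is not persistent, the previous boot is gone after the reset and only a log captured before the reboot would help.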

More Info

No response

urbenlegend avatar Jun 14 '24 10:06 urbenlegend

Thank you for the report. I've filed NVIDIA internal bug 4706166 to track this.

If you're willing to rebuild the open kernel modules, could you please apply this patch, and then upload the system log after the problem reproduces again? Thanks!

$ cat 0001-instrumentation-for-suspend-crash.patch 
From 44afc9067af6df0671724e37b8f2c2cde7386590 Mon Sep 17 00:00:00 2001
From: Andy Ritger <[email protected]>
Date: Mon, 17 Jun 2024 15:03:14 -0700
Subject: [PATCH] instrumentation for suspend crash
X-NVConfidentiality: public

---
 kernel-open/nvidia/nv.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel-open/nvidia/nv.c b/kernel-open/nvidia/nv.c
index 99792de96307..bc003399cd83 100644
--- a/kernel-open/nvidia/nv.c
+++ b/kernel-open/nvidia/nv.c
@@ -3111,7 +3111,8 @@ nv_map_guest_pages(nv_alloc_t *at,
     if (pages == NULL)
     {
         nv_printf(NV_DBG_ERRORS,
-                  "NVRM: failed to allocate vmap() page descriptor table!\n");
+                  "NVRM: failed to allocate vmap() page descriptor table! (page_count: %d)\n", page_count);
+        dump_stack();
         return 0;
     }
 
@@ -3604,7 +3605,8 @@ void* NV_API_CALL nv_alloc_kernel_mapping(
             if (pages == NULL)
             {
                 nv_printf(NV_DBG_ERRORS,
-                          "NVRM: failed to allocate vmap() page descriptor table!\n");
+                          "NVRM: failed to allocate vmap() page descriptor table! (page_count:%d)\n", page_count);
+                dump_stack();
                 return NULL;
             }
 
-- 
2.44.0
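
For readers who have not rebuilt the open modules before, the steps could look roughly like the sketch below. It assumes the patch above is saved locally, that the public GitHub repository tags releases by driver version, and that the patch applies cleanly to the chosen tag (it may not, since the diff was generated against a different tree). rebuild_with_patch is a hypothetical helper name; the make targets are the ones from the repository README.

```sh
# Hypothetical helper: rebuild the open kernel modules with a patch applied.
#   $1 = release tag to build (assumed to match the installed driver, e.g. 555.52.04)
#   $2 = absolute path to the patch file
# The body runs in a subshell so the caller's working directory is unchanged.
rebuild_with_patch() (
    tag=$1
    patch_file=$2
    git clone --depth 1 --branch "$tag" \
        https://github.com/NVIDIA/open-gpu-kernel-modules.git &&
    cd open-gpu-kernel-modules &&
    git apply "$patch_file" &&
    make modules -j"$(nproc)" &&
    sudo make modules_install -j"$(nproc)"
)

# Example use (not run here):
#   rebuild_with_patch 555.52.04 "$PWD/0001-instrumentation-for-suspend-crash.patch"
```

The rebuilt modules only take effect after the old ones are unloaded or the machine is rebooted. Distribution packages (DKMS, for instance) may need a different procedure.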

aritger avatar Jun 17 '24 22:06 aritger

Thanks for the patch. I am currently on the proprietary 555.58.02 module because I need to avoid the slowdowns in KDE caused by the GSP firmware, so I have not run into this sleep issue again. Once the GSP bug is resolved, I will switch to the open module again and apply the patch to see what's going on.

urbenlegend avatar Jul 03 '24 20:07 urbenlegend

I am on the proprietary nvidia 555.58.02-1 driver and have the same problems as those listed here. I use Arch Linux with an NVIDIA GeForce RTX™ 3050 Laptop GPU.

I have tried:

  • linux 6.9.7 (or) linux-lts 6.6.37
  • NVreg_EnableS0ixPowerManagement (or) NVreg_PreserveVideoMemoryAllocations on /var/tmp (over 250GB space left)
  • nvidia_drm.modeset 0 (or) 1 as boot parameter
  • nvidia_drm.fbdev 0 (or) 1 as boot parameter
  • X11 (or) Xwayland

Nothing helped (except module_blacklist=nvidia).

Jul 06 03:35:17 archlinux kernel: NVRM: failed to allocate vmap() page descriptor table!
Jul 06 03:35:17 archlinux kernel: NVRM: GPU at PCI:0000:01:00: GPU-887a46df-29b2-be1c-8c55-e637117338ba
Jul 06 03:35:17 archlinux kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=8874, name=kworker/u48:11, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080205b 0x4).
Jul 06 03:35:17 archlinux kernel: NVRM: GPU0 GSP RPC buffer contains function 76 (GSP_RM_CONTROL) and data 0x000000002080205b 0x0000000000000004.
Jul 06 03:35:17 archlinux kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Jul 06 03:35:17 archlinux kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
Jul 06 03:35:17 archlinux kernel: NVRM:      0    76   GSP_RM_CONTROL        0x000000002080205b 0x0000000000000004 0x00061c8a2d4a3f2a 0x0000000000000000          y
Jul 06 03:35:17 archlinux kernel: NVRM:     -1    47   UNLOADING_GUEST_DRIVE 0x0000000000000000 0x0000000000000000 0x00061c8a2d32031f 0x00061c8a2d34fd27 195080us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -2    10   FREE                  0x00000000c1e016c0 0x0000000000000000 0x00061c8a2d320088 0x00061c8a2d3202fc    628us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -3    10   FREE                  0x000000000000000a 0x0000000000000000 0x00061c8a2d31fa32 0x00061c8a2d320087   1621us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -4    10   FREE                  0x000000000000000b 0x0000000000000000 0x00061c8a2d31f763 0x00061c8a2d31f943    480us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -5    10   FREE                  0x0000000000000006 0x0000000000000000 0x00061c8a2d31f52a 0x00061c8a2d31f75e    564us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -6    10   FREE                  0x0000000000000002 0x0000000000000000 0x00061c8a2d31e4fe 0x00061c8a2d31f4fd   4095us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -7    10   FREE                  0x0000000000000005 0x0000000000000000 0x00061c8a2d31da4b 0x00061c8a2d31e4fb   2736us  
Jul 06 03:35:17 archlinux kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Jul 06 03:35:17 archlinux kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
Jul 06 03:35:17 archlinux kernel: NVRM:      0    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x00061c8a2d32b15b 0x00061c8a2d32b15c      1us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -1    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000028 0x00061c8a2d324d79 0x00061c8a2d324d7b      2us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -2    4111 PERF_BRIDGELESS_INFO_ 0x0000000000000000 0x0000000000000000 0x00061c8a2d2fb48c 0x00061c8a2d2fb48c           
Jul 06 03:35:17 archlinux kernel: NVRM:     -3    4111 PERF_BRIDGELESS_INFO_ 0x0000000000000000 0x0000000000000000 0x00061c8a2d24eef8 0x00061c8a2d24eef9      1us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -4    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x00061c8a2ce35a01 0x00061c8a2ce35a01           
Jul 06 03:35:17 archlinux kernel: NVRM:     -5    4108 UCODE_LIBOS_PRINT     0x0000000000000000 0x0000000000000000 0x00061c8a2ce357ff 0x00061c8a2ce35800      1us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -6    4128 GSP_POST_NOCAT_RECORD 0x0000000000000002 0x0000000000000027 0x00061c8a2ce33db1 0x00061c8a2ce33db3      2us  
Jul 06 03:35:17 archlinux kernel: NVRM:     -7    4098 GSP_RUN_CPU_SEQUENCER 0x000000000000060a 0x0000000000003fe2 0x00061c8a2ce27b5b 0x00061c8a2ce28c8b   4400us  
Jul 06 03:35:17 archlinux kernel: CPU: 4 PID: 8874 Comm: kworker/u48:11 Tainted: P           OE      6.9.7-arch1-1 #1 44783200744f92500e6484c6d93590bc19db4a83
Jul 06 03:35:17 archlinux kernel: Hardware name: Micro-Star International Co., Ltd. Thin GF63 12UC/MS-16R8, BIOS E16R8IMS.111 03/21/2024
Jul 06 03:35:17 archlinux kernel: Workqueue: async async_run_entry_fn
Jul 06 03:35:17 archlinux kernel: Call Trace:
Jul 06 03:35:17 archlinux kernel:  <TASK>
Jul 06 03:35:17 archlinux kernel:  dump_stack_lvl+0x5d/0x80
Jul 06 03:35:17 archlinux kernel:  _nv012672rm+0x437/0x4b0 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv012592rm+0x74/0x330 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv046348rm+0x49f/0x7f0 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv049583rm+0xa1/0x150 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv045638rm+0x19e/0x1b0 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv047612rm+0x3fc/0x500 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv014430rm+0x42e/0x690 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv045777rm+0x26/0x30 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv000751rm+0x55/0x70 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv000750rm+0x21b/0x220 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  _nv000701rm+0x2ad/0x300 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  rm_power_management+0x22c/0x260 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  ? wait_for_completion+0x91/0x170
Jul 06 03:35:17 archlinux kernel:  nv_power_management+0x92/0x170 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  nvidia_suspend+0x6c/0x100 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  nv_pmops_suspend+0x15/0x30 [nvidia 1c25e0c66b648a34b428942d4985971b2b3325f6]
Jul 06 03:35:17 archlinux kernel:  pci_pm_suspend+0x7c/0x170
Jul 06 03:35:17 archlinux kernel:  ? __pfx_pci_pm_suspend+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  dpm_run_callback+0x47/0x150
Jul 06 03:35:17 archlinux kernel:  device_suspend+0x141/0x510
Jul 06 03:35:17 archlinux kernel:  ? try_to_wake_up+0x76/0x660
Jul 06 03:35:17 archlinux kernel:  async_suspend+0x1d/0x30
Jul 06 03:35:17 archlinux kernel:  async_run_entry_fn+0x31/0x140
Jul 06 03:35:17 archlinux kernel:  process_one_work+0x18b/0x350
Jul 06 03:35:17 archlinux kernel:  worker_thread+0x2eb/0x410
Jul 06 03:35:17 archlinux kernel:  ? __pfx_worker_thread+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  kthread+0xcf/0x100
Jul 06 03:35:17 archlinux kernel:  ? __pfx_kthread+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  ret_from_fork+0x31/0x50
Jul 06 03:35:17 archlinux kernel:  ? __pfx_kthread+0x10/0x10
Jul 06 03:35:17 archlinux kernel:  ret_from_fork_asm+0x1a/0x30
Jul 06 03:35:17 archlinux kernel:  </TASK>
Jul 06 03:35:17 archlinux kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=8874, name=kworker/u48:11, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a81 0x4).
Jul 06 03:35:17 archlinux kernel: NVRM: Xid (PCI:0000:01:00): 119, pid=8874, name=kworker/u48:11, Timeout after 6s of waiting for RPC response from GPU0 GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a76 0x2).
Jul 06 03:35:17 archlinux kernel: NVRM: Rate limiting GSP RPC error prints for GPU at PCI:0000:01:00 (printing 1 of every 30).  The GPU likely needs to be reset.
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to determine display capabilities
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to tear down Disp
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to determine display capabilities
Jul 06 03:35:17 archlinux kernel: nvidia-modeset: ERROR: GPU:0: Failed to tear down Disp
Jul 06 03:35:17 archlinux kernel: nvidia 0000:01:00.0: PM: pci_pm_suspend(): nv_pmops_suspend+0x0/0x30 [nvidia] returns -5
Jul 06 03:35:17 archlinux kernel: nvidia 0000:01:00.0: PM: dpm_run_callback(): pci_pm_suspend+0x0/0x170 returns -5
Jul 06 03:35:17 archlinux kernel: nvidia 0000:01:00.0: PM: failed to suspend async: error -5
Jul 06 03:35:17 archlinux kernel: PM: Some devices failed to suspend, or early wake event detected
Jul 06 03:35:17 archlinux kernel: iwlwifi 0000:00:14.3: WRT: Invalid buffer destination
Jul 06 03:35:17 archlinux kernel: done.
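
When comparing logs like the one above across machines, it can help to pull out just the Xid codes (here 119, which the log itself describes as a GSP RPC timeout). A minimal sketch, assuming the "NVRM: Xid (PCI:...): <code>," line format shown above; xid_codes is an illustrative name, not an NVIDIA tool.

```sh
# Print the distinct Xid codes found in kernel log text on stdin, one per line.
xid_codes() {
    sed -n 's/.*NVRM: Xid (PCI:[^)]*): \([0-9][0-9]*\),.*/\1/p' | sort -nu
}

# Example use (not run here):
#   journalctl -k -b -1 | xid_codes
```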

abfipes12 avatar Jul 06 '24 14:07 abfipes12

Also seeing this on RTX 2070 with the proprietary 555.58.02 driver.

belegdol avatar Jul 14 '24 20:07 belegdol

I also hit this issue when the GPU device is suspended by PCI-E power management. It can be reproduced by activating NVIDIA Drain Mode (for hybrid notebooks). I am also using NVIDIA Open Kernel Modules 555.58.02.

sudo nvidia-smi drain -p 0000:01:00.0 -m 1 (my PCI ID for the GPU is 0000:01:00.0)

  • Distribution: Arch Linux x86_64
  • GPU: NVIDIA GeForce RTX 3050 Mobile
  • CPU: Intel Core i5-12500H

❯ cat /sys/bus/pci/devices/0000:01:00.0/power/runtime_status
error

This error happens both with and without GSP offload (the latter via the nvidia.NVreg_EnableGpuFirmware=0 kernel parameter).

nvidia-gspoff.log nvidia-gspon.log
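
The runtime_status check above can be wrapped in a small helper so it is easy to repeat after each attempt. This is only a sketch: the function names are invented, and the PCI address default is the one from this comment and will differ on other machines.

```sh
# Print the runtime PM status of a PCI device.
#   $1 = PCI address (defaults to the address used in this comment)
#   $2 = sysfs device root (overridable, mainly so the function can be tested)
gpu_runtime_status() {
    cat "${2:-/sys/bus/pci/devices}/${1:-0000:01:00.0}/power/runtime_status"
}

# Succeeds only when the device reports the "error" state seen above.
gpu_runtime_failed() {
    [ "$(gpu_runtime_status "$@")" = "error" ]
}

# Example use (not run here):
#   gpu_runtime_failed && echo "runtime suspend failed for the GPU"
```

On a healthy device the kernel reports values such as active or suspended instead of error.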

anandadfoxx avatar Jul 27 '24 03:07 anandadfoxx

A similar thing is happening to me on the proprietary drivers 555.58.02. I've tried it on the latest LTS kernel but that doesn't seem to resolve the issue.

nvidia-bug-report.log.gz hang.txt

cubusXD avatar Jul 29 '24 15:07 cubusXD

Same for me, and it almost always reproduces. On occasion it manages to get into suspend, but usually the laptop's power LED just stays on. Pressing a key turns the fans on again, but the screen never comes back.

nvidia-bug-report.log.gz - captured after forcibly rebooting after the issue occurred.

Here are some kernel logs as well, since they don't appear to be part of the bug report generated above:

KernelLogs.txt

As can be seen, there are a bunch of warnings and kernel backtraces around nv_set_system_power_state before the suspend fails (the excerpt is in reverse chronological order, newest line first):

aug 02 18:57:31 hephaestus kernel: ---[ end trace 0000000000000000 ]---
aug 02 18:57:31 hephaestus kernel:  </TASK>
aug 02 18:57:31 hephaestus kernel: R13: 00005b578e6140d0 R14: 00007eed2a6085c0 R15: 00007eed2a605ea0
aug 02 18:57:31 hephaestus kernel: R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000008
aug 02 18:57:31 hephaestus kernel: RBP: 00007fff7a951ad0 R08: 0000000000000410 R09: 0000000000000001
aug 02 18:57:31 hephaestus kernel: RDX: 0000000000000008 RSI: 00005b578e6140d0 RDI: 0000000000000001
aug 02 18:57:31 hephaestus kernel: RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007eed2a52c7a4
aug 02 18:57:31 hephaestus kernel: RSP: 002b:00007fff7a951aa8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
aug 02 18:57:31 hephaestus kernel: Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 28 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
aug 02 18:57:31 hephaestus kernel: RIP: 0033:0x7eed2a52c7a4
aug 02 18:57:31 hephaestus kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
aug 02 18:57:31 hephaestus kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
aug 02 18:57:31 hephaestus kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
aug 02 18:57:31 hephaestus kernel:  ? do_syscall_64+0x8e/0x190
aug 02 18:57:31 hephaestus kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
aug 02 18:57:31 hephaestus kernel:  ? syscall_exit_to_user_mode+0x73/0x1f0
aug 02 18:57:31 hephaestus kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
aug 02 18:57:31 hephaestus kernel:  do_syscall_64+0x82/0x190
aug 02 18:57:31 hephaestus kernel:  __x64_sys_write+0x72/0xf0
aug 02 18:57:31 hephaestus kernel:  ? __do_sys_newfstat+0xc7/0x100
aug 02 18:57:31 hephaestus kernel:  ? srso_alias_return_thunk+0x5/0xfbef5
aug 02 18:57:31 hephaestus kernel:  vfs_write+0xe6/0x4a0
aug 02 18:57:31 hephaestus kernel:  proc_reg_write+0x5a/0xa0
aug 02 18:57:31 hephaestus kernel:  nv_procfs_write_suspend+0xef/0x170 [nvidia dcead3f0be4643c87dfa729fd3a69234fec29f3f]
aug 02 18:57:31 hephaestus kernel:  nv_set_system_power_state+0x1cd/0x470 [nvidia dcead3f0be4643c87dfa729fd3a69234fec29f3f]
aug 02 18:57:31 hephaestus kernel:  nv_revoke_gpu_mappings_locked+0x47/0x70 [nvidia dcead3f0be4643c87dfa729fd3a69234fec29f3f]
aug 02 18:57:31 hephaestus kernel:  unmap_mapping_range+0x116/0x140
aug 02 18:57:31 hephaestus kernel:  zap_page_range_single+0x222/0x260
aug 02 18:57:31 hephaestus kernel:  untrack_pfn+0x59/0x160
aug 02 18:57:31 hephaestus kernel:  follow_phys+0x49/0x110
aug 02 18:57:31 hephaestus kernel:  ? follow_pte+0x1c2/0x1f0
aug 02 18:57:31 hephaestus kernel:  ? asm_exc_invalid_op+0x1a/0x20
aug 02 18:57:31 hephaestus kernel:  ? exc_invalid_op+0x19/0xc0
aug 02 18:57:31 hephaestus kernel:  ? handle_bug+0x3c/0x80
aug 02 18:57:31 hephaestus kernel:  ? report_bug+0xe7/0x210
aug 02 18:57:31 hephaestus kernel:  ? follow_pte+0x1c2/0x1f0
aug 02 18:57:31 hephaestus kernel:  ? __warn.cold+0x8e/0xf3
aug 02 18:57:31 hephaestus kernel:  ? follow_pte+0x1c2/0x1f0
aug 02 18:57:31 hephaestus kernel:  <TASK>
aug 02 18:57:31 hephaestus kernel: Call Trace:
aug 02 18:57:31 hephaestus kernel: PKRU: 55555554
aug 02 18:57:31 hephaestus kernel: CR2: 00007eed2a608650 CR3: 000000010677a000 CR4: 0000000000f50ef0
aug 02 18:57:31 hephaestus kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
aug 02 18:57:31 hephaestus kernel: FS:  00007eed2a3a3b80(0000) GS:ffff9e95cdb00000(0000) knlGS:0000000000000000
aug 02 18:57:31 hephaestus kernel: R13: ffff9e8ed0721080 R14: ffffbdef1247fce0 R15: ffffffffffffffff
aug 02 18:57:31 hephaestus kernel: R10: 00007709f8e52fff R11: ffff9e8f0e5ad600 R12: ffffbdef1247fb70
aug 02 18:57:31 hephaestus kernel: RBP: ffffbdef1247fb78 R08: 0000000000000020 R09: ffffffffffffffff
aug 02 18:57:31 hephaestus kernel: RDX: ffffbdef1247fb70 RSI: 00007709f34b6000 RDI: ffff9e8efb21a8a0
aug 02 18:57:31 hephaestus kernel: RAX: 0000000000000000 RBX: 00007709f34b6000 RCX: ffffbdef1247fb78
aug 02 18:57:31 hephaestus kernel: RSP: 0018:ffffbdef1247fb38 EFLAGS: 00010246
aug 02 18:57:31 hephaestus kernel: Code: e9 ee 8a f1 00 48 25 00 00 00 c0 48 09 d0 c4 e2 f8 f2 c7 75 20 e8 5e e3 ff ff 48 8b 15 57 fa 72 01 48 81 e2 00 00 00 c0 eb 8c <0f> 0b 48 3b 1f 0f 83 6c fe ff ff 41 be ea ff ff ff eb b6 48 8b 7d
aug 02 18:57:31 hephaestus kernel: RIP: 0010:follow_pte+0x1c2/0x1f0
aug 02 18:57:31 hephaestus kernel: Hardware name: LENOVO 82WS/LNVNB161216, BIOS LPCN51WW 04/22/2024
aug 02 18:57:31 hephaestus kernel: CPU: 26 PID: 3496 Comm: nvidia-sleep.sh Tainted: G        W  OE      6.10.2-arch1-1.1 #1 856328b22fcd0da354f276ff67275d0fcc220438
aug 02 18:57:31 hephaestus kernel:  ucsi_acpi btintel snd_rn_pci_acp3x libarc4 snd_pcm kvm_amd vboxdrv(OE) typec_ucsi ideapad_laptop snd_acp_config btbcm realtek nvidia_modeset(OE) cfg80211 typec videobuf2_common r8152 snd_timer btmtk sp5100_tco pkcs8_key_parser snd_soc_acpi mdio_devres sparse_keymap hid_multitouch kvm bluetooth crc16 mii mc rapl wdat_wdt pcspkr wmi_bmof k10temp i2c_piix4 snd snd_pci_acp3x libphy rfkill roles mousedev joydev apple_mfi_fastcharge nvidia_uvm(OE) soundcore legion_laptop(OE) i2c_hid_acpi platform_profile crc8 i2c_hid mac_hid nvidia(OE) i2c_dev crypto_user loop nfnetlink ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee hid_generic usbhid dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic gf128mul ghash_clmulni_intel serio_raw sha512_ssse3 atkbd sha256_ssse3 libps2 sha1_ssse3 vivaldi_fmap aesni_intel nvme crypto_simd cryptd nvme_core xhci_pci ccp i8042 xhci_pci_renesas nvme_auth video serio wmi
aug 02 18:57:31 hephaestus kernel: Modules linked in: overlay snd_seq_dummy snd_hrtimer snd_seq vfat fat r8153_ecm cdc_ether usbnet mt7921e mt7921_common mt792x_lib mt76_connac_lib mt76 snd_sof_amd_acp63 snd_sof_amd_vangogh snd_sof_amd_rembrandt snd_sof_amd_renoir amd_atl intel_rapl_msr snd_sof_amd_acp intel_rapl_common snd_sof_pci snd_hda_codec_realtek snd_sof_xtensa_dsp snd_hda_codec_generic snd_sof snd_hda_scodec_component snd_sof_utils snd_hda_codec_hdmi snd_pci_ps snd_amd_sdw_acpi soundwire_amd snd_hda_scodec_tas2781_i2c soundwire_generic_allocation snd_soc_tas2781_fmwlib snd_hda_intel soundwire_bus uvcvideo snd_soc_tas2781_comlib snd_intel_dspcfg snd_usb_audio snd_intel_sdw_acpi videobuf2_vmalloc snd_soc_core snd_rpl_pci_acp6x uvc snd_hda_codec snd_usbmidi_lib snd_acp_pci vboxnetflt(OE) videobuf2_memops snd_ump vboxnetadp(OE) snd_acp_legacy_common snd_compress snd_hda_core btusb videobuf2_v4l2 snd_rawmidi snd_pci_acp6x ac97_bus mac80211 btrtl snd_hwdep snd_seq_device snd_pci_acp5x snd_pcm_dmaengine nvidia_drm(OE) videodev r8169
aug 02 18:57:31 hephaestus kernel: WARNING: CPU: 26 PID: 3496 at include/linux/rwsem.h:80 follow_pte+0x1c2/0x1f0
aug 02 18:57:31 hephaestus kernel: ------------[ cut here ]------------

Followed by a bunch of NVRM errors:

aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: pKernelBus->pReadToFlush != NULL || pKernelBus->virtualBar2[GPU_GFID_PF].pCpuMapping != NULL @ kern_bus_gv100.c:388
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ mmu_walk_unmap.c:72
aug 02 18:57:32 hephaestus kernel: NVRM: mmuWalkUnmap: Failed to unmap VA Range 0x19a0000 to 0x19dffff. Status = 0x00000040
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ mmu_walk.c:489
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: progress == indexHi_tmp - indexLo_tmp + 1 @ mmu_walk.c:1303
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: pEntries != NULL @ gmmu_walk.c:852
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == unmapStatus @ mmu_walk_sparse.c:95
aug 02 18:57:32 hephaestus kernel: NVRM: mmuWalkSparsify: Unmap failed with status = 0x00000040
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ mmu_walk_unmap.c:72
aug 02 18:57:32 hephaestus kernel: NVRM: mmuWalkUnmap: Failed to unmap VA Range 0x19a0000 to 0x19dffff. Status = 0x00000040
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ mmu_walk.c:489
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: progress == indexHi_tmp - indexLo_tmp + 1 @ mmu_walk.c:1303
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: pEntries != NULL @ gmmu_walk.c:852
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ mmu_walk_sparse.c:84
aug 02 18:57:32 hephaestus kernel: NVRM: mmuWalkSparsify: Failed to sparsify VA Range 0x19a0000 to 0x19dffff. Status = 0x00000040
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: NV_OK == status @ mmu_walk.c:489
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: progress == indexHi_tmp - indexLo_tmp + 1 @ mmu_walk.c:1303
aug 02 18:57:32 hephaestus kernel: NVRM: nvAssertFailedNoLog: Assertion failed: pEntries != NULL @ gmmu_walk.c:852
aug 02 18:57:32 hephaestus kernel: [drm:__nv_drm_semsurf_wait_fence_work_cb [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register auto-value-update on pre-wait value for sync FD semaphore surface

I can switch between s2idle and deep sleep modes, and both exhibit the same problems. I also tested with S0ix enabled and without, but to no avail.
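
For reference, the s2idle/deep switch mentioned above goes through /sys/power/mem_sleep, where the kernel lists the supported modes and puts the active one in square brackets. A small sketch for reading it back; active_sleep_mode is an illustrative name.

```sh
# Print the active suspend mode from a mem_sleep-style file.
# The file looks like "s2idle [deep]"; the bracketed entry is the active mode.
active_sleep_mode() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "${1:-/sys/power/mem_sleep}"
}

# Example use (not run here):
#   active_sleep_mode                          # prints the current mode
#   echo deep | sudo tee /sys/power/mem_sleep  # switch modes until the next reboot
```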

Gert-dev avatar Aug 02 '24 17:08 Gert-dev

I get the same problem "NVRM: failed to allocate vmap() page descriptor table!". I tested the proprietary 555 and the open 560 beta driver.

It seems like the issue happens if I have less free system RAM than I have VRAM.
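
That hypothesis is unconfirmed, but it is easy to check before suspending. The sketch below compares MemAvailable from /proc/meminfo with the VRAM currently in use as reported by nvidia-smi; the function names are invented, and treating "available RAM at least equal to used VRAM" as the threshold is only a guess based on the observation above.

```sh
# Print MemAvailable from a meminfo-style file, converted from kB to MiB.
mem_available_mib() {
    awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeeds if available RAM ($1, MiB) is at least the VRAM in use ($2, MiB).
ram_covers_vram() {
    [ "$1" -ge "$2" ]
}

# Example use (not run here; nvidia-smi reports memory.used in MiB):
#   vram=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)
#   ram_covers_vram "$(mem_available_mib)" "$vram" || echo "low RAM: suspend may fail"
```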

My system:

  • RTX 4060 Ti 16 GB
  • 64 GB system RAM
  • Manjaro
  • GNOME on Wayland

Aug 03 23:32:58 lillypod systemd[1]: Reached target Sleep.
Aug 03 23:32:58 lillypod systemd[1]: Starting Suspend gnome-shell...
Aug 03 23:32:58 lillypod systemd[1]: gnome-shell-suspend.service: Deactivated successfully.
Aug 03 23:32:58 lillypod systemd[1]: Finished Suspend gnome-shell.
Aug 03 23:32:58 lillypod systemd[1]: Starting NVIDIA system suspend actions...
Aug 03 23:32:58 lillypod suspend[113863]: nvidia-suspend.service
Aug 03 23:32:58 lillypod logger[113863]: <13>Aug  3 23:32:58 suspend: nvidia-suspend.service
Aug 03 23:32:58 lillypod wireplumber[2282]: wplua: [string "alsa.lua"]:182: attempt to concatenate a nil value (local 'node_name')
                                            stack traceback:
                                                    [string "alsa.lua"]:182: in function <[string "alsa.lua"]:175>
Aug 03 23:32:58 lillypod wireplumber[2282]: wplua: [string "alsa.lua"]:182: attempt to concatenate a nil value (local 'node_name')
                                            stack traceback:
                                                    [string "alsa.lua"]:182: in function <[string "alsa.lua"]:175>
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/ldac
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/aptx_hd
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx_hd
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/aptx
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/aac
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aac
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/opus_g
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/opus_g
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/sbc
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/sbc
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx_ll_1
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx_ll_0
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx_ll_duplex_1
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/aptx_ll_duplex_0
Aug 03 23:32:58 lillypod kernel: rfkill: input handler enabled
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/faststream
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/faststream_duplex
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/opus_05
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/opus_05
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSink/opus_05_duplex
Aug 03 23:32:58 lillypod bluetoothd[1437]: Endpoint unregistered: sender=:1.219 path=/MediaEndpoint/A2DPSource/opus_05_duplex
Aug 03 23:33:08 lillypod systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Aug 03 23:33:09 lillypod kernel: NVRM: failed to allocate vmap() page descriptor table!
Aug 03 23:33:09 lillypod kernel: ------------[ cut here ]------------
Aug 03 23:33:09 lillypod kernel: WARNING: CPU: 2 PID: 113865 at /var/lib/dkms/nvidia/560.28.03/build/nvidia/nv.c:4598 nv_set_system_power_state+0x40d/0x470 [nvidia]
Aug 03 23:33:09 lillypod kernel: Modules linked in: dm_crypt cbc encrypted_keys trusted asn1_encoder tee rfcomm cmac algif_hash algif_skcipher af_alg hid_logitech_hidpp bnep intel_rapl_msr intel_rapl_common btusb btrtl btintel btbcm btmtk bluetooth snd_seq_dummy ip6table_filter ip>
Aug 03 23:33:09 lillypod kernel:  i2c_piix4 k10temp libcrc32c snd libphy soundcore gpio_amdpt gpio_generic mac_hid vboxnetflt(OE) vboxnetadp(OE) vboxdrv(OE) i2c_dev crypto_user fuse dm_mod loop nfnetlink bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 usbhid >
Aug 03 23:33:09 lillypod kernel: CPU: 2 PID: 113865 Comm: nvidia-sleep.sh Tainted: P           OE      6.6.41-1-MANJARO #1 3ef3dc680c6ec404f036b5609d7802e8bb7ca22a
Aug 03 23:33:09 lillypod kernel: Hardware name: ASRock B650M Pro RS WiFi/B650M Pro RS WiFi, BIOS 3.01 05/13/2024
Aug 03 23:33:09 lillypod kernel: RIP: 0010:nv_set_system_power_state+0x40d/0x470 [nvidia]
Aug 03 23:33:09 lillypod kernel: Code: 0f eb 40 4d 8b a4 24 f8 05 00 00 4d 85 e4 74 33 49 8b bc 24 d0 02 00 00 ba 01 00 00 00 89 de e8 59 c9 ff ff 89 c5 85 c0 74 d9 <0f> 0b 48 c7 c7 80 05 fc c0 41 bd 01 00 00 00 e8 cf 55 9d ce e9 f9
Aug 03 23:33:09 lillypod kernel: RSP: 0018:ffffc9000149ba80 EFLAGS: 00010206
Aug 03 23:33:09 lillypod kernel: RAX: 000000000000ffff RBX: 0000000000000001 RCX: 0000000080020000
Aug 03 23:33:09 lillypod kernel: RDX: ffff888100d705d8 RSI: 0000000000000286 RDI: ffff888100d705d0
Aug 03 23:33:09 lillypod kernel: RBP: 000000000000ffff R08: 0000000000000000 R09: 0000000080020000
Aug 03 23:33:09 lillypod kernel: R10: ffff8881e723b000 R11: 0000000000000000 R12: ffff888100d70000
Aug 03 23:33:09 lillypod kernel: R13: ffff888100d705d0 R14: ffff8881e7238000 R15: ffff8881e7238000
Aug 03 23:33:09 lillypod kernel: FS:  00007f999991bb80(0000) GS:ffff888ffe680000(0000) knlGS:0000000000000000
Aug 03 23:33:09 lillypod kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 03 23:33:09 lillypod kernel: CR2: 000000c000be2000 CR3: 00000001102c0000 CR4: 0000000000f50ee0
Aug 03 23:33:09 lillypod kernel: PKRU: 55555554
Aug 03 23:33:09 lillypod kernel: Call Trace:
Aug 03 23:33:09 lillypod kernel:  <TASK>
Aug 03 23:33:09 lillypod kernel:  ? nv_set_system_power_state+0x40d/0x470 [nvidia 0a57c395b1423c4cb77f02e496bb79cca2561369]
Aug 03 23:33:09 lillypod kernel:  ? __warn+0x81/0x130
Aug 03 23:33:09 lillypod kernel:  ? nv_set_system_power_state+0x40d/0x470 [nvidia 0a57c395b1423c4cb77f02e496bb79cca2561369]
Aug 03 23:33:09 lillypod kernel:  ? report_bug+0x16f/0x1a0
Aug 03 23:33:09 lillypod kernel:  ? handle_bug+0x3c/0x80
Aug 03 23:33:09 lillypod kernel:  ? exc_invalid_op+0x17/0x70
Aug 03 23:33:09 lillypod kernel:  ? asm_exc_invalid_op+0x1a/0x20
Aug 03 23:33:09 lillypod kernel:  ? nv_set_system_power_state+0x40d/0x470 [nvidia 0a57c395b1423c4cb77f02e496bb79cca2561369]
Aug 03 23:33:09 lillypod kernel:  nv_procfs_write_suspend+0xe1/0x160 [nvidia 0a57c395b1423c4cb77f02e496bb79cca2561369]
Aug 03 23:33:09 lillypod kernel:  proc_reg_write+0x5a/0xa0
-------------------------------- SNIP ------------------------------------------

Gunther-Schulz avatar Aug 08 '24 08:08 Gunther-Schulz

I seem to have similar sleep issues. I found this and am wondering if anyone has tried it: https://gist.github.com/bmcbm/375f14eaa17f88756b4bdbbebbcfd029

If I keep GPU usage low when I sleep it seems to be ok as well, but this other sleep stuff might be getting in the way.

josefwells avatar Aug 25 '24 01:08 josefwells

Count me in.

Happens every time on suspend when using 4070S + Linux 6.10 and NVIDIA driver 560.35.03:

kernel-backtrace.txt

At first I thought it was specific to the proprietary driver but nope, affects the open source driver as well.

And trying to disable nvidia-sleep.sh results in a system unable to resume (the screen doesn't turn on).

birdie-github avatar Sep 09 '24 12:09 birdie-github

Suggestions in the other bug helped me. (Arch BTW) Enabling nvidia-* services (suspend, resume, hibernate) and adding the options to the modprobe.d file.

I'm using nvidia-open, which I read is what Nvidia recommends for 555.* and later.

4080 super, desktop.

josefwells avatar Sep 09 '24 13:09 josefwells

Suggestions in the other bug helped me.

Which ones?

(Arch BTW) Enabling nvidia-* services (suspend, resume, hibernate)

Already enabled.

and adding the options to the modprobe.d file.

Which ones?

birdie-github avatar Sep 09 '24 19:09 birdie-github

I have the same error.

Brensom avatar Sep 09 '24 20:09 Brensom

Suggestions in the other bug helped me.

Which ones?

These.

(Arch BTW) Enabling nvidia-* services (suspend, resume, hibernate)

Already enabled.

and adding the options to the modprobe.d file.

Which ones?

options nvidia-drm fbdev=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp

Now I am pretty sure that nvidia-drm is both wrong (nvidia_drm) and not needed, but also that it "works".

The others may have helped or it may just have been enabling the nvidia-* services. Sounds like you are seeing issues, so trying the additional options might help out.
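To tell whether the options were actually picked up after a reboot, rather than guessing from behaviour, something like the following could be used. It is only a sketch: it assumes the driver lists its registry keys in /proc/driver/nvidia/params as "Name: value" lines (without the NVreg_ prefix) and that nvidia_drm exposes its parameters under /sys/module/nvidia_drm/parameters. The function names are made up for illustration.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_nvidia_params(text: str) -> Dict[str, str]:
    """Parse 'Name: value' lines (assumed format of /proc/driver/nvidia/params)."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


def read_if_present(path: str) -> Optional[str]:
    """Return the file contents, or None if the file is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


def live_settings() -> Dict[str, Optional[str]]:
    """Collect the settings discussed in this thread from the running system."""
    nvidia = parse_nvidia_params(read_if_present("/proc/driver/nvidia/params") or "")
    result = {
        "PreserveVideoMemoryAllocations": nvidia.get("PreserveVideoMemoryAllocations"),
        "TemporaryFilePath": nvidia.get("TemporaryFilePath"),
    }
    for name in ("modeset", "fbdev"):
        raw = read_if_present("/sys/module/nvidia_drm/parameters/" + name)
        result["nvidia_drm." + name] = raw.strip() if raw is not None else None
    return result
```

Calling live_settings() and printing the result shows None for anything that could not be read, for example when the module is not loaded.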

josefwells avatar Sep 09 '24 21:09 josefwells

fbdev=1 used to have a ton of bugs in version 555 (for instance switching back to Xorg from Linux console resulted in a dead system as the screen just turned black), but I may give it a try.

I also did not like options nvidia NVreg_PreserveVideoMemoryAllocations=1, maybe it's time to try it again.

I'm using only this:

options nvidia NVreg_EnableS0ixPowerManagement=1
options nvidia-drm modeset=1

birdie-github avatar Sep 09 '24 22:09 birdie-github

I've enabled options nvidia-drm modeset=1 fbdev=1 but that didn't help at all.

Let's try without modeset.

birdie-github avatar Sep 10 '24 09:09 birdie-github

Nothing works. Still getting a ton of kernel oopses.

birdie-github avatar Sep 17 '24 09:09 birdie-github

@aritger

Any updates? It's been quite a while.

birdie-github avatar Sep 26 '24 10:09 birdie-github

This is supposed to be addressed in our upcoming 565.xx release, but I don't know when that is scheduled to be released. Thanks for your patience, and sorry for the delays.

aritger avatar Sep 26 '24 18:09 aritger

There's no specific mention of this one in the list of fixes for the 565.57.01 beta release.

  • https://www.nvidia.com/en-us/drivers/details/233008/

But, here's hoping!

tekstryder avatar Oct 22 '24 13:10 tekstryder

This is not fixed for me in 565.57.01. The bug was filed four months ago :(

#719

@aritger any idea why the fix hasn't found its way into this beta?

birdie-github avatar Oct 22 '24 16:10 birdie-github

The fix I mentioned previously is included in 565.57.01. I suspect #719 is a different bug with a similar symptom; I'll follow up there. I apologize for the continued problems. As always, the best thing that will help is to capture a full nvidia-bug-report.log.gz, so that we have all the relevant information about the system configuration.

aritger avatar Oct 22 '24 16:10 aritger

For info. I'm on nvidia driver version 555.58.02

After a lot of trial, error and luck, I commented out the fbdev=1 line in /etc/modprobe.d/nvidia-graphics-drivers-kms.conf and resume started to work again. Don't forget to run sudo update-initramfs -u and reboot for the change to take effect.

options nvidia-drm modeset=1
#options nvidia-drm fbdev=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp
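Since several modprobe.d files can end up setting (or commenting out) the same option, it can help to list what is actually active in a given file. The sketch below is not part of any NVIDIA tooling, and active_options is a hypothetical helper name. It skips comment lines and treats "-" and "_" in module names as the same thing, which is how modprobe handles them.

```python
from collections import defaultdict


def active_options(conf_text):
    """Return {module: {param: value}} for uncommented 'options' lines."""
    options = defaultdict(dict)
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out option
        fields = line.split()
        if fields[0] != "options" or len(fields) < 3:
            continue  # not an 'options <module> <param>=...' line
        module = fields[1].replace("-", "_")  # nvidia-drm == nvidia_drm
        for item in fields[2:]:
            name, _, value = item.partition("=")
            options[module][name] = value
    return dict(options)


conf = """\
options nvidia-drm modeset=1
#options nvidia-drm fbdev=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp
"""
for module, params in active_options(conf).items():
    print(module, params)
```

With the file above, fbdev does not show up because its line is commented out, so the driver default applies.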

avoiceofreason avatar Oct 31 '24 13:10 avoiceofreason

Is the name of the module different in open vs closed modules? Are the below the closed or the open ones?

❯ lsmod | grep ^nvidia
nvidia_wmi_ec_backlight    12288  0
nvidia_uvm           4050944  0
nvidia_drm            147456  6
nvidia_modeset       1822720  4 nvidia_drm
nvidia              96923648  48 nvidia_uvm,nvidia_modeset
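As far as I know the module names are the same for both flavours, so lsmod alone does not answer this. One way to check, sketched below, is the module's license field: the open kernel modules report "Dual MIT/GPL", while the proprietary ones report "NVIDIA". Treat those two strings as my understanding rather than something documented in this thread. The helper names are made up, and modinfo inspects the module installed for the running kernel, which is normally the one that is loaded.

```python
import subprocess


def classify_license(license_field: str) -> str:
    """Map a modinfo license string to 'open', 'proprietary' or 'unknown'."""
    text = license_field.strip()
    if "GPL" in text:
        return "open"
    if text == "NVIDIA":
        return "proprietary"
    return "unknown"


def installed_nvidia_flavour() -> str:
    """Ask modinfo for the license of the installed 'nvidia' module."""
    try:
        out = subprocess.run(
            ["modinfo", "-F", "license", "nvidia"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return "unknown"  # modinfo missing or module not installed
    return classify_license(out)


print(installed_nvidia_flavour())
```

The taint flags in a kernel trace give a similar hint: a module listed as nvidia(PO) together with "[P]=PROPRIETARY_MODULE" points at the closed modules.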

I am having the same issue, and I noticed that it happens mostly after I use Chrome. I generally use Firefox and can go up to 5 days with working suspend/resume, but if I run Chrome, I get the error within 1 day of it running (mostly I'd have YouTube tabs open in it).

Here is the stack trace I have:

Feb 22 01:20:18 kernel: WARNING: CPU: 3 PID: 1243496 at /tmp/akmodsbuild.HzZLQ2gI/BUILD/nvidia-kmod-570.86.16-build/nvidia-kmod-570.86.16-x86_64/_kmod_build_6.12.9-200.fc41.x86_64/kernel/nvidia/nv.c:4561 nv_set_system_power_state+0x419/0x480 [nvidia]
Feb 22 01:20:18 kernel: Modules linked in: vhost_net vhost vhost_iotlb tap tun nft_reject_ipv4 act_csum cls_u32 sch_htb nf_nat_tftp nf_conntrack_tftp uinput rfcomm snd_seq_dummy snd_hrtimer nft_masq nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables veth bridge stp llc qrtr overlay bnep binfmt_misc snd_ctl_led snd_soc_skl_hda_dsp snd_soc_intel_sof_board_helpers snd_soc_intel_hda_dsp_common snd_sof_probes vfat fat snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component snd_soc_dmic snd_sof_pci_intel_mtl snd_sof_intel_hda_generic soundwire_intel soundwire_cadence snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci iwlmvm snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_hda_ext_core snd_soc_acpi_intel_match mac80211 soundwire_generic_allocation snd_soc_acpi soundwire_bus
Feb 22 01:20:18 kernel:  snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine libarc4 snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device processor_thermal_device_pci iTCO_wdt snd_pcm mei_gsc_proxy iwlwifi uvcvideo intel_uncore_frequency intel_pmc_bxt btusb iTCO_vendor_support uvc btrtl intel_uncore_frequency_common videobuf2_vmalloc processor_thermal_device btintel processor_thermal_wt_hint snd_timer btbcm videobuf2_memops spi_nor processor_thermal_rfim x86_pkg_temp_thermal intel_rapl_msr mtd intel_powerclamp coretemp kvm_intel videobuf2_v4l2 btmtk kvm rapl intel_cstate intel_uncore pcspkr videobuf2_common bluetooth processor_thermal_rapl snd mei_me i2c_i801 cfg80211 nvidia_wmi_ec_backlight intel_rapl_common wmi_bmof spi_intel_pci spi_intel processor_thermal_wt_req mei soundcore i2c_smbus idma64 thunderbolt processor_thermal_power_floor intel_vpu processor_thermal_mbox igen6_edac ideapad_laptop platform_profile rfkill int3403_thermal int340x_thermal_zone intel_pmc_core
Feb 22 01:20:18 kernel:  intel_vsec int3400_thermal intel_hid pmt_telemetry acpi_tad acpi_thermal_rel pmt_class sparse_keymap acpi_pad joydev auth_rpcgss scsi_dh_rdac scsi_dh_emc scsi_dh_alua kvmfr(O) sunrpc loop dm_multipath nfnetlink zram lz4hc_compress lz4_compress dm_crypt hid_sensor_hub intel_ishtp_hid xe rtsx_pci_sdmmc nvme mmc_core crct10dif_pclmul nvme_core crc32_pclmul crc32c_intel polyval_clmulni intel_ish_ipc polyval_generic hid_multitouch ghash_clmulni_intel ucsi_acpi sha512_ssse3 sha256_ssse3 sha1_ssse3 typec_ucsi rtsx_pci intel_ishtp nvme_auth drm_gpuvm typec i2c_hid_acpi i2c_hid pinctrl_meteorlake serio_raw nvidia_uvm(PO) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) amdgpu amdxcp drm_exec gpu_sched drm_suballoc_helper drm_ttm_helper i915 zfs(PO) drm_buddy video wmi i2c_algo_bit drm_display_helper cec ttm spl(O) v4l2loopback(O) videodev mc fuse i2c_dev
Feb 22 01:20:18 kernel: Unloaded tainted modules: nvidia_peermem(PO):1
Feb 22 01:20:18 kernel: Unloaded tainted modules: nvidia_peermem(PO):1
Feb 22 01:20:18 kernel: CPU: 3 UID: 0 PID: 1243496 Comm: nvidia-sleep.sh Tainted: P        W  O       6.12.9-200.fc41.x86_64 #1
Feb 22 01:20:18 kernel: Tainted: [P]=PROPRIETARY_MODULE, [W]=WARN, [O]=OOT_MODULE
Feb 22 01:20:18 kernel: Hardware name: LENOVO 83DN/INVALID, BIOS NKCN24WW 01/16/2024
Feb 22 01:20:18 kernel: RIP: 0010:nv_set_system_power_state+0x419/0x480 [nvidia]
Feb 22 01:20:18 kernel: Code: 0f eb 40 4d 8b a4 24 d0 06 00 00 4d 85 e4 74 33 49 8b bc 24 30 03 00 00 ba 01 00 00 00 89 de e8 0d c8 ff ff 89 c5 85 c0 74 d9 <0f> 0b 48 c7 c7 d0 34 47 c3 41 bc 01 00 00 00 e8 53 fe c5 dd e9 ed
Feb 22 01:20:18 kernel: RSP: 0018:ffffaa9f2b07bd20 EFLAGS: 00010206
Feb 22 01:20:18 kernel: RAX: 000000000000ffff RBX: 0000000000000001 RCX: 0000000080020001
Feb 22 01:20:18 kernel: RDX: ffff9dea1df3a6b0 RSI: 0000000000000282 RDI: ffff9dea1df3a6a8
Feb 22 01:20:18 kernel: RBP: 000000000000ffff R08: ffff9dea73ce0000 R09: 0000000080020001
Feb 22 01:20:18 kernel: R10: 0000000080020001 R11: ffff9df15efa17c0 R12: ffff9dea1df3a000
Feb 22 01:20:18 kernel: R13: ffff9dea1df3a6a8 R14: ffff9dea73ce0000 R15: ffff9dea73ce0000
Feb 22 01:20:18 kernel: FS:  00007f4cc160e740(0000) GS:ffff9df15ef80000(0000) knlGS:0000000000000000
Feb 22 01:20:18 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 22 01:20:18 kernel: CR2: 0000559a6cf0ae78 CR3: 0000000123c0c003 CR4: 0000000000f72ef0
Feb 22 01:20:18 kernel: PKRU: 55555554
Feb 22 01:20:18 kernel: Call Trace:
Feb 22 01:20:18 kernel:  <TASK>
Feb 22 01:20:18 kernel:  ? nv_set_system_power_state+0x419/0x480 [nvidia]
Feb 22 01:20:18 kernel:  ? __warn.cold+0x93/0xfa
Feb 22 01:20:18 kernel:  ? nv_set_system_power_state+0x419/0x480 [nvidia]
Feb 22 01:20:18 kernel:  ? report_bug+0xff/0x140
Feb 22 01:20:18 kernel:  ? handle_bug+0x58/0x90
Feb 22 01:20:18 kernel:  ? exc_invalid_op+0x17/0x70
Feb 22 01:20:18 kernel:  ? asm_exc_invalid_op+0x1a/0x20
Feb 22 01:20:18 kernel:  ? nv_set_system_power_state+0x419/0x480 [nvidia]
Feb 22 01:20:18 kernel:  nv_procfs_write_suspend+0x105/0x1b0 [nvidia]
Feb 22 01:20:18 kernel:  proc_reg_write+0x57/0xa0
Feb 22 01:20:18 kernel:  vfs_write+0xf5/0x450
Feb 22 01:20:18 kernel:  ksys_write+0x6d/0xf0
Feb 22 01:20:18 kernel:  do_syscall_64+0x82/0x160
Feb 22 01:20:18 kernel:  ? do_user_addr_fault+0x55a/0x7b0
Feb 22 01:20:18 kernel:  ? exc_page_fault+0x7e/0x180
Feb 22 01:20:18 kernel:  entry_SYSCALL_64_after_hwframe+0x76/0x7e
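For comparing traces like this one across driver versions (the earlier trace in this thread warns at nv_set_system_power_state+0x40d/0x470, this one at +0x419/0x480), a small parser for the "symbol+0xoffset/0xsize [module]" notation can be handy. This is a rough sketch with a made-up function name. It only understands that frame notation and nothing else about the log format.

```python
import re

# symbol+0x<offset>/0x<size>, optionally followed by "[module ...]"
FRAME_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<module>\w+)[^\]]*\])?"
)


def parse_frame(line):
    """Extract the first stack frame found in a kernel log line, or None."""
    match = FRAME_RE.search(line)
    if match is None:
        return None
    return {
        "symbol": match["symbol"],
        "offset": int(match["offset"], 16),
        "size": int(match["size"], 16),
        "module": match["module"],  # None for frames in the core kernel
    }


line = "Feb 22 01:20:18 kernel: RIP: 0010:nv_set_system_power_state+0x419/0x480 [nvidia]"
print(parse_frame(line))
```

Feeding it the output of journalctl -k -b -1 line by line and keeping only frames whose module is nvidia gives a quick summary of where the driver warned during the previous boot.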

I have added the below module options to see if I can go 7 days running Chrome:

❯ cat /etc/modprobe.d/nvidia-bug.conf 
# https://github.com/NVIDIA/open-gpu-kernel-modules/issues/662#issuecomment-2339227548
# options nvidia-drm fbdev=1
options nvidia NVreg_PreserveVideoMemoryAllocations=1
options nvidia NVreg_TemporaryFilePath=/var/tmp

EDIT: lasted less than an hour before kernel crash (different stack trace)

Will undo the module options and try disabling hardware acceleration in Chrome to see if at least it works that way. But it definitely seems to me that Chrome makes crashes more likely to trigger.

karypid avatar Feb 22 '25 13:02 karypid

I have the same issue with 570 and nothing seems to fix it

MAHBOD-85 avatar Mar 02 '25 22:03 MAHBOD-85

OK, so after reading the comments above I may have fixed the problem. My laptop is an HP Victus 15 with a muxless Nvidia GPU that has no video output. When I apply nvidia_drm.modeset=0 nvidia_drm.fbdev=0, the problem seems to disappear, or at least becomes much less frequent. fbdev is on by default in the Nvidia drivers, so you have to set this as a kernel parameter to make sure it takes effect. I can get away with turning off DRM modeset because my laptop has no output from the Nvidia GPU, but desktop users should probably keep modeset on. I think we need to look at what fbdev does wrong here, and that is out of my scope since the code for that is proprietary.

MAHBOD-85 avatar Mar 04 '25 22:03 MAHBOD-85

Dear Nvidia, This bug is unbearable and makes any system running this driver highly unstable as the Nvidia GPU's driver could crash at any moment and take down the whole system with it.

To my surprise, such a critical bug has gone unfixed for 10 months by a company as big as yours. I have been watching this issue on GitHub for the past month, and I couldn't bear to stay silent any longer: please fix it as fast as possible. The latest version of this driver is still prone to this bug, which VERY frequently crashes the GPU, making the system unusable. Since I installed this driver, not a single day has passed without a GPU crash linked to this!

Without NOUVEAU as a fallback, the GPU becomes unusable, requiring a full system restart that only resets the clock until the next inevitable crash of the whole system.

So to conclude, Fix it for heaven's sake!

StarShine1A avatar Apr 16 '25 16:04 StarShine1A

I'm returning to this issue to say that it happened again, and I think I got things wrong last time. The temporary fix was not the result of disabling fbdev but an indirect result of disabling modeset (which I had to re-enable because of a Unity game). With the temporary fix applied, every time the GPU suspended on my Optimus laptop there were no lingering processes on it, since a Wayland compositor cannot work with a GPU without DRM modesetting. When I introduced another lingering process, such as Steam with hardware acceleration disabled (with it enabled the GUI would vanish, but it would also create no lingering process on the Nvidia GPU), the crash showed up again. In short, the crash may or may not happen if you suspend the GPU with a lingering process on it. So any system with Nvidia as its primary GPU is doomed to this curse, and those with Optimus systems can only work around it by minimizing the lingering processes on their Nvidia GPU.
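To see which processes are still holding the Nvidia GPU before suspending, one option is to scan /proc for open /dev/nvidia* device files. The sketch below does only that. It is an assumption on my part that "lingering process" maps to "process with an open /dev/nvidia* handle", and the function name is invented. Run it as root to see other users' processes.

```python
import os


def pids_holding_nvidia(proc_root="/proc"):
    """Return {pid: set of /dev/nvidia* paths} for processes with open handles."""
    holders = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited or permission denied
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while scanning
            if target.startswith("/dev/nvidia"):
                holders.setdefault(int(entry), set()).add(target)
    return holders
```

Printing the result of pids_holding_nvidia() right before suspending, once with and once without the suspected application running, would show whether it keeps the device open.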

MAHBOD-85 avatar May 06 '25 07:05 MAHBOD-85

forget about what i said above, this thing is just borked. this heisenbug cannot stay for any longer so to quote the previous replier FIX IT FOR HEAVEN'S SAKE!!!!!

MAHBOD-85 avatar May 06 '25 14:05 MAHBOD-85

I was using 535.183.01 (not open) for a long time because my distro didn't offer a newer version. A few days ago I upgraded to 570.133.07 (not open), and now I'm seeing this kernel: NVRM: failed to allocate vmap() page descriptor table! problem for the first time, and finding out that this bug has existed for almost a YEAR with no fix!

When trying to sleep, I see these on the screen for a few seconds before the screens turn off:

PM: pci_pm_suspend(): nv_pmops_suspend+0x0/0x40 [nvidia] returns -5
PM: dpm_run_callback(): pci_pm_suspend+0x0/0x150 returns -5
PM: Device 0000:01:00.0 failed to suspend async: error -5
PM: Some devices failed to suspend, or early wake event detected
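The -5 in these messages is a negated errno value. On Linux, errno 5 is EIO (I/O error), so the driver's suspend callback is reporting a generic I/O failure rather than, say, "device busy". A small sketch for decoding such lines follows. The function name is made up, and the regex only covers the "returns -N" and "error -N" forms seen above.

```python
import errno
import os
import re


def explain_pm_error(line):
    """Return (errno name, description) for a 'returns -N' / 'error -N' line."""
    match = re.search(r"(?:returns|error) (-\d+)", line)
    if match is None:
        return None
    code = -int(match.group(1))  # kernel reports errors as negative errno
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


print(explain_pm_error("PM: Device 0000:01:00.0 failed to suspend async: error -5"))
```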

My laptop power button stays on and fans do not stop (if they are on at that moment). The system comes back normally though if e.g. mouse is moved.

Sleep worked a few times before this started to happen. If I remember right, the previous power event was hibernation (which worked just fine).

EDIT: For some reason hibernate still works. But I have to go back to the 535 branch. Just when I started to think that Nvidia would be mostly trouble-free now on Linux...

EDIT2: Eventually hibernate also stopped working in the same way. Had to downgrade to 535 branch.

Perkolator avatar May 08 '25 00:05 Perkolator