open-gpu-kernel-modules

Error launching Vulkan/DXVK games in Flatpak

Open YanEx13 opened this issue 1 year ago • 10 comments

NVIDIA Open GPU Kernel Modules Version

560.35.03

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Fedora Linux 40 (KDE Plasma)

Kernel Release

6.10.6-200.fc40.x86_64

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 3060 Laptop GPU

Describe the bug

When launching games via Steam/Heroic Launcher/Bottles Flatpak, they fail to start. However, in the native clients for Steam/Heroic, everything works fine. This issue occurs only with games running on Vulkan/DXVK, while games using OpenGL run without problems in Flatpak. Interestingly, if you run 'vulkaninfo', the games will start correctly until the next reboot.

To Reproduce

  1. Launch Steam/Heroic Flatpak
  2. Launch any game using Vulkan/DXVK
  3. Get an error

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

As it turned out, this issue affects not only Fedora but also Arch, specifically CachyOS. The issue is also absent if you connect the laptop to an external monitor, and there is no such problem in an X11 session. Below are links to my reports of this problem:

https://forums.developer.nvidia.com/t/error-couldn-t-switch-to-requested-monitor-resolution/291376
https://bugzilla.redhat.com/show_bug.cgi?id=2297228
https://bugzilla.rpmfusion.org/show_bug.cgi?id=7004

YanEx13, Aug 26 '24 21:08

The Steam (Valve) game launcher also has an issue with Vulkan support when the p2p driver is installed. To reproduce, install Steam and then Counter-Strike 2. Try to load the game and it says Vulkan driver support is not present. If I compile and install the Vulkan SDK, it can't communicate with the video cards.

mylesgoose, Sep 01 '24 01:09

Enabling the nvidia-persistenced daemon seems to resolve this issue:

sudo systemctl enable nvidia-persistenced --now

Additionally, you no longer need to add the following to the launch parameters to force games to use the NVIDIA GPU instead of the Intel one:

__NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only __GLX_VENDOR_LIBRARY_NAME=nvidia

However, I don't fully understand the exact purpose of this daemon. Moreover, it's not required in other desktop environments, where it's disabled by default.

YanEx13, Jan 28 '25 21:01

Hey, sorry for lack of response earlier.

> However, I don't fully understand the exact purpose of this daemon.

Much simplified, but: If a driver has no "clients", it is not loaded at all. And even if loaded, if a given GPU is not actively used by anyone, it is powered off and uninitialized by the driver. If your NV GPU is the primary, then the DE is using it and everything gets initialized at startup. But in these optimus cases, the iGPU is used and the dGPU only gets powered on when you run a game.
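For anyone who wants to watch this happen, the kernel exposes each PCI device's runtime power state in sysfs, and reading it should not wake the device. A rough sketch, not an official NVIDIA tool: gpu_runtime_status is a made-up helper name, and the default address 0000:01:00.0 is an assumption (check lspci for the address of your dGPU).

```shell
#!/usr/bin/env bash
# Print the runtime power-management status of a PCI device.
# The kernel reports values such as: active, suspended, suspending,
# resuming, error, unsupported.
# $1: PCI address (0000:01:00.0 is only an assumed default)
# $2: sysfs root, overridable so the function can be tested
gpu_runtime_status() {
    local addr="${1:-0000:01:00.0}"
    local root="${2:-/sys/bus/pci/devices}"
    local file="$root/$addr/power/runtime_status"
    if [ -r "$file" ]; then
        cat "$file"
    else
        # Device not present or attribute not readable.
        echo "unknown"
        return 1
    fi
}
```

Calling gpu_runtime_status before and after launching a game should show the dGPU going from suspended to active on a setup like the one described above, though the exact behaviour depends on the platform's power management support.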

Where nvidia-persistenced comes into play is that it is a very simple process that just takes a reference to the GPU so that the driver always keeps it initialized. Without it, starting a game can incur a bit of extra init time (roughly a second, I'd guess), and when you exit the game, it all gets deinitialized again (if nothing else is using the GPU).
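Whether something is currently holding that reference shows up as persistence mode in nvidia-smi (the Persistence-M column). A sketch for checking it from a script, assuming the persistence_mode query field prints Enabled or Disabled, one line per GPU; persistence_enabled is a made-up helper name.

```shell
#!/usr/bin/env bash
# Succeeds only if stdin has at least one non-empty line and every
# non-empty line reads exactly "Enabled".
# Intended input:
#   nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
persistence_enabled() {
    local seen=0 line
    while IFS= read -r line; do
        [ -z "$line" ] && continue   # skip blank lines
        seen=1
        [ "$line" = "Enabled" ] || return 1
    done
    [ "$seen" -eq 1 ]
}

# Example:
# nvidia-smi --query-gpu=persistence_mode --format=csv,noheader | persistence_enabled \
#     && echo "GPU is being kept initialized"
```

Keep in mind that running nvidia-smi itself initializes the GPU for the duration of the query, so this tells you about persistence mode, not about whether the GPU was powered on beforehand.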

Similarly, driving an external monitor requires the GPU to be initialized, as does running anything else that needs it (e.g. nvidia-smi -l).

Note that keeping the daemon running will likely mean somewhat worse battery life, as the GPU is now in a deep sleep mode rather than fully shut down. I don't have any numbers here, sorry, but if you test it out please come back and post them.

As for the original bug, it seems to be the case that for whatever reason flatpak+dxvk is unable to initialize the GPU. I wonder if this is a permissions issue with flatpak. From your logs, there is:

nvidia-fallback.service - Fallback to nouveau as nvidia did not load was skipped because of an unmet condition check (ConditionPathExists=!/sys/module/nvidia).

What is unclear to me is whether it is unable to initialize the GPU or the driver itself. Could you please do the following experiment:

  1. Disable nvidia-persistenced and the external monitor and make sure you can repro the original issue
  2. Run nvidia-smi once, make sure it prints the correct GPU/driver info. Make sure the list of processes at the bottom is empty (i.e. nothing is using the GPU)
  3. Run lsmod | grep nvidia and make sure that at least nvidia is loaded (maybe other nvidia_* are too). If not, try modprobe nvidia first and then lsmod again.
  4. Try the repro case
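Step 3 can also be scripted so that the check is an exact match (a plain grep nvidia also matches nvidia_drm and friends). A small sketch that reads /proc/modules, the same data lsmod presents; module_loaded is a made-up helper name.

```shell
#!/usr/bin/env bash
# Succeeds if the given kernel module appears in a /proc/modules-style listing.
# $1: module name
# $2: listing file (defaults to /proc/modules, overridable for testing)
module_loaded() {
    local name="$1"
    local listing="${2:-/proc/modules}"
    # The module name is the first whitespace-separated field of each line;
    # compare it exactly so "nvidia" does not match "nvidia_drm".
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$listing"
}

# Example:
# module_loaded nvidia || sudo modprobe nvidia
```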

mtijanic, Jan 29 '25 08:01

> please come back and post them.

Thanks for the explanation! If I run any tests, I'll be sure to report back.

> What is unclear to me is whether it is unable to initialize the GPU or the driver itself. Could you please do the following experiment:

nvidia-smi
Wed Jan 29 10:40:37 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77                 Driver Version: 565.77         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   37C    P0             22W /   85W |       2MiB /   6144MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

lsmod | grep nvidia

nvidia_drm            147456  0
nvidia_modeset       1671168  1 nvidia_drm
nvidia_uvm           3989504  0
nvidia              77516800  7 nvidia_uvm,nvidia_modeset
drm_ttm_helper         16384  2 nvidia_drm,xe
video                  81920  3 xe,i915,nvidia_modeset

I disabled the daemon, restarted the system, ran the commands, and then tried to reproduce the issue. And yes, the problem persists.

YanEx13, Jan 29 '25 09:01

Thanks. So the driver is loaded, and it's just the GPU that is off. I do not know why this would make a difference for flatpak, from its point of view it's just an open("/dev/nvidia0") or something either way.
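To rule out the simplest permissions explanation, one could check whether the device nodes are visible and accessible from inside the sandbox. A minimal sketch; node_accessible is a made-up helper, and which /dev/nvidia* nodes exist depends on the system.

```shell
#!/usr/bin/env bash
# Report whether a device node exists and is readable and writable by the
# current user. This only checks permissions; it does not open the node,
# so it should not trigger GPU initialization by itself.
node_accessible() {
    local node="$1"
    if [ ! -e "$node" ]; then
        echo "$node: missing"
        return 1
    fi
    if [ -r "$node" ] && [ -w "$node" ]; then
        echo "$node: ok"
    else
        echo "$node: no read/write access"
        return 1
    fi
}

# Example (adjust the list to the nodes present on your system):
# for n in /dev/nvidiactl /dev/nvidia0 /dev/nvidia-uvm; do node_accessible "$n"; done
```

Running the same check natively and from a shell inside the sandbox (for example one started with flatpak run --command=sh followed by the application ID) and comparing the two results would show whether flatpak hides or restricts the nodes.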

Maybe there is a watchdog that kills the process if the init takes too long, and it does take too long if the GPU is off?

One possible experiment here would be to install the proprietary (non-open) driver and run it with options nvidia NVreg_EnableGpuFirmware=0. This disables GSP, so the GPU init should be somewhat faster (since you don't have to prepare and copy over the GSP firmware image). If this works, then it's evidence it's just the extra delay that is causing the problem, and we can look at where that is checked (flatpak or dxvk) and change the threshold.
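To double-check after a reboot that the option actually took effect, the loaded driver's parameters can be read back. A small sketch, assuming the driver lists its registry parameters in /proc/driver/nvidia/params as "Name: value" lines without the NVreg_ prefix; nvidia_param is a made-up helper name.

```shell
#!/usr/bin/env bash
# Print the value of one driver registry parameter from a "Name: value" listing.
# $1: parameter name without the NVreg_ prefix, e.g. EnableGpuFirmware
# $2: listing file (defaults to /proc/driver/nvidia/params, overridable for testing)
nvidia_param() {
    local name="$1"
    local file="${2:-/proc/driver/nvidia/params}"
    # Split on ": " and match the name exactly, so EnableGpuFirmware
    # does not also match EnableGpuFirmwareLogs.
    awk -F': ' -v n="$name" '$1 == n { print $2; found = 1 } END { exit !found }' "$file"
}

# Example:
# nvidia_param EnableGpuFirmware
```

With GSP disabled as described above, I would expect this to print 0.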

mtijanic, Jan 29 '25 10:01

I switched to the proprietary driver using:

sudo sh -c 'echo "%_with_kmod_nvidia 1" > /etc/rpm/macros.nvidia-kmod'
sudo akmods --kernels $(uname -r) --rebuild

Now, the output is:

modinfo -l nvidia

NVIDIA

I disabled GSP by modifying /etc/default/grub and adding:

nvidia.NVreg_EnableGpuFirmware=0

The output is:

nvidia-smi --query-gpu=gsp.mode.current --format=csv

gsp.mode.current
[N/A]

Unfortunately, the issue still persists.

YanEx13, Jan 29 '25 11:01

Thanks for the test! Unfortunately, I'm all out of ideas. I don't know why flatpak would ever notice a difference here.

The only other suggestion I have is to link the last few posts above in the flatpak and flathub issues you linked earlier; hopefully someone who understands that side of things has better ideas. Sorry.

I'm pretty sure this is not a kernel driver bug, but we can leave this open until we get some insight into what it actually is.

mtijanic, Jan 29 '25 12:01

Honestly, I'm leaning towards this being a KDE issue since the same problem doesn't occur with GNOME (haven't tested other DEs). Also, switching from Wayland to X11 makes the issue disappear.

Thanks anyway! I'll post any updates here.

YanEx13, Jan 29 '25 12:01

I wouldn't blame this on the DE just yet. Arguably, KDE is the only one doing it right and not keeping a GPU initialized when it's not needed, thus saving power.

My bet would be that either the packaging of a particular flatpak lib is wrong, or the overall flatpak permissions are preventing it somehow. All the cases where it does work (X11, GNOME, etc.) work because something else has already initialized the GPU, just like nvidia-persistenced does.

mtijanic, Jan 29 '25 12:01

It looks like you were right: today KDE closed the report I submitted on July 20. They suggested reaching out to the Nvidia forum.

https://bugs.kde.org/show_bug.cgi?id=490561

YanEx13, Jan 29 '25 17:01