open-gpu-kernel-modules

dkms incorrectly builds updated kernels with the current kernel's source

Open josephtingiris opened this issue 11 months ago • 2 comments

NVIDIA Open GPU Kernel Modules Version

nvidia-open/570.86.15

Operating System and Version

Fedora 41

Kernel Release

Linux d0 6.12.15-200.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Feb 18 15:24:05 UTC 2025 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Build Command

dnf upgrade (which invokes dkms build under the running kernel)

Terminal output/Build Log

Notice that the build enters the wrong /usr/src/kernels directory (6.12.10 rather than the 6.12.15 target named in the log header), i.e.:

DKMS (dkms-3.1.5) make.log for nvidia-open/570.86.15 for kernel 6.12.15-200.fc41.x86_64 (x86_64)
Fri Feb 21 09:50:50 AM EST 2025
Cleaning build area
# command: 'make' clean
make -C src/nvidia clean
make[1]: Entering directory '/var/lib/dkms/nvidia-open/570.86.15/build/src/nvidia'
rm -f -rf _out/Linux_x86_64
make[1]: Leaving directory '/var/lib/dkms/nvidia-open/570.86.15/build/src/nvidia'
make -C src/nvidia-modeset clean
make[1]: Entering directory '/var/lib/dkms/nvidia-open/570.86.15/build/src/nvidia-modeset'
rm -f -rf _out/Linux_x86_64
make[1]: Leaving directory '/var/lib/dkms/nvidia-open/570.86.15/build/src/nvidia-modeset'
make -C kernel-open clean
make[1]: Entering directory '/var/lib/dkms/nvidia-open/570.86.15/build/kernel-open'
rm -f -r conftest
make[2]: Entering directory '/usr/src/kernels/6.12.10-200.fc41.x86_64'
make[2]: Leaving directory '/usr/src/kernels/6.12.10-200.fc41.x86_64'
make[1]: Leaving directory '/var/lib/dkms/nvidia-open/570.86.15/build/kernel-open'
...

More Info

I landed here because of nvidia-open/570.86.15 issues on Fedora 41. When dnf upgrade installs a new kernel, dkms automatically builds nvidia-open while the old kernel is still running, and nothing sets KERNEL_UNAME explicitly. The Makefile shipped with the package is used as-is, so the build picks up the running kernel's source instead of the new kernel's.

That is evident in the /var/lib/dkms/nvidia-open/*/*/*/log/make.log file(s).
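A quick way to check a given make.log for this symptom is to compare the kernel named in the log header with the /usr/src/kernels directory that make actually entered. The following is only a sketch: check_make_log is a hypothetical helper name (not part of dkms), and it assumes the log layout shown in the excerpt above.

```shell
# check_make_log is a hypothetical helper, not a dkms command.
# It assumes the make.log layout shown in the excerpt above.
check_make_log() {
    local log="$1" target entered
    # Header line: "DKMS (...) make.log for <module>/<version> for kernel <release> (<arch>)"
    target="$(sed -n 's/^DKMS .* for kernel \([^ ]*\) (.*$/\1/p' "$log" | head -n 1)"
    # First /usr/src/kernels/<release> directory that make entered
    entered="$(sed -n "s|.*Entering directory '/usr/src/kernels/\([^']*\)'.*|\1|p" "$log" | head -n 1)"
    if [ -z "$target" ] || [ -z "$entered" ]; then
        echo "unknown"
        return 2
    fi
    if [ "$target" = "$entered" ]; then
        echo "ok: built against ${entered}"
    else
        echo "mismatch: target ${target}, entered ${entered}"
        return 1
    fi
}
```

For example, `for f in /var/lib/dkms/nvidia-open/*/*/*/log/make.log; do check_make_log "$f"; done` would print one line per log.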

After the next reboot into the new kernel, the module fails to load, e.g.:

kernel: nvidia: version magic '6.12.10-200.fc41.x86_64 SMP preempt mod_unload ' should be '6.12.11-200.fc41.x86_64 SMP preempt mod_unload '
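The same mismatch can be caught before rebooting by comparing the kernel release at the start of a module's vermagic string with the kernel you intend to boot. This is a minimal sketch: vermagic_matches is a hypothetical helper name, and it assumes the kernel release is the first space-separated field of vermagic, as in the message above.

```shell
# vermagic_matches is a hypothetical helper: it succeeds only when the kernel
# release embedded in a vermagic string equals the expected release.
vermagic_matches() {
    local vermagic="$1" expected="$2"
    # The kernel release is the first space-separated field of vermagic.
    local built_for="${vermagic%% *}"
    [ "$built_for" = "$expected" ]
}

# Usage sketch (requires the module to be installed for that kernel):
#   vermagic_matches "$(modinfo -k "$KERNEL_UNAME" -F vermagic nvidia)" "$KERNEL_UNAME" \
#       || echo "nvidia module was built for a different kernel"
```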

Here's a short workaround (for Fedora) that sets KERNEL_UNAME so that nvidia-open is built correctly for the latest installed kernel.

After a dnf upgrade, a variation of the following will work as root.

# do this as root
sudo su -

# fyi, version sort installed kernels
rpm -qa kernel | sed -e 's/^kernel-//g' | sort -uV

# fyi, version sort installed dkms module/module-version and kernel/arch
dkms status | sort -uV

export CURRENT_KERNEL="$(uname -r)"; echo "CURRENT_KERNEL=${CURRENT_KERNEL}"
export LATEST_KERNEL="$(rpm -qa kernel | sed -e 's/^kernel-//g' | sort -uV | tail -1)"; echo "LATEST_KERNEL=${LATEST_KERNEL}" # same format as uname -r

# example rebuild; pick one
export KERNEL_UNAME="6.12.11-200.fc41.x86_64"
export KERNEL_UNAME=${CURRENT_KERNEL}
export KERNEL_UNAME=${LATEST_KERNEL}
echo KERNEL_UNAME=${KERNEL_UNAME}

# set proper values for dkms build, install, etc
# grep -F matches the kernel release literally (the dots are not treated as regex wildcards)
export DKMS_ARCH="$(dkms status | grep -F "${KERNEL_UNAME}," | awk -F, '{print $3}' | awk '{print $1}' | awk -F: '{print $1}')"
export DKMS_KERNEL="$(dkms status | grep -F "${KERNEL_UNAME}," | awk -F, '{print $2}' | awk '{print $1}')" # should be the same as KERNEL_UNAME
export DKMS_MODULE_VERSION="$(dkms status | grep -F "${KERNEL_UNAME}," | awk -F, '{print $1}' | awk '{print $1}')"

# manually verify values
echo DKMS_ARCH=${DKMS_ARCH}
echo DKMS_KERNEL=${DKMS_KERNEL}
echo DKMS_MODULE_VERSION=${DKMS_MODULE_VERSION}

# NOTICE! Using the LATEST_KERNEL value is easiest/safest.
# NOTICE! If you're booted with the latest kernel and the modules ARE NOT loaded, then properly rebuilding may immediately load the correct signed module and will likely reset a graphical session.
# IMPORTANT! If you're booted from the kernel you want to 'fix' then do this in a tmux, screen, or from the linux console.
KERNEL_UNAME=${DKMS_KERNEL} dkms uninstall ${DKMS_MODULE_VERSION} -k ${DKMS_KERNEL}/${DKMS_ARCH}
KERNEL_UNAME=${DKMS_KERNEL} dkms build ${DKMS_MODULE_VERSION} -k ${DKMS_KERNEL}/${DKMS_ARCH} --force
KERNEL_UNAME=${DKMS_KERNEL} dkms install ${DKMS_MODULE_VERSION} -k ${DKMS_KERNEL}/${DKMS_ARCH}
KERNEL_UNAME=${DKMS_KERNEL} dkms status ${DKMS_MODULE_VERSION} -k ${DKMS_KERNEL}/${DKMS_ARCH}

# verify the build make.log enters the correct /usr/src/kernels directory ...
less /var/lib/dkms/nvidia-open/*/${DKMS_KERNEL}/${DKMS_ARCH}/log/make.log

systemctl reboot
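The three awk pipelines in the workaround each re-run dkms status and pull out one field. As an alternative, a single status line can be split in one pass. This is a sketch only: parse_dkms_status_line is a hypothetical helper name, and it assumes the "<module>/<version>, <kernel>, <arch>: <state>" line format that the pipelines above already rely on.

```shell
# parse_dkms_status_line is a hypothetical helper. It assumes the line format
# "<module>/<version>, <kernel>, <arch>: <state>".
parse_dkms_status_line() {
    printf '%s\n' "$1" | awk -F', *' '{
        split($3, tail, ":")   # "x86_64: installed" -> tail[1] = "x86_64"
        printf "DKMS_MODULE_VERSION=%s\nDKMS_KERNEL=%s\nDKMS_ARCH=%s\n", $1, $2, tail[1]
    }'
}
```

Calling it as `parse_dkms_status_line "$(dkms status | grep -F "${KERNEL_UNAME}," | head -n 1)"` prints the three values for manual verification before running dkms build.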

Defaulting to uname -r in the Makefile seems a bit too convenient for a dynamically built kernel module, since the running kernel is not necessarily the kernel being built for.
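To illustrate why that default goes wrong during an upgrade, here is a shell sketch of the fallback as described in this issue: use KERNEL_UNAME when the caller sets it, otherwise fall back to uname -r. resolve_kernel_uname is a hypothetical name and this is not the Makefile's actual code.

```shell
# resolve_kernel_uname is a hypothetical name. It mimics a
# "use KERNEL_UNAME if set, otherwise uname -r" default.
resolve_kernel_uname() {
    printf '%s\n' "${KERNEL_UNAME:-$(uname -r)}"
}

# With no override, the result is the RUNNING kernel, which during a
# dnf upgrade is still the old one:
#   resolve_kernel_uname
# With an explicit override, the result is the intended target:
#   KERNEL_UNAME=6.12.11-200.fc41.x86_64 resolve_kernel_uname
```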

In the meantime, I'm posting the workaround here in the hope that it helps. Thanks.

josephtingiris, Feb 21 '25 16:02

FYI @josephtingiris, I "transferred"/transcribed this issue to https://github.com/NVIDIA/yum-packaging-nvidia-driver/issues/10

kmittman, Feb 21 '25 17:02

Thanks @kmittman, I have more packaging comments to add and will do so over there. That said, I think the easiest fix is a small change to the packaged kernel-open/Makefile, which appears to live in this repo.

josephtingiris, Feb 21 '25 18:02