snd_hda_core freezes system on warm boot or causes 100% CPU usage
Describe the bug: On a warm boot, the snd_hda_core module causes the system to freeze completely. If the system does boot, a kworker uses 100% CPU.
I did some debugging using sysrq-trigger:
[ 1075.378322] Hardware name: HP HP Laptop 15s-fq4xxx/89BC, BIOS F.31 03/25/2023
[ 1075.378323] Workqueue: events sof_probe_work [snd_sof]
[ 1075.378335] RIP: 0010:snd_hdac_bus_send_cmd+0x5b/0xb0 [snd_hda_core]
[ 1075.378345] Code: a3 89 a9 d8 03 00 00 48 8b 43 20 48 83 c0 48 66 8b 00 0f b7 c0 3d ff ff 00 00 74 54 83 c0 01 48 8b 73 20 0f b6 c0 66 8b 56 4a <0f> b7 d2 39 d0 74 30 83 81 94 03 00 00 01 48 8b 93 28 03 00 00 89
[ 1075.378346] RSP: 0018:ffffc9000039bb68 EFLAGS: 00000002
[ 1075.378347] RAX: 0000000000000007 RBX: ffff88810250b030 RCX: ffff88810250b040
[ 1075.378348] RDX: 0000000000000007 RSI: ffffc90001650000 RDI: ffff88810250b4d8
[ 1075.378348] RBP: 00000000401f0005 R08: ffff88810250b430 R09: ffffc9000039bb40
[ 1075.378349] R10: ffff888121a4ec00 R11: 00000000fbef7bd5 R12: 0000000000000004
[ 1075.378350] R13: ffff88810250b4d8 R14: ffffc9000039bc7c R15: ffff88810250b4e0
[ 1075.378350] FS: 0000000000000000(0000) GS:ffff888277ac0000(0000) knlGS:0000000000000000
[ 1075.378351] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1075.378352] CR2: 00007fe5d2250000 CR3: 0000000105bd4002 CR4: 0000000000770ee0
[ 1075.378353] PKRU: 55555554
[ 1075.378353] Call Trace:
[ 1075.378355] <NMI>
[ 1075.378355] ? nmi_cpu_backtrace+0x95/0x110
[ 1075.378358] ? nmi_cpu_backtrace_handler+0xd/0x20
[ 1075.378359] ? nmi_handle+0x5b/0x150
[ 1075.378361] ? default_do_nmi+0x42/0x1e0
[ 1075.378364] ? exc_nmi+0x1a9/0x240
[ 1075.378366] ? end_repeat_nmi+0x16/0x67
[ 1075.378367] ? snd_hdac_bus_send_cmd+0x5b/0xb0 [snd_hda_core]
[ 1075.378374] ? snd_hdac_bus_send_cmd+0x5b/0xb0 [snd_hda_core]
[ 1075.378380] ? snd_hdac_bus_send_cmd+0x5b/0xb0 [snd_hda_core]
[ 1075.378386] </NMI>
[ 1075.378386] <TASK>
[ 1075.378387] snd_hdac_bus_exec_verb_unlocked+0x75/0x190 [snd_hda_core]
[ 1075.378393] snd_hdac_bus_exec_verb+0x3a/0x60 [snd_hda_core]
[ 1075.378399] hda_reg_read+0x1b7/0x250 [snd_hda_core]
[ 1075.378405] snd_hdac_regmap_read_raw+0x67/0xe0 [snd_hda_core]
[ 1075.378412] snd_hdac_device_init+0x204/0x420 [snd_hda_core]
[ 1075.378419] snd_hda_codec_device_init+0xad/0x280 [snd_hda_codec]
[ 1075.378431] ? __kmalloc_node_track_caller+0x8c/0x1a0
[ 1075.378433] hda_codec_probe_bus+0x16d/0x300 [snd_sof_intel_hda]
[ 1075.378439] ? sdw_intel_acpi_scan+0x11d/0x1e0 [snd_intel_sdw_acpi]
[ 1075.378443] hda_dsp_probe+0x462/0x760 [snd_sof_intel_hda_common]
[ 1075.378452] sof_probe_work+0x2c/0x3d0 [snd_sof]
[ 1075.378459] process_one_work+0x1d9/0x3d0
[ 1075.378462] worker_thread+0x4d/0x480
[ 1075.378464] ? __pfx_worker_thread+0x10/0x10
[ 1075.378466] kthread+0xd6/0x100
[ 1075.378469] ? __pfx_kthread+0x10/0x10
[ 1075.378470] ret_from_fork+0x29/0x50
[ 1075.378473] </TASK>
Reproduction Rate: Every warm boot
Expected behavior: I expect it to load the audio driver without crashing.
Kernel driver: It seems to use the snd_sof_pci_intel_tgl driver and reports ALC236 as its codec.
Impact: It makes the system unusable.
Name of the topology file: There is no mention of any topology file in the kernel log.
Name of the platform(s) on which the bug is observed: Ubuntu 22.04, Arch Linux, Debian 12
I don't mention a kernel because it happens on every kernel I have tried from 5.15 to 6.3.9.
I did find a workaround:
options snd-intel-dspcfg dsp_driver=1 // Switch to legacy driver
options snd-hda-intel model=dell-headset-multi
The model part is very important, without adding this model the system will still freeze.
Thanks @nicktelindert, let me move this to the Linux driver component.
@nicktelindert It sounds like the codec driver for your system is missing support for the hardware, since you need to pass the model=dell-headset-multi option. The driver should not freeze the system though, so this is pretty bad. Could you enable dynamic debug and grab a kernel dmesg when this happens:
https://thesofproject.github.io/latest/getting_started/intel_debug/suggestions.html#enable-dynamic-debug
I did find a workaround:
options snd-intel-dspcfg dsp_driver=1 // Switch to legacy driver
options snd-hda-intel model=dell-headset-multi
The model part is very important, without adding this model the system will still freeze.
That points clearly to a codec driver issue: with dsp_driver=1 the SOF driver is not used at all, and the system still freezes unless the model option is set.
@nicktelindert any updates or information to share for us to root-cause this problem?
Here is some more logging without using dsp_driver=1
Jul 16 07:02:11 jammy kernel: watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [kworker/6:2:308]
...
Jul 16 07:02:11 jammy kernel: CPU: 6 PID: 308 Comm: kworker/6:2 Tainted: G W 5.15.0-76-generic #83-Ubuntu
Jul 16 07:02:11 jammy kernel: Hardware name: HP HP Laptop 15s-fq4xxx/89BC, BIOS F.26 10/18/2022
Jul 16 07:02:11 jammy kernel: Workqueue: events sof_probe_work [snd_sof]
Jul 16 07:02:11 jammy kernel: RIP: 0010:snd_hdac_bus_send_cmd+0xdf/0x130 [snd_hda_core]
Jul 16 07:02:11 jammy kernel: Code: 1f 44 00 00 31 c0 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc 4c 89 f7 c6 07 00 0f 1f 40 00 fb 66 0f 1f 44 00 00
I am not sure about the root cause. The only thing I know is that blacklisting snd_hda_core prevents the system from freezing. As long as I never reboot the system it never freezes, so it only happens after a warm boot.
@kv2019i @ujfalusi wasn't there a similar issue recently with the iDISP codec turned off too early?
Yes, there was an issue on reboot, but this one happens during or after a 'warm boot', and I'm not sure what that means.
We did some debugging with the Ubuntu community and we figured this out:
Cold boot:
jammy kernel: snd_sof_intel_hda:hda_codec_probe: sof-audio-pci-intel-tgl 0000:00:1f.3: HDA codec #0 probed OK: response: 10ec0236
jammy kernel: snd_sof_intel_hda:request_codec_module: hdaudio ehdaudio0D0: loading codec module: hdaudio:v10EC0236r00100002a01
jammy kernel: snd_sof_intel_hda:hda_codec_probe: sof-audio-pci-intel-tgl 0000:00:1f.3: HDA codec #2 probed OK: response: 80862812
jammy kernel: snd_sof_intel_hda:request_codec_module: hdaudio ehdaudio0D2: loading codec module: hdaudio:v80862812r00100000a01
Warm boot (boot after reboot):
jammy kernel: snd_sof_intel_hda:hda_codec_probe: sof-audio-pci-intel-tgl 0000:00:1f.3: HDA codec #0 probed OK: response: fe05
<then nothing for 26 sec until>
jammy kernel: watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/2:1:92]
So it seems the HDA codec probe should return the codec's vendor/device ID (10ec0236 on a cold boot), and for some reason after a warm boot it returns fe05 instead.
When a valid ID is not returned, the system hangs on boot.
Credits to: brett hassall (brett-hassall) Reference: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2027848
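The cold/warm boot comparison above can also be checked mechanically. Below is a minimal Python sketch (not part of any SOF or ALSA tooling) that scans log lines for the `probed OK: response:` message quoted above and flags responses whose upper 16 bits are empty. The function names and the "vendor half must be non-zero" heuristic are assumptions for illustration only. The heuristic is based solely on the two cold-boot responses (10ec0236, 80862812) versus the warm-boot one (fe05).

```python
import re

# Matches the hda_codec_probe debug message quoted in the logs above.
PROBE_RE = re.compile(r"HDA codec #(\d+) probed OK: response: ([0-9a-fA-F]+)")


def parse_probe_responses(lines):
    """Return (codec_address, response) pairs found in the given log lines."""
    results = []
    for line in lines:
        match = PROBE_RE.search(line)
        if match:
            results.append((int(match.group(1)), int(match.group(2), 16)))
    return results


def looks_valid(response):
    """Heuristic (assumption): the upper 16 bits hold a real vendor value."""
    vendor = (response >> 16) & 0xFFFF
    return vendor not in (0x0000, 0xFFFF)


def suspicious_codecs(lines):
    """Return codec addresses whose probe response fails the heuristic."""
    return [addr for addr, resp in parse_probe_responses(lines)
            if not looks_valid(resp)]


if __name__ == "__main__":
    cold = [
        "jammy kernel: ... HDA codec #0 probed OK: response: 10ec0236",
        "jammy kernel: ... HDA codec #2 probed OK: response: 80862812",
    ]
    warm = ["jammy kernel: ... HDA codec #0 probed OK: response: fe05"]
    print(suspicious_codecs(cold))  # []
    print(suspicious_codecs(warm))  # [0]
```

Feeding it the output of `dmesg` after a warm boot (with dynamic debug enabled so the probe message is printed) would show whether codec #0 came back with a truncated response before the soft lockup.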
@nicktelindert is there any version of the kernel where this problem does not show? It's not clear to me if this is a regression or something that never worked on your system.
It is a regression; it seems the last kernel that worked for me was 5.12.
Wow, that's 2.5+ years old. Is there a way you can bisect and find the first patch that broke your system? It's really the first time we hear about such errors on what looks like a bog-standard HDaudio device, so our ability to provide guidance is very limited at this point.
@nicktelindert, 5.12 is kind of old, and a regression of this scale should have been caught within three years. Would it be possible to at least try to narrow the window, for example with 5.12, 5.19, 6.0, 6.4 and 6.5?
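Narrowing the window as suggested is a binary search over an ordered list of kernels. The sketch below is only an illustration of that idea in Python. `first_bad` and the `is_bad` callback are hypothetical names, and it assumes the oldest kernel in the list is known good, the newest is known bad, and the bug does not disappear again in between.

```python
def first_bad(versions, is_bad):
    """Binary search for the first version where is_bad(version) is True.

    Assumes versions are ordered oldest to newest, versions[0] is known
    good, versions[-1] is known bad, and the bug never goes away again.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the bug is already present here; look earlier
        else:
            lo = mid  # still good here; look later
    return versions[hi]
```

Here `is_bad` stands for a manual step: install the kernel, do a warm boot, and report whether the system froze. With the five kernels listed above this needs at most three test boots. `git bisect` applies the same search at commit granularity once two adjacent releases are identified.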
What is the systemd version? With 254 there is this 'Soft Reboot' feature which might or might not leave devices in a strange state?
It looks like some devices (HDA) are left enabled or in an undefined state on reboot?
Does logging out from the GUI, switching to a console and doing a reboot from there work (not initiated from your favorite DE)?
Can you try to stop the GUI session, switch to console, kill PulseAudio/PipeWire, remove the audio modules and then do a reboot?
@ujfalusi I tested more kernels and I am really certain that the problem occurs on every kernel since 5.15.
Removing the following modules before rebooting prevents it from freezing:
sudo rmmod snd_sof_pci_intel_tgl
sudo rmmod snd_sof_intel_hda_common
sudo rmmod snd_hda_intel
The systemd version I used was 249 on a working system, but the problem also occurs on Debian 12, which has version 252. The soft reboot does not seem to be the issue here.
I will do more debugging; for now this was all I had time for today.
@nicktelindert, thanks for the information. I would guess that
sudo rmmod snd_sof_pci_intel_tgl
alone would be sufficient, as it will unbind the sound card and likely trigger events on the codec side that prepare it for a clean reboot. Is it still a valid workaround to switch to the legacy driver stack before reboot, as you wrote in the issue report?
options snd-intel-dspcfg dsp_driver=1 // Switch to legacy driver
options snd-hda-intel model=dell-headset-multi
You also mentioned that model=dell-headset-multi is important and that without it the reboot does not work correctly. Is that correct?
Could it be a missing fixup in sound/pci/hda/patch_realtek.c for the device you have?
With the forced model the ALC269_FIXUP_DELL1_MIC_NO_PRESENCE will be applied.
It would likely help if you could try, for example, the 5.16 kernel, and also provide the report from alsa-info.sh as suggested in https://thesofproject.github.io/latest/getting_started/intel_debug/suggestions.html?highlight=alsa%20info#run-alsa-info
I'm not sure if there will be differences in alsa-info output if you do it with the SOF stack, legacy stack w/o dell-headset-multi and legacy stack w/ dell-headset-multi, probably not, but if you can attach these three, it might help.
Another thing to try with the SOF stack:
options snd-sof-intel-hda-common hda_model=dell-headset-multi
[ 5.247059] sof-audio-pci-intel-tgl 0000:00:1f.3: DSP detected with PCI class/subclass/prog-if info 0x040100
[ 5.247154] sof-audio-pci-intel-tgl 0000:00:1f.3: Digital mics found on Skylake+ platform, using SOF driver
[ 5.247173] sof-audio-pci-intel-tgl 0000:00:1f.3: enabling device (0000 -> 0002)
[ 5.247366] sof-audio-pci-intel-tgl 0000:00:1f.3: DSP detected with PCI class/subclass/prog-if 0x040100
[ 5.247433] sof-audio-pci-intel-tgl 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[ 5.254317] sof-audio-pci-intel-tgl 0000:00:1f.3: use msi interrupt mode
[ 5.279793] sof-audio-pci-intel-tgl 0000:00:1f.3: hda codecs found, mask 5
[ 5.279797] sof-audio-pci-intel-tgl 0000:00:1f.3: using HDA machine driver skl_hda_dsp_generic now
[ 5.279799] sof-audio-pci-intel-tgl 0000:00:1f.3: DMICs detected in NHLT tables: 2
[ 5.283683] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware info: version 2:2:0-57864
[ 5.283687] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware: ABI 3:22:1 Kernel ABI 3:23:0
[ 5.283694] sof-audio-pci-intel-tgl 0000:00:1f.3: unknown sof_ext_man header type 3 size 0x30
[ 5.381179] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware info: version 2:2:0-57864
[ 5.381187] sof-audio-pci-intel-tgl 0000:00:1f.3: Firmware: ABI 3:22:1 Kernel ABI 3:23:0
[ 5.392095] sof-audio-pci-intel-tgl 0000:00:1f.3: Topology: ABI 3:22:1 Kernel ABI 3:23:0
[ 5.392249] skl_hda_dsp_generic skl_hda_dsp_generic: ASoC: Parent card not yet available, widget card binding deferred
[ 5.410767] snd_hda_codec_realtek ehdaudio0D0: autoconfig for ALC236: line_outs=1 (0x14/0x0/0x0/0x0/0x0) type:speaker
[ 5.410771] snd_hda_codec_realtek ehdaudio0D0: speaker_outs=0 (0x0/0x0/0x0/0x0/0x0)
[ 5.410772] snd_hda_codec_realtek ehdaudio0D0: hp_outs=1 (0x21/0x0/0x0/0x0/0x0)
[ 5.410773] snd_hda_codec_realtek ehdaudio0D0: mono: mono_out=0x0
[ 5.410774] snd_hda_codec_realtek ehdaudio0D0: inputs:
[ 5.410775] snd_hda_codec_realtek ehdaudio0D0: Headset Mic=0x19
[ 5.410776] snd_hda_codec_realtek ehdaudio0D0: Headphone Mic=0x1a
[ 5.459726] snd_hda_codec_realtek ehdaudio0D0: ASoC: sink widget AIF1TX overwritten
[ 5.459733] snd_hda_codec_realtek ehdaudio0D0: ASoC: source widget AIF1RX overwritten
[ 5.459853] skl_hda_dsp_generic skl_hda_dsp_generic: ASoC: sink widget hifi3 overwritten
[ 5.459883] skl_hda_dsp_generic skl_hda_dsp_generic: ASoC: source widget Alt Analog Codec Capture overwritten
[ 5.459891] skl_hda_dsp_generic skl_hda_dsp_generic: hda_dsp_hdmi_build_controls: no PCM in topology for HDMI converter 3
[ 5.478784] input: sof-hda-dsp Headphone Mic as /devices/pci0000:00/0000:00:1f.3/skl_hda_dsp_generic/sound/card0/input15
[ 5.478834] input: sof-hda-dsp HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1f.3/skl_hda_dsp_generic/sound/card0/input16
[ 5.478872] input: sof-hda-dsp HDMI/DP,pcm=4 as /devices/pci0000:00/0000:00:1f.3/skl_hda_dsp_generic/sound/card0/input17
[ 5.478909] input: sof-hda-dsp HDMI/DP,pcm=5 as /devices/pci0000:00/0000:00:1f.3/skl_hda_dsp_generic/sound/card0/input18
@nicktelindert the last line in the log doesn't sound so good
[ 186.235105] snd_hda_codec:__snd_hda_apply_fixup: snd_hda_codec_realtek ehdaudio0D0: ALC236: Apply fix-func for (null)
Clearly a bad pointer somewhere.
@plbossart, I think it is a 'feature', or shall I say a bug, in sound/pci/hda/hda_auto_parser.c: if the fixup is found via a quirk and CONFIG_SND_DEBUG_VERBOSE is not set, then the fixup_name is set to NULL.
I wonder why we don't have ehdaudio0D0: ALC236: picked fixup for .... printed in the log.
@nicktelindert, can you share the full kernel log?