[BUG] DSP panic when testing multiple-pause-resume on LNL/MTL-NOCODEC
Describe the bug Observed this issue on LNL-NOCODEC platform, we had some similar DSP panic issues before, but this one seems to be a new one. The reproduction rate is about 50%.
dmesg
[ 7757.883706] kernel: snd_sof:sof_ipc4_set_pipeline_state: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc4 set pipeline instance 0 state 4
[ 7757.883713] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx : 0x13000004|0x0: GLB_SET_PIPELINE_STATE
[ 7757.889677] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc rx : 0x1b0a0000|0x0: GLB_NOTIFICATION|EXCEPTION_CAUGHT
[ 7757.889689] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ------------[ DSP dump start ]------------
[ 7757.893295] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: DSP panic!
[ 7757.894681] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
[ 7757.896605] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ROM status: 0x5, ROM error: 0x0
[ 7757.898418] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ROM debug status: 0x0, ROM debug error: 0x0
[ 7757.900502] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ROM feature bit enabled
[ 7757.902203] kernel: snd_sof:sof_ipc4_find_debug_slot_offset_by_type: sof-audio-pci-intel-lnl 0000:00:1f.3: Slot type 0x4c455400 is not available in debug window
[ 7757.902205] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ------------[ DSP dump end ]------------
[ 7757.904205] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state change: 7 -> 8
[ 7757.904234] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc rx done : 0x1b0a0000|0x0: GLB_NOTIFICATION|EXCEPTION_CAUGHT
[ 7758.384257] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc timed out for 0x13000004|0x0
[ 7758.389367] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: Attempting to prevent DSP from entering D3 state to preserve context
[ 7758.389370] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ------------[ IPC dump start ]------------
[ 7758.391423] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: Host IPC initiator: 0x93000004|0x0|0x0, target: 0x1b0a0000|0x0|0x0, ctl: 0x3
[ 7758.394322] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ------------[ IPC dump end ]------------
[ 7758.396364] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: IPC timeout
[ 7758.397840] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ASoC: error at soc_dai_trigger on SSP0 Pin: -110
[ 7758.400062] kernel: Port0: ASoC: error at dpcm_be_dai_trigger on Port0: -110
[ 7758.401650] kernel: Port0: ASoC: trigger FE cmd: 4 failed: -110
To Reproduce ~/sof-test/test-case/multiple-pause-resume.sh -r 50
Reproduction Rate About 50%.
Environment
- Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
- Kernel: {faf00564804b}
- SOF: {4f028c5ca34b}
- Name of the topology file
- Topology: {development/sof-lnl-nocodec.tplg}
- Name of the platform(s) on which the bug is observed.
- Platform: {LNLM_RVP_NOCODEC}
I found another pause-resume failure with more logs (cause no panic). I don't know whether it's related but just in case:
- #8729
More pause-resume failures. There is one failure in average in every PR test run
- https://sof-ci.01.org/softestpr/PR1150/build199/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50
- https://sof-ci.01.org/sofpr/PR8754/build1979/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50 (#8754)
- https://sof-ci.01.org/sofpr/PR8755/build1982/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50 (#8754)
- https://sof-ci.01.org/sofpr/PR8722/build2097/devicetest/index.html?model=TGLU_RVP_NOCODEC-ipc4&testcase=multiple-pause-resume-50
No panic in the logs but also failed. EDIT: probably #8770
- https://sof-ci.01.org/softestpr/PR1155/build200/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50
- https://sof-ci.01.org/sofpr/PR8781/build2188/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50
A reproduction in CI daily test:37267
A reproduction on MTL-NOCODEC: https://sof-ci.ostc.intel.com/#/result/planresultdetail/37340?model=TGLU_RVP_NOCODEC-ipc4&testcase=multiple-pause-resume-50.
@keqiaozhang does this issue still reproduce? If so, I would like to ask for fresh logs with payloads from LNL
The last time I saw this issue in CI was last week, it seems that the reproduction rate is lower than before, but I believe that this issue still exists.
Please refer to:https://sof-ci.ostc.intel.com/#/result/planresultdetail/38161?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50
@fredoh9 @marc-hb Fix merged. This issue was pretty rare (mostly because other unrelated issues make the test fail, and it's also pretty rare by itself), so getting no repro doesn't really prove anything. Yet, I'm pretty sure it should be fixed now. Should we close it anyway and reopen if it somehow happens again?
I haven't seen any such panic for weeks. That's all I know.
Fix merged
Did you mean #9020?
Did you mean #9020?
Yes
Should be fixed. Closing for now. Reopen if it's observed again.