sof icon indicating copy to clipboard operation
sof copied to clipboard

[BUG] DSP panic when testing multiple-pause-resume on LNL/MTL-NOCODEC

Open keqiaozhang opened this issue 2 years ago • 6 comments

Describe the bug Observed this issue on LNL-NOCODEC platform, we had some similar DSP panic issues before, but this one seems to be a new one. The reproduction rate is about 50%.

dmesg

[ 7757.883706] kernel: snd_sof:sof_ipc4_set_pipeline_state: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc4 set pipeline instance 0 state 4
[ 7757.883713] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x13000004|0x0: GLB_SET_PIPELINE_STATE
[ 7757.889677] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc rx      : 0x1b0a0000|0x0: GLB_NOTIFICATION|EXCEPTION_CAUGHT
[ 7757.889689] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ------------[ DSP dump start ]------------
[ 7757.893295] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: DSP panic!
[ 7757.894681] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state: SOF_FW_BOOT_COMPLETE (7)
[ 7757.896605] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ROM status: 0x5, ROM error: 0x0
[ 7757.898418] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ROM debug status: 0x0, ROM debug error: 0x0
[ 7757.900502] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ROM feature bit enabled
[ 7757.902203] kernel: snd_sof:sof_ipc4_find_debug_slot_offset_by_type: sof-audio-pci-intel-lnl 0000:00:1f.3: Slot type 0x4c455400 is not available in debug window
[ 7757.902205] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ------------[ DSP dump end ]------------
[ 7757.904205] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-lnl 0000:00:1f.3: fw_state change: 7 -> 8
[ 7757.904234] kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc rx done : 0x1b0a0000|0x0: GLB_NOTIFICATION|EXCEPTION_CAUGHT
[ 7758.384257] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc timed out for 0x13000004|0x0
[ 7758.389367] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: Attempting to prevent DSP from entering D3 state to preserve context
[ 7758.389370] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ------------[ IPC dump start ]------------
[ 7758.391423] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: Host IPC initiator: 0x93000004|0x0|0x0, target: 0x1b0a0000|0x0|0x0, ctl: 0x3
[ 7758.394322] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ------------[ IPC dump end ]------------
[ 7758.396364] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: IPC timeout
[ 7758.397840] kernel: sof-audio-pci-intel-lnl 0000:00:1f.3: ASoC: error at soc_dai_trigger on SSP0 Pin: -110
[ 7758.400062] kernel:  Port0: ASoC: error at dpcm_be_dai_trigger on Port0: -110
[ 7758.401650] kernel:  Port0: ASoC: trigger FE cmd: 4 failed: -110

To Reproduce ~/sof-test/test-case/multiple-pause-resume.sh -r 50

Reproduction Rate About 50%.

Environment

  1. Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
  2. Name of the topology file
    • Topology: {development/sof-lnl-nocodec.tplg}
  3. Name of the platform(s) on which the bug is observed.
    • Platform: {LNLM_RVP_NOCODEC}

dmesg.txt mtrace.txt

keqiaozhang avatar Dec 20 '23 02:12 keqiaozhang

I found another pause-resume failure with more logs (cause no panic). I don't know whether it's related but just in case:

  • #8729

marc-hb avatar Jan 11 '24 21:01 marc-hb

More pause-resume failures. There is one failure in average in every PR test run

  • https://sof-ci.01.org/softestpr/PR1150/build199/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50
  • https://sof-ci.01.org/sofpr/PR8754/build1979/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50 (#8754)
  • https://sof-ci.01.org/sofpr/PR8755/build1982/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50 (#8754)
  • https://sof-ci.01.org/sofpr/PR8722/build2097/devicetest/index.html?model=TGLU_RVP_NOCODEC-ipc4&testcase=multiple-pause-resume-50

No panic in the logs but also failed. EDIT: probably #8770

  • https://sof-ci.01.org/softestpr/PR1155/build200/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50
  • https://sof-ci.01.org/sofpr/PR8781/build2188/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=multiple-pause-resume-50

marc-hb avatar Jan 18 '24 01:01 marc-hb

A reproduction in CI daily test:37267

keqiaozhang avatar Jan 24 '24 06:01 keqiaozhang

A reproduction on MTL-NOCODEC: https://sof-ci.ostc.intel.com/#/result/planresultdetail/37340?model=TGLU_RVP_NOCODEC-ipc4&testcase=multiple-pause-resume-50.

keqiaozhang avatar Jan 25 '24 05:01 keqiaozhang

@keqiaozhang does this issue still reproduce? If so, I would like to ask for fresh logs with payloads from LNL

wszypelt avatar Feb 28 '24 10:02 wszypelt

The last time I saw this issue in CI was last week, it seems that the reproduction rate is lower than before, but I believe that this issue still exists.

Please refer to:https://sof-ci.ostc.intel.com/#/result/planresultdetail/38161?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50

keqiaozhang avatar Feb 28 '24 12:02 keqiaozhang

@fredoh9 @marc-hb Fix merged. This issue was pretty rare (mostly because other unrelated issues make the test fail, and it's also pretty rare by itself), so getting no repro doesn't really prove anything. Yet, I'm pretty sure it should be fixed now. Should we close it anyway and reopen if it somehow happens again?

tobonex avatar Apr 26 '24 08:04 tobonex

I haven't seen any such panic for weeks. That's all I know.

Fix merged

Did you mean #9020?

marc-hb avatar Apr 26 '24 19:04 marc-hb

Did you mean #9020?

Yes

tobonex avatar May 08 '24 10:05 tobonex

Should be fixed. Closing for now. Reopen if it's observed again.

tobonex avatar May 10 '24 08:05 tobonex