MMC problems on Pi3 after change to upstream driver
Describe the bug
I've got at least 2 customers who reported a crash with the latest kernels. It happens only on the Pi3. One of my customers can reproduce it by pulling the power. Within a few attempts the system runs into filesystem problems that start with the message:
sdhost-bcm2835: ... unexpected command 25 error
sdhost-bcm2835: ... previous command (25) not complete
sdhost-bcm2835: ... previous command (25) not complete
...
Followed by a bunch of filesystem related errors about journal problems and superblock updates.
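A minimal sketch for spotting these messages in a saved kernel log (for example, the text output of `dmesg`). The function name `find_sdhost_errors` and the regular expression are illustrative assumptions; the part of each message shown as `...` above is matched loosely rather than guessed. For context, command 25 is the SD/MMC multiple-block write command (WRITE_MULTIPLE_BLOCK).

```python
import re

# Matches the two message shapes quoted above. Whatever sits between the
# driver name and the message text is elided in the report, so it is
# matched loosely instead of being spelled out.
SDHOST_ERROR = re.compile(
    r"sdhost-bcm2835.*?"
    r"(unexpected command (\d+) error|previous command \((\d+)\) not complete)"
)


def find_sdhost_errors(log_text):
    """Return (line_number, command_number, line) for each matching log line."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        match = SDHOST_ERROR.search(line)
        if match:
            # Exactly one of the two numeric groups is set, depending on
            # which message shape matched.
            command = int(match.group(2) or match.group(3))
            hits.append((lineno, command, line.strip()))
    return hits
```

Calling `find_sdhost_errors(open("dmesg.txt").read())` on a saved log (the file name is hypothetical) returns one tuple per matching line, which makes it easy to check whether the MMC errors show up before the filesystem errors.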
Now, I opened a ticket about this before (#6981), which I closed myself because I thought it was a filesystem bug and I was not able to reproduce it myself (I even bought a 3A just to be sure I had the same hardware setup as my customer), so I moved on.
Recently another customer reported the same issue and it bothered me enough to start searching again. Because one of my customers is able to reproduce it 100% I started bisecting stuff. Everything was ok until 6.12.30, and after that things went wrong.
Some more digging pointed me to this post: https://forums.raspberrypi.com/viewtopic.php?t=389169
Which describes exactly the problem as I'm seeing it, except that I don't have this weird upstream/downstream mix: my kernel is downstream only.
But then I noticed the remark about the change to the upstream MMC driver, specifically this commit: https://github.com/raspberrypi/linux/commit/cb6b027dcd8c3bc50009e6c653fd84cc76f49369
Long story short: reverting this commit fixes the issue.
Steps to reproduce the behaviour
Pull the plug, and within a few attempts (normally between 1 and 5) the issue appears.
Device (s)
Raspberry Pi 3 Mod. A+
System
6.12.42
Custom-built OS with buildroot, based on the downstream defconfig with very few changes.
Logs
No response
Additional context
No response
@spockfish Since switching back isn't a solution, it would be nice to investigate the root cause.
Could you please describe a scenario, how to reproduce this issue? When does it happen (during boot, on high load, being idle, ...)? Which SD card was used?
- It happens during boot. It hits a random partition (both are EXT4), and there doesn't seem to be a particular pattern. It can be triggered by pulling the power cord and then booting up again. Within 5 attempts it fails.
- This is on a system with (virtually) no write activity to the SD card.
- Transcend Ultimate (600x) 8GB and Kensington Industrial Grade 8GB MicroSD both result in failure. Oh, and an unidentified 16GB Sandisk also failed.
#7029 is a reversion to the downstream driver until we understand this issue more.
Is there anything I can do to help out? Although I can't reproduce it myself, my customer who can reproduce it 100% is very helpful and is willing to do additional testing (and has already done some).
It would help me to understand the symptoms a bit more.
Simply pulling the plug on a running system is asking for trouble, but it's not clear whether you are seeing corruption or a transient fault. Are you saying that there is some chance (20% say) that at any boot the card might be unusable in some way, but it might work after trying again? Or that after 5-6 attempts the card is effectively broken/corrupted?
BTW, I don't see the linked forum thread as being the same issue at all - that was about a mismatched set of software.
We're talking about RoPieee, a media streamer solution for the Pi. It runs in RAM, and only writes stuff to the SD card when changing a config setting (which in this case is not what the customer is doing). The OS is built with buildroot, boots with u-boot and then starts a service (like Roon Bridge for example). The kernel is custom, based on your kernel (latest stable, 6.12.x in this case) with a very minimal set of changes on top of the standard defconfig. Think of setting a specific governor, preemptive, scheduler and disabling a few things that are enabled in the defconfig and simply not needed. So fairly standard, and certainly no fancy stuff.
Simply pulling the plug on a running system is asking for trouble, but it's not clear whether you are seeing corruption or a transient fault.
Agreed. And the logs show that there's filesystem (Ext4) corruption (which can be fixed with fsck) after the MMC unexpected command errors. That's why I originally closed the first issue: I thought it was actually a filesystem issue (I didn't look closely enough at the MMC errors). But the fact that this is reproducible (100%) and fixable (also 100%) is what makes me think this is not a FS issue, but something related to the MMC part.
BTW, I don't see the linked forum thread as being the same issue at all - that was about a mismatched set of software.
I should have been more precise in my language: the reported error (MMC unexpected command error 25) is exactly the same. That's what prompted me to look into the change to the upstream MMC driver, which then resulted in a simple "I really don't understand why, but let's try this with this simple revert".
@spockfish Could your customer please provide a dump of the Debug UART (starting from power on until the MMC errors)?
I have the suspicion that the handover from U-Boot to Linux doesn't work properly. Interestingly, the handover from U-Boot to Upstream Linux has been working for a long time.
Which DTB does the customer use for U-Boot? U-Boot ones? Downstream Linux or Upstream Linux?
Could your customer please provide a dump of the Debug UART (starting from power on until the MMC errors)?
I don't see that happen. My customer is no computer nerd whatsoever...
Which DTB does the customer use for U-Boot? U-Boot ones? Downstream Linux or Upstream Linux?
RoPieee is all downstream, including the DTB used during booting.
Thanks for all of the additional details.
BTW, I don't see the linked forum thread as being the same issue at all - that was about a mismatched set of software. ... the reported error (MMC unexpected command error 25) is exactly the same
I suppose that if the result of that mismatched software was running the upstream driver, then it could be another manifestation of the same underlying bug.
I have the suspicion that the handover from U-Boot to Linux doesn't work properly.
Having two differences - U-Boot and the power cycling - does make the issue more complicated.
First of all, I'm not aware of any clear differences between the two drivers that could account for this. It's a shame it's not the downstream driver which is problematic because that has extensive logging facilities.
A few more questions:
- Have users who experienced the problem had more luck with other cards? (I'm not trying to blame the cards, just narrow down the necessary preconditions)
- Briefly power-cycling a device could leave it in a different state to switching it off for 10 seconds or more. Software should cope with both (unless it breaks the power-on reset hardware), but in your testing how long were the power outages?
Further debugging revealed the culprit: a Linux kernel patch developed by the Raspberry Pi team.
Yes, but no.
Yes, but no.
Yeah... hmmm... reading this again that sounds like I want to blame you, which is certainly not the case. Corrected and sorry about that.
Have users who experienced the problem had more luck with other cards? (I'm not trying to blame the cards, just narrow down the necessary preconditions)
We tried 3 cards that were also known to work 'in the past' (as in: with the previous version).
Briefly power-cycling a device could leave it in a different state to switching it off for 10 seconds or more. Software should cope with both (unless it breaks the power-on reset hardware), but in your testing how long were the power outages?
For the user to be able to see the issue he needs to wait until RoPieee has fully started, which takes half a minute or so. I'll ask him how much time he leaves between power off and power on again.
What also makes this difficult is that it is rare: RoPieee has a fairly modest user base (a few thousand users), and I know of only 2 users who actually reported it. Now, the software is capable of recovering from it (reboot, which then forces an fsck), but still.
Here's a screenshot from the second user:
Thinking about why I can't reproduce this... those 2 users (and also the one that can reproduce it 100%) all have units that are in the installation phase. In that phase there actually are write operations to the SD card (configuration), and that's when the issue appears.
I'll write a test script that does a lot of write actions and see if that makes it reproducible for me as well. That would make things way easier (thinking of the U-Boot console logging for example).
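A write-load script of this kind could look roughly like the sketch below. It assumes Python is available on the target, which the thread does not confirm; the function name `stress_writes`, the file names, and the default sizes are all hypothetical. It repeatedly overwrites a small rotating set of files and calls `os.fsync` after each write so the data is actually pushed to the card instead of sitting in the page cache.

```python
import os


def stress_writes(directory, rounds=100, size_bytes=1024 * 1024):
    """Overwrite a rotating set of up to 4 files in `directory`, fsyncing each write.

    The files are removed once all rounds are done.
    Returns the total number of bytes written.
    """
    payload = os.urandom(size_bytes)
    written = 0
    for i in range(rounds):
        # Hypothetical file naming; at most four files exist at any time.
        path = os.path.join(directory, f"stress_{i % 4}.bin")
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the data out to the storage device
        written += size_bytes
    # Clean up whichever of the four files were actually created.
    for i in range(min(rounds, 4)):
        os.remove(os.path.join(directory, f"stress_{i}.bin"))
    return written
```

Pointing `directory` at a path on one of the SD card partitions and pulling the power while the loop is running would exercise the write path during the power cut, which is the condition suspected above.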
I'm trying to install on a 3B+, and after the early stages of installation I'm not getting a display on my HDMI monitor (a DELL - 4K but nothing exotic):
[ 6.148291] [drm] Initialized vc4 0.0.0 for soc:gpu on minor 0
[ 6.154319] vc4-drm soc:gpu: [drm] Cannot find any crtc or sizes
If I let it continue, at some point it rewrites/switches away from my modified cmdline.txt, so I lose UART output from the kernel. So I'm left with a blank/off display and the UART showing "Starting kernel ..." from U-Boot. Meanwhile the ACT LED flashes slowly.
Yeah :-) RoPieee disables hdmi when up and running. Remember, this is for us crazy hifi nerds who think that HDMI can interfere with sound ;-)
If you go to the webpage (ropieee.local) and to the 'advanced' tab, you can switch the update channel to 'beta'. With that SSH access is enabled (root, ropieee).
Meanwhile the ACT LED flashes slowly.
That's correct. As RoPieee is meant to be used unattended, the LED is used as an indicator that everything is OK.
Thanks, I hadn't twigged that the webpage was the only means of accessing it.
After 10 power-cycles I haven't got it to fail. Let me know if you succeed in this regard.