ARMCMx: fix cache corruption when booting from PX4 bootloader
Problem: The PX4 bootloader leaves the instruction and data caches in an inconsistent state when jumping to the application. The default __cpu_init() calls SCB_EnableDCache(), which performs invalidate-by-set/way operations. Since the stack is already in use and potentially cached, this invalidation corrupts cached stack data, causing the application to crash on return from __cpu_init() with PC loaded as 0x00000000.
Root cause sequence:
- Bootloader leaves caches in unknown state with stale data
- ChibiOS _crt0_entry sets up stack in cached DTCM RAM
- __cpu_init() executes with function prologue (push {r4, lr})
- SCB_EnableDCache() invalidates cache by set/way, zeroing stack data
- Function epilogue (pop {r4, pc}) loads corrupted PC → crash
Solution:
- Disable both ICache and DCache in crt0_v7m.S before calling __cpu_init. This ensures cache operations work on a clean slate without corrupting active stack data.
- Make __cpu_init naked for Cortex-M7 to prevent any stack usage. The CMSIS cache enable functions are static inline and will be inlined into the naked function.
Tested on CubeOrange (STM32H743) with PX4 bootloader v1.15.0
Side note:
I found this fix and explanation using Claude Code, so it might be arbitrarily wrong. However, it does seem to work.
@bugobliterator
@bugobliterator responding to your questions:
- Have you tried just invalidating the cache instead of disabling before __cpu_init is called?
Cache invalidate-by-set/way IS the problem. When you invalidate DCache while the stack is cached, it zeros out the cached stack data, causing the crash. The SCB_EnableDCache() function does invalidate+enable in one operation. By disabling first, the subsequent invalidate operates on an empty cache and can't corrupt the stack.
- Also if the disable/enable worked, do you still need the __cpu_init with naked attribute?
Yes. Even though we disable caches beforehand, the default __cpu_init() would use the stack (push/pop in function prologue/epilogue). This creates a race condition:
- Function pushes registers to uncached stack
- SCB_EnableDCache() enables cache
- Function pops - might read stale cached data
The naked attribute ensures zero stack operations during the cache state transitions. The CMSIS SCB_EnableICache() and SCB_EnableDCache() functions are __STATIC_INLINE, so they inline directly into the naked function with no overhead.
- It seems like there are three fixes here, are all of them necessary?
There are only two fixes that work together:
- crt0_v7m.S (7 lines): Disable caches before calling __cpu_init
- crt1.c (2 lines): Make __cpu_init naked + explicit return
Both are necessary and complement each other:
- Cache disable alone isn't enough - function would still use stack during cache operations
- Naked function alone isn't enough - bootloader might leave stale data in cache
The alternative would be a single ~60 line naked implementation that does everything inline (disable, invalidate, enable), but that duplicates the well-tested CMSIS cache enable logic. Our approach reuses the existing CMSIS functions while adding the minimal preconditions they need to work safely.
@julianoes you will need to do a patch against ChibiOS proper and get Giovanni to review. We can't simply leave these turned off so I am assuming they are turned on somewhere again later? Either way there are some subtleties in this code that can break a ton of things in non-obvious ways so we need to be uber-careful.
The way we try and do these things is to get them upstreamed first and then pull the change into our branch
@andyp1per Ok, I understand that this fix should go upstream first.
However, I'm not too sure this is something that upstream would want. I'd expect push back and something like "fix your bootloader" which is not wrong but unfortunately wouldn't fix the problem with current boards with already installed bootloaders out there.
All I'm trying to do here is to avoid frustration for ArduPilot users when they flash ArduPilot with a PX4 bootloader installed and suddenly they discover that their board doesn't work anymore. It's in ArduPilot's interest to "just work" with a PX4 bootloader, I would assume.
In terms of subtleties that break something, that's where testing comes in, again not something I can really help. For me "ArduPilot starts" again with this but that's the extent I tested.
I don't know what compelled you to submit an arbitrarily wrong PR (thank you for recognizing that though) instead of a good issue.
In any case, various notes:
- Hopefully I can reproduce on the board and version pair you claim, will try to test
- We do not need to make the init function naked, cache can be safely enabled after invalidation without screwing up the stack, the "race condition" is false
- It's correct that the ChibiOS SCB_EnableDCache invalidates the cache, but the CMSIS docs suggest it doesn't, I wonder if that's a ChibiOS patch
- If any bootloader actually leaves the cache in an indeterminate state (e.g. not invalidating it before enabling it) then the system state is corrupt on app start and there's nothing anybody can do, users need to update the bootloader
- If the cache has been properly used, it may need to be cleaned before it's disabled (do we pass stuff between bootloader and app in RAM anywhere?)
- If all bootloaders correctly manage the cache, then it's probably better to just test if the cache is already enabled before enabling it. ChibiOS would probably be more likely to accept such a patch too; I could propose that to them
The tricky thing here is boards using XIP so there is an explicit handoff between bootloader and main fw via RAM. That needs not to be broken so ergo needs to be tested - as do many things. I always recommend that users upload to the AP bootloader since the PX4 bootloader demonstrates many kinds of problem - that is going to be way less effort for us as a team.
This suggests we should just conditionally enable the cache. Then it will have no difference in behavior with an AP bootloader.
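To make the "conditionally enable" idea concrete, here is a minimal host-side sketch in C. It is not the ChibiOS or CMSIS code: `mock_scb_t`, `mock_enable_dcache` and `enable_dcache_if_disabled` are hypothetical names made up for illustration, and the mock only assumes that the enable routine invalidates before setting the enable bit, as discussed above. The bit positions are those of the Cortex-M7 `SCB->CCR` cache enable flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Cache enable flags in the Cortex-M7 SCB->CCR register. */
#define CCR_DC_MSK (1u << 16) /* data cache enable */
#define CCR_IC_MSK (1u << 17) /* instruction cache enable */

/* Hypothetical stand-in for the hardware register block so the
 * sketch can run on a host machine. */
typedef struct {
    uint32_t ccr;                  /* mock of SCB->CCR */
    unsigned dcache_invalidations; /* counts simulated invalidate passes */
} mock_scb_t;

/* Assumption: models an enable routine that invalidates the whole
 * data cache and then sets the enable bit. */
static void mock_enable_dcache(mock_scb_t *scb) {
    scb->dcache_invalidations++;
    scb->ccr |= CCR_DC_MSK;
}

/* Leave the cache alone if a previous stage already enabled it.
 * Returns true only when this call performed the enable. */
static bool enable_dcache_if_disabled(mock_scb_t *scb) {
    if (scb->ccr & CCR_DC_MSK) {
        return false; /* already on: no invalidate, no state change */
    }
    mock_enable_dcache(scb);
    return true;
}
```

With a bootloader that hands over with the data cache off, this behaves like an unconditional enable; with one that leaves it on, the invalidate pass is skipped entirely.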
> I don't know what compelled you to submit an arbitrarily wrong PR (thank you for recognizing that though) instead of a good issue.
Often an issue just sits there, unless someone shares the pain. Therefore, I thought to tackle it using Claude Code in order to come up with a fix/workaround.
If you have a better way to fix it, please do that instead. And if you don't like it or find it risky, then this can be closed and you'll just have to deal with ArduPilot users running into this issue, every now and then.
Again, I'm just trying to help with something that I ran into and left me scratching my head.
I was able to establish the following facts:
- The PX4 1.15.0 bootloader on Cube Orange booting AP 4.6.3 was broken before this patch (it was a real bear to get on there)
- It works after this patch
All of the rest of it is nonsense, however, and determining that has wasted non-trivial time debugging. I don't think the disclaimer is a sufficient excuse. I am personally still disappointed and think this PR and interaction approach is unprofessional and unhelpful.
Regardless, the real cause (determined with zero AI involvement) is that the PX4 bootloader either turns off or never turns on the RAM segment where AP expects the stack to be on this board (unlike the AP bootloader which I verified does). It in fact never turns on the data cache. Therefore, the push in the cache enable function doesn't do anything because the writes go nowhere, and the corresponding return pops zeros and the system dies.
Once we survive that function (as this patch avoids using the stack for it) and the data cache is on, the next few function calls work because the stack data stays in cache instead of going out to RAM, and we are evidently able to run like this until the RAM segment gets fired up and operation becomes normal. But this is clearly unstable and silly to rely on.
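As a rough illustration of the mechanism described above, here is a toy host-side model in C. It is not hardware code: `toy_mem_t` and the `toy_*` helpers are hypothetical names invented for this sketch, and it assumes a single stack word, a write-back cache that allocates on write, and a disabled RAM segment that drops writes and reads back zeros.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one word of stack RAM behind an optional write-back cache. */
typedef struct {
    bool ram_enabled;    /* is the RAM segment holding the stack turned on? */
    bool dcache_enabled; /* is the data cache on? */
    bool line_valid;     /* does the cache currently hold the word? */
    uint32_t ram_word;   /* backing storage */
    uint32_t cache_word; /* cached copy */
} toy_mem_t;

/* Store: lands in the cache if it is on, otherwise in RAM.
 * If the RAM segment is off, the write is silently lost. */
static void toy_write(toy_mem_t *m, uint32_t value) {
    if (m->dcache_enabled) {
        m->cache_word = value;
        m->line_valid = true;
    } else if (m->ram_enabled) {
        m->ram_word = value;
    }
}

/* Load: cache hit if possible, else RAM; a disabled segment reads as 0. */
static uint32_t toy_read(const toy_mem_t *m) {
    if (m->dcache_enabled && m->line_valid) {
        return m->cache_word;
    }
    return m->ram_enabled ? m->ram_word : 0u;
}

/* A function that pushes its return address, invalidates and enables
 * the cache, then pops. Returns the address it would jump back to. */
static uint32_t toy_call_enabling_cache(toy_mem_t *m, uint32_t lr) {
    toy_write(m, lr);         /* prologue: push {lr} */
    m->line_valid = false;    /* invalidate */
    m->dcache_enabled = true; /* enable */
    return toy_read(m);       /* epilogue: pop {pc} */
}

/* An ordinary call: push the return address, then pop it again. */
static uint32_t toy_call(toy_mem_t *m, uint32_t lr) {
    toy_write(m, lr);
    return toy_read(m);
}
```

In this model, `toy_call_enabling_cache` returns 0 when `ram_enabled` is false, which corresponds to the crash with PC at 0x00000000, while a following `toy_call` returns the pushed value because the word now lives only in the cache. With `ram_enabled` true, the same cache-enabling call returns the correct address, consistent with the point that enabling after invalidation does not by itself damage the stack.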
As the patch is unrelated to the cause and correct fix, I will close this PR. Please open an issue in the main AP repo that booting does not work on Cube Orange with the PX4 bootloader and link this PR, along with working instructions to flash the bootloader to test. We will address it as time permits.
@tpwrules Thanks for spending time on this.
> I am personally still disappointed and think this PR and interaction approach is unprofessional and unhelpful.
I apologize. I assume you must be frustrated by the AI slop being thrown at you. From my side I would have hoped for a response that's a bit more welcoming and grateful, as I was trying to help in an area where I have no horse in the race. I had brought up the issue with @bugobliterator who didn't have time to look at it, so I decided to give it a shot myself. You are right, creating an issue would have been a better idea.
> The PX4 1.15.0 bootloader on Cube Orange booting AP 4.6.3 was broken before this patch
Could you elaborate on how it is broken? Given it works with PX4 firmware, I'm not 100% following. I do understand that it doesn't seem to be "compatible" with ArduPilot/ChibiOS, we can agree on that.
> But this is clearly unstable and silly to rely on.
Ok, if I understand you correctly, this means that this workaround isn't "a solution", so there would need to be a better fix, if possible. And obviously, the real fix would likely be on the bootloader side.
> Please open an issue in the main AP
Ok, no worries, I will do that.
> I apologize.
Thank you, apology accepted. Thank you for the good underlying issue and the understanding of my frustration.
> Could you elaborate on how it is broken? Given it works with PX4 firmware, I'm not 100% following. I do understand that it doesn't seem to be "compatible" with ArduPilot/ChibiOS, we can agree on that.
I phrased that poorly. By "broken" I mean "the user does not have a working flight controller": it crashes on boot. And by "works" I mean that the flight controller boots and can be used.
> Ok, if I understand you correctly, this means that this workaround isn't "a solution", so there would need to be a better fix, if possible. And obviously, the real fix would likely be on the bootloader side.
I think this is actually an ArduPilot problem, it is strange for us to rely on a RAM segment which is not available out of reset. But we can discuss further on that new issue.
Here is the issue: https://github.com/ArduPilot/ardupilot/issues/31546