Android: Runtime bug - ANR in com.ansca.corona.Controller - new insight from Google Play
- [x] I tried searching the Solar2D forums for a solution.
Describe the bug
The Play Store has been reporting ANRs deep inside Solar2D for years. It has been discussed in the forums, such as here, but without conclusive results.
There is a new built-in analysis of ANRs in the Play Store; perhaps it can shed light on the root cause. See below.
To Reproduce
Steps to reproduce the behavior:
- Not reproducible on demand, but some insights are discussed in the forum post linked above.
Target platform and where build was made:
- Device: onn 100071485
- OS: Android 12 (SDK 31)
- Solar2D 2023.3696
Additional context
Whole test project or build.settings: available on demand.
Similar issue:
Not sure if it's 100% related, but it definitely could be: https://discuss.cocos2d-x.org/t/is-there-any-solution-of-anr/43080.
Another interesting topic: https://issuetracker.google.com/issues/263307511. It's from a native Android app, but a bit further down there is an example project and a video showing what happens when the ANR pops up.
I have a demo, steps, and a screen recording that reproduce this ANR. @Shchvova
Use the Lua code below, or a modified Fishies sample, to reproduce the Controller.stop() ANR with Solar2D 3699:
--[[
    This is a demonstration for reproducing an ANR issue with
    Android Controller.stop()/start().
    Triggering the Android lifecycle while Lua is processing heavy
    work (as shown below) can replicate the problem.
]]
local sum = 0
local co = coroutine.create(function ()
    for i = 1, 100000 do
        sum = sum + i
        -- Busy-wait for 0.1 s of CPU time to simulate heavy Lua work.
        local start_time = os.clock()
        while os.clock() - start_time < 0.1 do
        end
        coroutine.yield(sum)
    end
end)
-- Repeat forever: each tick resumes the coroutine up to 100 times,
-- so a single timer callback keeps the runtime busy for seconds.
timer.performWithDelay(10, function()
    for i = 1, 100 do
        if co and coroutine.status(co) ~= "dead" then
            coroutine.resume(co)
        end
    end
end, 0)
The steps are as follows (using the timeline of the screen recording):
- 00:00:01 Start by testing the app as usual.
- 00:00:13 Enter the game scenes. Because of the heavy computation, there is only one frame update each at 22 and 32 seconds, which is not significant.
- 00:00:33 Exit or kill the app.
- 00:00:35 Manually restart the app from the home screen. The first re-run from Android Studio won't reproduce this ANR.
- 00:00:37 Lock the screen by pressing the power button before fully entering the game scenes. Wait for at least 5 seconds.
- 00:00:44 Unlock the screen.
- 00:00:45 When the ANR pop-up appears, use adb bugreport to retrieve the report from the phone. This reproduces the Controller.stop() ANR.
- 00:00:50 Exit or kill the app.
- 00:01:00 Manually restart the app from the home screen.
- 00:01:01 Exit to the home screen before fully entering the game scenes.
- 00:01:04 Re-enter the app.
- 00:01:06 Lock the screen by pressing the power button before fully entering the game scenes. Wait for at least 5 seconds.
- 00:01:16 Check whether the ANR pop-up appears; it may not, because the system may have killed the app. This reproduces the Controller.start() ANR (sometimes together with the Controller.stop() ANR).
The screen recording:
https://github.com/coronalabs/corona/assets/53489032/23696249-eb74-4277-b242-061214caf791
I tested 3699 with a simple patch (a draft or proof of concept) and the ANR disappeared. However, I'm not familiar with the entire locking process, so I'm not sure about potential side effects.
platform/android/sdk/src/com/ansca/corona/graphics/opengl/GLSurfaceView.java
 public void surfaceDestroyed() {
     synchronized(sGLThreadManager) {
         if (LOG_THREADS) {
             Log.i("GLThread", "surfaceDestroyed tid=" + getId());
         }
         mHasSurface = false;
         sGLThreadManager.notifyAll();
-        while((!mWaitingForSurface) && (!mExited)) {
-            try {
-                sGLThreadManager.wait();
-            } catch (InterruptedException e) {
-                Thread.currentThread().interrupt();
-            }
-        }
+//      while((!mWaitingForSurface) && (!mExited)) {
+//          try {
+//              sGLThreadManager.wait();
+//          } catch (InterruptedException e) {
+//              Thread.currentThread().interrupt();
+//          }
+//      }
     }
 }
platform/android/sdk/src/com/ansca/corona/CoronaActivity.java
 private void requestResumeCoronaRuntime() {
     ...
     // Start/resume the Corona runtime.
-    fController.start();
+    new Thread(fController::start).start();
     ...
 private void requestSuspendCoronaRuntime() {
     // Suspend the Corona runtime.
     if (fController != null) {
-        fController.stop();
+        new Thread(fController::stop).start();
     }
I hope this information is helpful. Thank you.
Interesting, @clang-clang-clang. Is there a reason why you've used coroutines for the example? Wouldn't the ANR trigger otherwise? That would explain why not everyone is seeing the high ANR rate.
I'm not quite sure this is the right solution though. With my limited knowledge of locking in Java, simply removing the wait in surfaceDestroyed() could result in possible crashes in other methods which think everything is correctly destroyed, while the "destroy" is not actually finished yet.
I did spend some of last week on the ANR issue as well, and have actually come up with a different solution by looking at the libgdx library, which also uses Android's GLSurfaceView.java. They had similar issues in the onPause state and avoid them with two techniques:
- Killing the app when onPause takes longer than 4000ms (ANR triggers at 5000ms)
- Using an optimised updateRuntimeState() where not everything is locked/synchronized all of the time, just the moments when controller is actually used.
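The first technique in the list above can be sketched in plain Java as a deadline around the pause work. This is only an illustration under stated assumptions: `PauseWatchdog`, `runWithDeadline`, and `sleepQuietly` are hypothetical names that exist in neither Solar2D nor libgdx, and the kill step is left as a comment because it needs the Android runtime.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; not part of Solar2D or libgdx. */
final class PauseWatchdog {

    /**
     * Runs pauseWork on a separate thread and waits at most timeoutMs for it.
     * Returns true if the work finished in time, false if the deadline passed
     * or the calling thread was interrupted while waiting.
     */
    static boolean runWithDeadline(Runnable pauseWork, long timeoutMs) {
        CountDownLatch done = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                pauseWork.run();
            } finally {
                done.countDown(); // signal completion even if the work throws
            }
        }, "pause-worker");
        worker.setDaemon(true); // a hung worker must not keep the JVM alive
        worker.start();
        try {
            return done.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Sleeps for ms milliseconds, restoring the interrupt flag if interrupted. */
    static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        boolean fast = runWithDeadline(() -> { }, 4000);
        boolean slow = runWithDeadline(() -> sleepQuietly(500), 50);
        System.out.println("fast pause finished in time: " + fast);
        System.out.println("slow pause finished in time: " + slow);
        // Assumption: on Android, a false result is the point where the app
        // would be killed before the 5000ms ANR threshold, for example with
        // android.os.Process.killProcess(android.os.Process.myPid()).
    }
}
```

The trade-off is that a slow pause becomes a silent process kill instead of an ANR dialog, so it lowers the reported ANR rate without removing the underlying stall.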
At the start of last weekend I integrated these solutions into my own fork of Solar2D (https://github.com/pouwelsjochem/solar2d/blob/40f0134d8ee7447373d0900257909f36fd863ba3/platform/android/sdk/src/com/ansca/corona/Controller.java#L222) and slowly rolled that out to my users. The ANR rate of this latest build decreased by roughly 90%, from ~1% to ~0.1% (over 14K sessions in the last 24 hours). I believe Vlad is going to incorporate my changes into the main repo soon.
Great news. Glad to hear there's a solution.
Since I can't reproduce the situation seen in production, I can only approximate the ANR by simulating heavy computational load. Just keeping the GLThread busy and triggering the lifecycle can cause an ANR.
As for the 'patch' (it's a draft or proof of concept), it isn't production-ready yet. I suspect that the sGLThreadManager also has a locking issue.
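To make the "busy GLThread plus lifecycle event" mechanism concrete, here is a minimal stand-alone Java model of the wait/notify handshake from the surfaceDestroyed() snippet earlier in the thread. It is a sketch, not Solar2D code: `SurfaceHandshakeDemo`, `startRenderThread`, and `blockedMillis` are hypothetical names, and the busy loop stands in for a long Lua frame. It shows that the caller of surfaceDestroyed() stays blocked for as long as the render thread takes to acknowledge.

```java
/** Hypothetical model of the surfaceDestroyed() handshake; not Solar2D code. */
final class SurfaceHandshakeDemo {
    private final Object lock = new Object(); // stands in for sGLThreadManager
    private boolean hasSurface = true;
    private boolean waitingForSurface = false;

    /** Starts a stand-in render thread that is busy for busyMs before checking the flag. */
    void startRenderThread(long busyMs) {
        Thread render = new Thread(() -> {
            long end = System.nanoTime() + busyMs * 1_000_000L;
            while (System.nanoTime() < end) {
                // spin: simulates a long-running Lua frame
            }
            synchronized (lock) {
                try {
                    while (hasSurface) {
                        lock.wait(); // nothing to acknowledge yet
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                waitingForSurface = true; // acknowledge the lost surface
                lock.notifyAll();
            }
        }, "render");
        render.setDaemon(true);
        render.start();
    }

    /** Stand-in for the UI-thread call; returns how many ms the caller was blocked. */
    long surfaceDestroyed() {
        long start = System.nanoTime();
        synchronized (lock) {
            hasSurface = false;
            lock.notifyAll();
            try {
                while (!waitingForSurface) {
                    lock.wait(); // unbounded wait, as in the original loop
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return (System.nanoTime() - start) / 1_000_000L;
    }

    /** Runs one handshake against a render thread that is busy for busyMs. */
    static long blockedMillis(long busyMs) {
        SurfaceHandshakeDemo demo = new SurfaceHandshakeDemo();
        demo.startRenderThread(busyMs);
        return demo.surfaceDestroyed();
    }

    public static void main(String[] args) {
        System.out.println("idle render thread: blocked ~" + blockedMillis(0) + " ms");
        System.out.println("busy render thread: blocked ~" + blockedMillis(300) + " ms");
    }
}
```

In this model the blocked time tracks the render thread's busy time, so a frame that runs longer than 5 seconds would hold the calling thread past the ANR threshold.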