FirebaseRealTimeDatabase failing consistently in certain devices
[REQUIRED] Step 2: Describe your environment
- Android Studio version: Android Studio Giraffe
- Firebase Component: Real Time Database
- Component version: BOM 32.7.2 (and earlier too)
[REQUIRED] Step 3: Describe the problem
Steps to reproduce:
We have a small subset of users (<1%) who are experiencing problems when reading any information from our RTDB. These users see an infinite "loading" state that never completes, with neither an error nor a success. The interesting part is that these users experience this 100% of the time, but only on one particular device. If they switch to a different device, the information loads, so the bug is not related to their user account but to their devices. Moreover, logging out or reinstalling the app does not fix the issue, so it does not seem to be related to a stale session token or similar.
No matter what they do, the information coming from RTDB does not load (one user claims it works on WiFi but not on cellular, but we could not verify it). The rest of the API calls to our services work fine, so the device is connected to the Internet. We have also checked with some of the affected users that they do not have any ad blocker or other kind of network interceptor, and we do not see any pattern in the affected devices (different brands, models and OS versions).
Some affected devices:
- SM-G975F - Android 12
- 21081111RG - Android 13
- Mi A3 - Android 11
- 22101316UG - Android 13
- 2107113SG - Android 13
- SM-A405FN - Android 11
- SM-G996B - Android 12
- Xiaomi 11T - Android 13
The bug only reproduces on Android; the iOS version loads properly.
Initially we thought it was due to the use of an old version of Firebase libraries (still including SafetyNet as internal dependency), but after upgrading to latest BOM (32.7.2) we still see the exact same issue.
Due to the nature of the bug, we could not find any way to reproduce it ourselves. It just seems to happen on some random devices.
Relevant Code:
Debugging the code with remote logs, we can see the affected devices calling this Firebase code:
handler = query.addValueEventListener(object : ValueEventListener {
    override fun onDataChange(snapshot: DataSnapshot) {
        dataMapper(snapshot)?.let {
            value = it
        }
    }

    override fun onCancelled(error: DatabaseError) {
        errorProcessor(this@FirebaseObservable, error)
    }
})
However, no callback seems to be invoked: neither onDataChange nor onCancelled is called, and therefore the app just keeps waiting "forever".
Any idea why these devices are failing? How can we solve it?
Thanks!
I found a few problems with this issue:
- I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight.
- This issue does not seem to follow the issue template. Make sure you provide all the required information.
Hi @angelolloqui, were you able to reproduce the issue using the same type of device you mentioned? Can you share an MCVE to help us investigate the issue? If you're unable to provide one, can you try our quickstart app and see if you can reproduce the issue? Also, can you share the stacktrace? Thanks
Hi @lehcar09, thanks for replying. Trying to answer your questions:
were you able to reproduce the issue using the same type of device you mentioned? Can you share an MCVE to help us investigate the issue?
Unfortunately, as I commented in the description, we have not been able to reproduce it ourselves, not even for me to debug and understand what is going on, so I can not provide an example project (and even if I could, it would only happen in some very random devices).
can you try our quickstart app and see if you can reproduce the issue?
I do not have a reproduction device so I can not test it myself, but I will try to forward it to some of our affected users. I will try to come back with some info in a few days.
can you share the stacktrace?
There is no stack trace, because it just hangs waiting for the callback and no exception is thrown (at least not from what I can see in my monitoring tools).
Thanks for the help. I know it is super difficult to address the issue with this data, and I apologize for that, but this is all I have and I need to try to help these users. They are desperate because the bug is ruining an important part of the product.
I am still waiting for a user to test the demo app. In the meantime, I have noticed that since the update to AGP 8.0 we get this warning, which is probably unrelated but could maybe explain something...
ASM Instrumentation process wasn't able to resolve some classes, this means that
the instrumented classes might contain corrupt stack frames. Make sure the
dependencies that contain these classes are on the runtime or the provided
classpath. Otherwise, the jvm might fail to load the corrupt classes at runtime
when running in a jvm environment like unit tests.
Classes that weren't resolved:
> com.google.android.libraries.places.api.model.AddressComponents
> sun.misc.Unsafe
> com.amazon.device.messaging.ADM
> com.google.common.util.concurrent.ListenableFuture
> androidx.compose.animation.tooling.ComposeAnimatedProperty
If I inspect the dependencies, it seems that ListenableFuture is used by several Google dependencies, including transitive dependencies of Firebase Database. sun.misc.Unsafe is used by Guava, which is also a transitive dependency of several Firebase products. Could it be that those devices are somehow not finding the missing symbols at runtime and failing silently (after some kind of error handling)?
My rationale is that if Firebase uses ListenableFuture for asynchronous operations and its dependencies are not resolved properly, it might affect the execution of asynchronous callbacks, potentially leading to situations where callbacks are never called. Additionally, the fact that this occurs on specific devices suggests that there might be something unique about the environment, runtime, or configuration on these devices. This could include differences in the Java Runtime Environment that affect class loading or library resolution.
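One way to test this hypothesis on an affected device would be to probe, at runtime, whether the classes named in the warning can actually be resolved, and send the result to the remote logs. The following is only a minimal sketch: `findUnresolvableClasses` is a hypothetical helper name, not part of any Firebase or Android API, and a class resolving successfully would not by itself rule out other runtime differences.

```kotlin
// Hypothetical diagnostic helper (not a Firebase API).
// Returns the subset of classNames that the given class loader cannot resolve.
// initialize = false: the class is only looked up, its static initializers do not run.
fun findUnresolvableClasses(
    classNames: List<String>,
    loader: ClassLoader? = Thread.currentThread().contextClassLoader
): List<String> = classNames.filter { name ->
    try {
        Class.forName(name, false, loader)
        false // resolved, so it is not reported
    } catch (e: ClassNotFoundException) {
        true // class is not on the runtime classpath
    } catch (e: LinkageError) {
        true // class was found but could not be linked (e.g. NoClassDefFoundError)
    }
}
```

Calling it with names from the warning, for example `findUnresolvableClasses(listOf("com.google.common.util.concurrent.ListenableFuture", "sun.misc.Unsafe"))`, and logging the returned list would show whether any of them is really missing on the failing devices.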
I just got confirmation from one of the affected users. He tried the quickstart app and it also fails, with the same behavior: no data loads. I believe this proves that the issue is not in our app but somewhere in Firebase, the device configuration or the Internet provider. In that last regard, another user sent pictures showing that, in his case, the issue is only reproducible on the mobile network and not on WiFi. Any idea of the root cause? How can we proceed? Can I assist somehow? This is frustrating more and more of our users, and we are suffering some brand damage because we have not been able to provide them with a solution for several weeks already...
We continue to receive more reports... I think I might have found a pattern, since most affected users (I can not confirm all of them because I lack the data) are from Spain. I am not sure this is a clear indication, since our main market is Spain, but the ratio seems definitely high in this particular area. I am asking some of them for their network providers, just in case Firebase got blocked somehow.
Hey @angelolloqui, thank you so much for the well detailed information you've provided so far, and I'm sorry to hear that your users are experiencing this behavior.
Reading through the thread, it seems that the issue points toward a device limitation. I'm not sure if the issue is due to a network or region, since you've mentioned that iOS users do not exhibit any of these issues. I would suspect that iOS users would have an overlap with Android users using the same network providers, but I don't think it would hurt to investigate this route as well.
If it is due to a network issue, then it is possible that the WebSocket connection either timed out or couldn't be established. ISPs in certain countries also have a tendency to suddenly block WebSockets. Unfortunately, there isn't anything we can do if it's due to a network block.
I can only advise you to try RTDB's REST API or Cloud Firestore if you need an urgent solution. Perhaps this mitigation will help.
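As a rough illustration of the REST route, a one-off read over plain HTTPS could look like the sketch below. It relies on the REST convention of appending `.json` to the database path and, optionally, passing a Firebase ID token in the `auth` query parameter. The function names and the example database URL are hypothetical, and the sketch has not been tried against the affected devices.

```kotlin
import java.io.IOException
import java.net.HttpURLConnection
import java.net.URL
import java.net.URLEncoder

// Hypothetical helper: builds <databaseUrl>/<path>.json, optionally with ?auth=<idToken>.
fun buildRestUrl(databaseUrl: String, path: String, idToken: String? = null): String {
    val base = databaseUrl.trimEnd('/')
    val cleanPath = path.trim('/')
    val url = "$base/$cleanPath.json"
    return if (idToken == null) url else url + "?auth=" + URLEncoder.encode(idToken, "UTF-8")
}

// Hypothetical helper: blocking GET that returns the raw JSON body.
// It must be called off the main thread on Android.
fun fetchJson(restUrl: String, timeoutMs: Int = 10_000): String {
    val connection = URL(restUrl).openConnection() as HttpURLConnection
    try {
        connection.connectTimeout = timeoutMs
        connection.readTimeout = timeoutMs
        connection.requestMethod = "GET"
        val code = connection.responseCode
        if (code !in 200..299) throw IOException("HTTP $code for REST read")
        return connection.inputStream.bufferedReader().use { it.readText() }
    } finally {
        connection.disconnect()
    }
}
```

For example, `fetchJson(buildRestUrl("https://example-project.firebaseio.com", "chats/room1", idToken))` would return the JSON stored under `chats/room1`, assuming the security rules allow that token to read it.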
Without any stacktrace or minimal reproducible example, it's difficult to move the investigation forward.
In the meantime, feel free to add any more updates that you think might be helpful. It could lead us to a breakthrough.
Hi @argzdev , first thanks for responding. Following up on your points:
Without any stacktrace or minimal reproducible example, it's difficult to move the investigation forward.
I understand your comment, and it is very frustrating to me too. Just to clarify, the same behavior can be reproduced in your official Firebase quickstart example project, so the code is there, and when it reproduces it happens consistently. The problem is getting hold of a specific failing device.
I'm not sure if the issue is due to a network or region, since you've mentioned that iOS users do not exhibit any of these issues. I would suspect that iOS users would have an overlap with Android users using the same network providers, but I don't think it would hurt to investigate this route as well.
Exactly, and for that reason I introduced an "artificial" timeout of 10s (firing if there is no response in that time) on both platforms a few weeks ago, so I can count how many timeouts our users hit on each platform. As you can see from this screenshot, in the last 7 days there is a significant disparity between the two platforms, pointing to an Android-specific problem (and confirming our customer care reports, which are Android-only so far). While some iOS timeouts are recorded, they likely result from actual network constraints, not from the bug in question.
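For anyone who wants to collect the same metric, the "artificial" timeout can be sketched roughly as follows. This is not the code from our app: `ResponseWatchdog` is a hypothetical name, and the sketch assumes the timeout callback only reports the event (for example, to analytics) rather than cancelling the Firebase listener.

```kotlin
import java.util.concurrent.ScheduledExecutorService
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical helper: runs onTimeout once if markResponded() is not called within timeoutMs.
class ResponseWatchdog(
    scheduler: ScheduledExecutorService,
    timeoutMs: Long,
    onTimeout: () -> Unit
) {
    // Flipped exactly once, either by the timeout task or by the first response.
    private val settled = AtomicBoolean(false)

    private val pending = scheduler.schedule(
        Runnable { if (settled.compareAndSet(false, true)) onTimeout() },
        timeoutMs,
        TimeUnit.MILLISECONDS
    )

    // Call from onDataChange / onCancelled. Returns true if the response beat the timeout.
    fun markResponded(): Boolean {
        val first = settled.compareAndSet(false, true)
        if (first) pending.cancel(false)
        return first
    }
}
```

The watchdog would be created right before `addValueEventListener` with a 10,000 ms timeout, and `markResponded()` would be called at the top of both callbacks, so a timeout is only counted when neither callback arrives in time.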
Regarding network providers, I initially considered them the culprit too. However, tracking timeouts has revealed that the issue spans regions and carriers. Besides that, it happens only on Android, and we have users reporting that on the same WiFi network they have one device where it reproduces and another where it does not, which makes me think it is unrelated.
I can only advise you to try RTDB's REST API or Cloud Firestore if you need an urgent solution. Perhaps this mitigation will help.
That could be a workaround indeed, at least to fetch the initial data, but building this alternative loading path will require significant work, and it will not be an ideal solution since we would lose the reactive nature of the FIR sockets (we are building a chat, so reactivity is crucial). Anyway, I will keep that in mind and explore the implementation cost as a temporary workaround.
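On the reactivity concern: as far as I understand, the RTDB REST API can also stream changes as Server-Sent Events when the request sends the `Accept: text/event-stream` header, with event names such as `put`, `patch` and `keep-alive`. The sketch below only covers grouping the received lines into events. `StreamEvent` and `parseEventStream` are hypothetical names, and opening the connection, reconnecting and applying the JSON payloads are left out.

```kotlin
// Hypothetical type: one Server-Sent Event with its name and raw data payload.
data class StreamEvent(val name: String, val data: String)

// Groups "event:" / "data:" lines into events; a blank line terminates each event.
// Lines with any other prefix are ignored.
fun parseEventStream(lines: Sequence<String>): Sequence<StreamEvent> = sequence {
    var name: String? = null
    val data = StringBuilder()
    for (line in lines) {
        when {
            line.startsWith("event:") -> {
                name = line.removePrefix("event:").trim()
            }
            line.startsWith("data:") -> {
                // Multiple data lines in one event are joined with newlines.
                if (data.isNotEmpty()) data.append('\n')
                data.append(line.removePrefix("data:").trim())
            }
            line.isBlank() -> {
                val current = name
                if (current != null) yield(StreamEvent(current, data.toString()))
                name = null
                data.setLength(0)
            }
        }
    }
}
```

The lines could come from `connection.inputStream.bufferedReader().lineSequence()` on a long-lived `HttpURLConnection`, read on a background thread, with each `put` or `patch` event applied to the local chat state.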
Hey @angelolloqui, we'll keep the needs-info tag for now. Let us know if the issue is resolved on your side.
Please do keep in mind that our SDK support team does not have access to the backend or your project details. For us to verify if this is an SDK issue, we require steps to reproduce this behavior or a minimal reproducible example for us to conduct investigation on your issue.
Don't worry if the issue gets closed as stale; we can always reopen it once we have new information. Thanks!
Hey @angelolloqui. We need more information to resolve this issue but there hasn't been an update in 5 weekdays. I'm marking the issue as stale and if there are no new updates in the next 5 days I will close it automatically.
If you have more information that will help us get to the bottom of this, just add a comment!
Hey @lehcar09, unfortunately I am unable to provide you with more details. The example project is not needed since it also happens in your own Firebase example project, your own code. The problem is that it is only reproducible on certain devices, and I never managed to get my hands on one. All I can tell is that my users are complaining about this, and when I asked them to install your example app and try Firebase from there, they also got the same issue. It seems to be something device or network related.
We received multiple reports of this today (Firebase RTDB Android SDK + WiFi), users both in California as well as Canada.