Potential Race in ImageCache/ListImageLoader
This probably causes a sneaky bug in the Mastodon app that mostly affects the emoji picker
steps to reproduce:
mastodon-android revision with the bug
devices used:
- samsung tab s6 running aosp android 14
- pixel 3a emulator on android 14 (easier to reproduce when making emojis smaller, so more are loaded at the same time, but still sometimes appears with defaults)
- create account on mstdn.social
- scroll really fast through emojis
- sometimes, chunks of emojis are missing when scrolling back up - sometimes 5, sometimes 20 or more
I'm not sure whether I found the correct cause, but I think what's happening is:
- request 1 opens editor, and then downloader is invoked
- request 1 is canceled, but still holds the editor
- an new request 2 for the same image tries to acquire the editor, but it's still held by request 1
- request 2 fails, and is added to failedRequests in ListImageLoader
- request 1 returns without writing anything to the disk key as it was canceled
I checked by logging what urls the request were for failures to acquire the editor and for cancelled downloads that just released their editor, and it seemed consistent with this. Logs look like this:
20:08:34.696 E Failed to open disk cache editor for <url>
20:08:34.698 W UrlImageLoaderRequest{desiredConfig=HARDWARE, desiredWidth=66, desiredHeight=66, uri=<url>, memoryCacheKey='HARDWARE_66_66_<url>', diskCacheKey='6301aaf66180f32e172ec210ab03eaca'}
java.lang.IllegalStateException: Another thread has this file open -- should never happen
at me.grishka.appkit.imageloader.ImageCache.getInternal(ImageCache.java:337)
at me.grishka.appkit.imageloader.ImageCache.get(ImageCache.java:272)
at me.grishka.appkit.imageloader.ListImageLoader$RunnableTask.run(ListImageLoader.java:317)
at me.grishka.appkit.imageloader.ListImageLoader.loadSingleItem(ListImageLoader.java:122)
at me.grishka.appkit.imageloader.ListImageLoader.loadRange(ListImageLoader.java:66)
at me.grishka.appkit.imageloader.ListImageLoader.loadRange(ListImageLoader.java:49)
at me.grishka.appkit.imageloader.ListImageLoaderWrapper.reloadVisibleImages(ListImageLoaderWrapper.java:227)
at me.grishka.appkit.imageloader.ListImageLoaderWrapper.onScroll(ListImageLoaderWrapper.java:259)
at me.grishka.appkit.imageloader.ListImageLoaderWrapper$2.onScrolled(ListImageLoaderWrapper.java:59)
at androidx.recyclerview.widget.RecyclerView.dispatchOnScrolled(RecyclerView.java:5688)
at androidx.recyclerview.widget.RecyclerView$ViewFlinger.run(RecyclerView.java:5871)
at android.view.Choreographer$CallbackRecord.run(Choreographer.java:1406)
at android.view.Choreographer$CallbackRecord.run(Choreographer.java:1415)
at android.view.Choreographer.doCallbacks(Choreographer.java:1015)
at android.view.Choreographer.doFrame(Choreographer.java:941)
at android.view.Choreographer$FrameDisplayEventReceiver.run(Choreographer.java:1389)
at android.os.Handler.handleCallback(Handler.java:959)
at android.os.Handler.dispatchMessage(Handler.java:100)
at android.os.Looper.loopOnce(Looper.java:232)
at android.os.Looper.loop(Looper.java:317)
at android.app.ActivityThread.main(ActivityThread.java:8705)
at java.lang.reflect.Method.invoke(Native Method)
at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:580)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:886)
20:08:34.710 E Canceled download for <url>
A fix is to only create the diskcache editor after the file has been successfully downloaded, although I don't know whether this creates ram issues though, and ideally this is fixed in ListImageLoader
See the PR draft: #18
Hmmm. I don't like the "download to memory first" approach in the PR. Files could potentially be large, and with many being downloaded concurrently, you could easily run out of memory. I feel like I forgot a synchronized somewhere.
Hey, an update, I took another look and I think I found something and I wanted to share it because it may be helpful.
Here https://github.com/grishka/appkit/blob/a69ffd1a7ceac2411bb25c540efd19a0962acb6a/appkit/src/main/java/me/grishka/appkit/imageloader/ImageCache.java#L635 the http.cancel() does not always cancel the call immediately (its not synchronous)
So removing the image from currently_loading is not correct because the disk key may still be in use. And then another request trying to load the same image fails.
However, I can only guess on what exactly happens. I think the root issue is probably creating lots of keys on disk, and deleting them again due to cancellations - scrolling through emojis on a tablet does this a lot. And then operations on the disk cache hang and things break.
I currently run with this: https://github.com/likeazir/appkit/commit/1d2d9b736413b522663f3830c49b513b3c741297 and it seems to fix everything for me. I know it's a bad idea to acquire a lock and create a file there - in practice it doesn't seem to break on my tablet/phone though. It's probably slightly more elegant to create the file just after the request has been sent with an okhttp interceptor. Might need to check how glide does it to properly fix it.
the http.cancel() does not always cancel the call immediately (its not synchronous)
Oh, interesting. That's news to me. Seeing how I would sometimes get things like NetworkOnMainThreadException if I don't call it on a background thread, I assumed that it was synchronous. Would it make sense to somehow wait for the request to be cancelled before sending a new one to the same URL, or does okhttp already handle that case correctly?
Your fix does make sense but I'd probably add a tiny abstraction layer, an interface that takes a key and returns an OutputStream, just so DiskLruCache itself doesn't have to be passed around and the error handling logic (closing and deleting things) stays in ImageCache itself and won't have to be duplicated in any custom ImageDownloaders that an app might add (that whole ImageDownloader thing exists so custom network protocols could be supported, I needed that when I was building a Telegram client with appkit ages ago).
From what I understand in the okhttp source, calling .cancel tries to immediately destroy the connection if there is one (maybe this is/was the reason for synchronous network I/O in some cases), sets a canceled flag, and the onFail callback may be run later asynchronously. Inserting a new call to the same URL in the queue should not be an issue.
I'll implement your suggestions in the next few days, and check for performance regressions with mastodon and then do a merge request :)