appkit Potential Race in ImageCache/ListImageLoader

This probably causes a sneaky bug in the Mastodon app that mostly affects the emoji picker

steps to reproduce:

devices used:

samsung tab s6 running aosp android 14
pixel 3a emulator on android 14 (easier to reproduce when making emojis smaller, so more are loaded at the same time, but still sometimes appears with defaults)

create account on mstdn.social
scroll really fast through emojis
sometimes, chunks of emojis are missing when scrolling back up - sometimes 5, sometimes 20 or more

I'm not sure whether I found the correct cause, but I think what's happening is:

request 1 opens editor, and then downloader is invoked
request 1 is canceled, but still holds the editor
an new request 2 for the same image tries to acquire the editor, but it's still held by request 1
request 2 fails, and is added to failedRequests in ListImageLoader
request 1 returns without writing anything to the disk key as it was canceled

I checked by logging what urls the request were for failures to acquire the editor and for cancelled downloads that just released their editor, and it seemed consistent with this. Logs look like this:

20:08:34.696  E  Failed to open disk cache editor for <url>
20:08:34.698  W  UrlImageLoaderRequest{desiredConfig=HARDWARE, desiredWidth=66, desiredHeight=66, uri=<url>, memoryCacheKey='HARDWARE_66_66_<url>', diskCacheKey='6301aaf66180f32e172ec210ab03eaca'}
                 java.lang.IllegalStateException: Another thread has this file open -- should never happen
                 	at me.grishka.appkit.imageloader.ImageCache.getInternal(ImageCache.java:337)
                 	at me.grishka.appkit.imageloader.ImageCache.get(ImageCache.java:272)
                 	at me.grishka.appkit.imageloader.ListImageLoader$RunnableTask.run(ListImageLoader.java:317)
                 	at me.grishka.appkit.imageloader.ListImageLoader.loadSingleItem(ListImageLoader.java:122)
                 	at me.grishka.appkit.imageloader.ListImageLoader.loadRange(ListImageLoader.java:66)
                 	at me.grishka.appkit.imageloader.ListImageLoader.loadRange(ListImageLoader.java:49)
                 	at me.grishka.appkit.imageloader.ListImageLoaderWrapper.reloadVisibleImages(ListImageLoaderWrapper.java:227)
                 	at me.grishka.appkit.imageloader.ListImageLoaderWrapper.onScroll(ListImageLoaderWrapper.java:259)
                 	at me.grishka.appkit.imageloader.ListImageLoaderWrapper$2.onScrolled(ListImageLoaderWrapper.java:59)
                 	at androidx.recyclerview.widget.RecyclerView.dispatchOnScrolled(RecyclerView.java:5688)
                 	at androidx.recyclerview.widget.RecyclerView$ViewFlinger.run(RecyclerView.java:5871)
                 	at android.view.Choreographer$CallbackRecord.run(Choreographer.java:1406)
                 	at android.view.Choreographer$CallbackRecord.run(Choreographer.java:1415)
                 	at android.view.Choreographer.doCallbacks(Choreographer.java:1015)
                 	at android.view.Choreographer.doFrame(Choreographer.java:941)
                 	at android.view.Choreographer$FrameDisplayEventReceiver.run(Choreographer.java:1389)
                 	at android.os.Handler.handleCallback(Handler.java:959)
                 	at android.os.Handler.dispatchMessage(Handler.java:100)
                 	at android.os.Looper.loopOnce(Looper.java:232)
                 	at android.os.Looper.loop(Looper.java:317)
                 	at android.app.ActivityThread.main(ActivityThread.java:8705)
                 	at java.lang.reflect.Method.invoke(Native Method)
                 	at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:580)
                 	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:886)
20:08:34.710  E  Canceled download for <url>

A fix is to only create the diskcache editor after the file has been successfully downloaded, although I don't know whether this creates ram issues though, and ideally this is fixed in ListImageLoader

See the PR draft: #18

Nov 17 '24 20:11 likeazir

Hmmm. I don't like the "download to memory first" approach in the PR. Files could potentially be large, and with many being downloaded concurrently, you could easily run out of memory. I feel like I forgot a synchronized somewhere.

Nov 17 '24 20:11 grishka

Hey, an update, I took another look and I think I found something and I wanted to share it because it may be helpful.

Here https://github.com/grishka/appkit/blob/a69ffd1a7ceac2411bb25c540efd19a0962acb6a/appkit/src/main/java/me/grishka/appkit/imageloader/ImageCache.java#L635 the http.cancel() does not always cancel the call immediately (its not synchronous) So removing the image from currently_loading is not correct because the disk key may still be in use. And then another request trying to load the same image fails.

However, I can only guess on what exactly happens. I think the root issue is probably creating lots of keys on disk, and deleting them again due to cancellations - scrolling through emojis on a tablet does this a lot. And then operations on the disk cache hang and things break.

I currently run with this: https://github.com/likeazir/appkit/commit/1d2d9b736413b522663f3830c49b513b3c741297 and it seems to fix everything for me. I know it's a bad idea to acquire a lock and create a file there - in practice it doesn't seem to break on my tablet/phone though. It's probably slightly more elegant to create the file just after the request has been sent with an okhttp interceptor. Might need to check how glide does it to properly fix it.

Jan 19 '25 17:01 likeazir

the http.cancel() does not always cancel the call immediately (its not synchronous)

Oh, interesting. That's news to me. Seeing how I would sometimes get things like NetworkOnMainThreadException if I don't call it on a background thread, I assumed that it was synchronous. Would it make sense to somehow wait for the request to be cancelled before sending a new one to the same URL, or does okhttp already handle that case correctly?

Your fix does make sense but I'd probably add a tiny abstraction layer, an interface that takes a key and returns an OutputStream, just so DiskLruCache itself doesn't have to be passed around and the error handling logic (closing and deleting things) stays in ImageCache itself and won't have to be duplicated in any custom ImageDownloaders that an app might add (that whole ImageDownloader thing exists so custom network protocols could be supported, I needed that when I was building a Telegram client with appkit ages ago).

Jan 19 '25 18:01 grishka

From what I understand in the okhttp source, calling .cancel tries to immediately destroy the connection if there is one (maybe this is/was the reason for synchronous network I/O in some cases), sets a canceled flag, and the onFail callback may be run later asynchronously. Inserting a new call to the same URL in the queue should not be an issue.

I'll implement your suggestions in the next few days, and check for performance regressions with mastodon and then do a merge request :)

Jan 20 '25 18:01 likeazir