
Performance optimizations for AsyncImage #2795

Merged — 7 commits merged into coil-kt:main on Jan 23, 2025

Conversation

andkulikov (Contributor):

Hello, Coil maintainers!

I am from the Compose team at Google, and as part of a recent internal hackathon I looked for opportunities to improve the performance of this library. I managed to make noticeable improvements, and I hope we can work together on upstreaming them.

I wrote a simple benchmark to verify the improvements, but I am not contributing it as part of this pull request because it relies on tooling we use internally that, unfortunately, hasn't been released as a public library.

I was benchmarking this composable:

    AsyncImage(
        modifier = Modifier.fillMaxSize(),
        model = androidx.benchmark.R.drawable.logo,
        contentDescription = null,
        contentScale = ContentScale.Crop
    )

and wrote two benchmarks:
• timeToFirstPixel. Combines the initial composition, measure and draw, using this util: https://cs.android.com/androidx/platform/frameworks/support/+/androidx-main:compose/benchmark-utils/src/main/java/androidx/compose/testutils/benchmark/BenchmarkFirstExtensions.kt;l=35;drc=dcaa116fbfda77e64a319e1668056ce3b032469f
• reuse. Measures the work that needs to be performed when we reuse composables of the same structure during lazy list scrolling, using this util: https://cs.android.com/androidx/platform/frameworks/support/+/androidx-main:compose/benchmark-utils/src/main/java/androidx/compose/testutils/benchmark/BenchmarksExtensions.kt;l=435;drc=86a86e1870faa9f1b475d54e37199f0fdb35f115

The numbers I got with this change:
• timeToFirstPixel improved by 25% in timeNs and 35% in allocations.
• reuse improved by 40% in timeNs and 48% in allocations.

Main changes are:
• The coroutines logic in AsyncImagePainter is simplified. The previous implementation added a lot of unnecessary overhead, since every coroutine launch costs extra work and allocations. I reimplemented it without subscribing to two flows in order to restart the job when the input changes or a restart is manually requested. From my testing, the observed behavior didn't change and we still do the same work in the same order. I hope that holds for all possible corner cases; please let me know if you think I missed something important.
• The flow observation in ConstraintsSizeResolver was replaced with a simpler manual coroutine resumption once we have the size. Even though it is conceptually natural to use a flow for such a use case, this change also showed a noticeable improvement in the benchmark. I wish common coroutine constructs like this one didn't carry so much performance overhead.
• The code is now better optimized for the reuse case. The key observation is that Modifier.Node objects are reused, but everything remembered in composition is recreated. To save a lot of allocations, I restructured the code so that instead of creating the AsyncImagePainter in composition, we create it once during Modifier.Node creation and don't re-allocate it on reuse. Unfortunately, I wasn't able to apply the same optimization to SubcomposeAsyncImage, because there the painter is exposed in the public API used in composition.
• All the modifiers are combined into one, which is more efficient and enables smarter invalidation. We no longer add ConstraintsSizeResolver as a separate modifier; instead we call it right from our main modifier. The semantics modifier is folded into the main modifier as well. We also no longer add a clipToBounds() modifier, since clipping manually as part of drawing is more efficient than introducing an extra GraphicsLayer (RenderNode) for it.
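The reuse-oriented restructuring above can be illustrated with a minimal, dependency-free sketch. Painter and ImageNode are hypothetical stand-ins (not Coil's real types) for AsyncImagePainter and the combined Modifier.Node: because the painter is a constructor-time field of the node, a reused node keeps the same instance and only updates its input, so reuse allocates nothing.

```kotlin
// Hypothetical sketch of allocating state at node creation instead of in
// composition. A reused node keeps its painter and only updates the input.
var painterAllocations = 0

class Painter(var input: String) {
    init { painterAllocations++ }
}

class ImageNode(input: String) {
    // Allocated once, when the node is created -- not on every composition.
    val painter = Painter(input)

    // On lazy-list reuse we only push the new input into the existing painter.
    fun update(newInput: String) {
        painter.input = newInput
    }
}
```

Under this sketch's assumptions, creating one node and calling update() repeatedly performs exactly one painter allocation, which is the effect the change aims for during lazy-list reuse.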

Full benchmark results:

Before
CoilBenchmark.reuse
timeNs            min 3,519,301.2,   median 3,566,884.4,   max 4,548,294.8
instructions      min   709,144.6,   median   710,586.8,   max   831,324.2
l1DMisses         min    22,150.8,   median    22,255.0,   max    26,290.9
branchMisses      min    20,426.8,   median    20,503.2,   max    35,141.8
allocationCount   min       420.0,   median       420.0,   max       420.5
CoilBenchmark.timeToFirstPixel
timeNs            min 3,776,971.3,   median 3,924,422.8,   max 5,283,567.1
instructions      min   733,797.9,   median   739,403.5,   max   906,632.5
l1DMisses         min    22,722.0,   median    23,084.3,   max    29,304.1
branchMisses      min    21,959.9,   median    22,170.1,   max    50,751.5
allocationCount   min       609.6,   median       609.6,   max       617.8

After
CoilBenchmark.reuse
timeNs            min 2,089,110.8,   median 2,131,318.5,   max 2,545,147.8
instructions      min   315,215.8,   median   315,790.5,   max   376,257.2
l1DMisses         min    13,759.1,   median    13,830.8,   max    15,850.1
branchMisses      min    11,412.5,   median    11,447.6,   max    19,016.3
allocationCount   min       220.0,   median       220.0,   max       221.4
CoilBenchmark.timeToFirstPixel
timeNs            min 2,868,448.7,   median 3,091,769.0,   max 4,091,858.8
instructions      min   396,472.7,   median   400,086.9,   max   603,784.3
l1DMisses         min    16,288.2,   median    16,459.6,   max    23,505.5
branchMisses      min    14,539.8,   median    14,677.9,   max    45,211.0
allocationCount   min       402.0,   median       402.2,   max       406.2

colinrtwhite (Member) left a comment:


This is awesome! Thanks so much for creating the benchmarks + investigating. Left a few comments, but I'm on board with all the changes.

            }
        }
    }.collect()
    rememberJob = scope.launchUndispatched {
Member:

I think we might still need DeferredDispatchCoroutineScope here and withDeferredDispatch below. More info here, but without it we'll dispatch when we reach the first withContext in RealImageLoader and miss returning from the memory cache synchronously.

andkulikov (Contributor Author):

Yes, I saw that PR, and after reading it I thought this simpler version might work as well, because there you mention:
"To fix this I tried using CoroutineStart.UNDISPATCHED with a StateFlow. This works for initial composition, but does not resolve synchronously from the memory cache on subsequent requests (e.g. when the AsyncImage(model) changes)."
I kept the part where we launch the coroutine with CoroutineStart.UNDISPATCHED, so the initial composition should work (and it works fine in my testing, though obviously I am not testing in the Paparazzi environment).
For subsequent requests, when the model changes, it might work fine as well, because we simply cancel the previous job manually and start it from scratch the same way as initially.
What do you think? Is there a way to test that use case with the code from this PR?
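The cancel-and-relaunch pattern described here can be sketched with only the Kotlin standard library. RestartableRequest is a hypothetical stand-in, not Coil's code: startCoroutine runs the body eagerly on the caller's thread until the first real suspension, mirroring what CoroutineStart.UNDISPATCHED does in kotlinx.coroutines, so a memory-cache hit could complete synchronously.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

// Hypothetical sketch: restart() abandons the previous request and starts a
// new one that executes synchronously up to its first suspension point.
class RestartableRequest {
    val log = mutableListOf<String>()
    private var pending: Continuation<Unit>? = null

    private suspend fun execute(model: String) {
        log += "start:$model"                    // runs synchronously on the caller's thread
        suspendCoroutine<Unit> { pending = it }  // suspends, e.g. on a cache miss
        log += "done:$model"
    }

    fun restart(model: String) {
        pending = null  // abandon the previous job (a real impl would cancel it)
        // startCoroutine executes the body eagerly until the first suspension,
        // mirroring CoroutineStart.UNDISPATCHED.
        suspend { execute(model) }
            .startCoroutine(Continuation(EmptyCoroutineContext) { })
    }

    fun finish() {
        pending?.also { pending = null }?.resume(Unit)
    }
}
```

For example, restart("a") immediately logs "start:a" before restart() returns, and a second restart("b") simply supersedes the first request; the design question in this thread is whether anything between launch and the cache check could still trigger a dispatch.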

Member:

@andkulikov You're right, I think relaunching the image request whenever the input changes solves that problem; however, there are other issues with using CoroutineStart.UNDISPATCHED that I failed to mention. Overall there are many ways to accidentally trigger a dispatch when using CoroutineStart.UNDISPATCHED, which is why I opted to create DeferredDispatchCoroutineScope. Using withContext, collect, or other coroutine operators will call CoroutineDispatcher.isDispatchNeeded and trigger a dispatch.

The PreviewScreenshots.vector test will fail if dispatching occurs. It's currently passing as launchUndispatched overwrites the Compose main thread dispatcher with Dispatchers.Unconfined. This was done to avoid dispatching before hitting withDeferredDispatch. If I remove Dispatchers.Unconfined from launchUndispatched then the test fails. You can use ./gradlew updateDebugScreenshotTest to quickly rerun the preview screenshot tests.

The goal of DeferredDispatchCoroutineScope/withDeferredDispatch was to avoid dispatching until the coroutine context's CoroutineDispatcher changes. This way we can execute synchronously until we reach the memory cache check or - in the case of previews/tests - the whole image pipeline will execute synchronously as we set EmptyCoroutineContext for fetcherCoroutineContext/decoderCoroutineContext instead of the IO dispatcher.

andkulikov (Contributor Author):

I've tried updating launchUndispatched to not provide a dispatcher, like this: launch(start = CoroutineStart.UNDISPATCHED, block = block), and I wasn't able to break PreviewScreenshots.vector; it seems to pass.

Ok, I uploaded a new commit to start using DeferredDispatchCoroutineScope again in the same way as before. It is slightly slower again, but not dramatically:

Without DeferredDispatchCoroutineScope:
CoilBenchmark.timeToFirstPixel
timeNs            min 2,870,331.0,   median 3,061,765.3,   max 3,858,761.8
allocationCount   min       402.1,   median       402.9,   max       404.1
CoilBenchmark.reuse
timeNs            min 2,102,943.6,   median 2,139,085.8,   max 2,447,290.0
allocationCount   min       220.0,   median       220.0,   max       220.3

With DeferredDispatchCoroutineScope:
CoilBenchmark.timeToFirstPixel
timeNs            min 2,941,350.4,   median 3,085,572.9,   max 3,943,752.3
allocationCount   min       428.1,   median       428.8,   max       430.4
CoilBenchmark.reuse
timeNs            min 2,252,515.0,   median 2,286,389.6,   max 2,636,826.0
allocationCount   min       245.0,   median       245.0,   max       245.4

However, I still believe this code could be simplified further, since we no longer collect from flows in order to restart. It feels like withDeferredDispatch was needed to apply a new dispatcher after flow collection; as we don't collect from the flow anymore, we can start with this dispatcher straight away. And if I understand correctly, we already start with the needed dispatcher, since we apply Dispatchers.Unconfined inside launchUndispatched. I might be missing something, but it feels like withDeferredDispatch is always a no-op now because of that. And if we don't need withDeferredDispatch, then we don't need DeferredDispatchCoroutineScope either.

andkulikov (Contributor Author), Jan 23, 2025:

I think I understand your idea better now, and why we need DeferredDispatchCoroutineScope and DeferredDispatchCoroutineDispatcher. I am happy to keep them, but as an extra optimization we can indeed start with DeferredDispatchCoroutineDispatcher straight away from launchUndispatched, instead of first starting with Dispatchers.Unconfined and then immediately switching to DeferredDispatchCoroutineDispatcher. I uploaded this in a new commit, and the allocation numbers improved:

CoilBenchmark.timeToFirstPixel
timeNs            min 2,923,557.9,   median 3,067,062.5,   max 4,294,279.0
allocationCount   min       415.1,   median       415.1,   max       418.1
CoilBenchmark.reuse
timeNs            min 2,160,035.3,   median 2,200,086.6,   max 2,537,839.7
allocationCount   min       233.0,   median       233.0,   max       233.3

With that, I believe, the end behavior should be the same as before my changes, and we should be good to go.

Member:

@andkulikov Awesome thanks! I'll also write up a better test to verify the immediate dispatching behaviour. PreviewScreenshots.vector only happens to test this behaviour as a side effect. There are tests for DeferredDispatchCoroutineScope, but they're only unit tests and we need an integration test for this.

        replay = 1,
        onBufferOverflow = DROP_OLDEST,
    )
    private var latestConstraints: Constraints = Constraints(maxWidth = 0, maxHeight = 0)
Member:

Awesome! TIL you can get some extra performance by manually resuming the Continuation.

Comment:

Any usage of kotlinx.coroutines in general goes through atomics, which are very slow compared to manual management of continuations / control flow.
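The manual-continuation approach being discussed can be sketched with only the Kotlin standard library. SizeAwaiter is a hypothetical stand-in for ConstraintsSizeResolver: a caller suspends by stashing its Continuation in a plain field, and the layout pass resumes it once a real size arrives, with no SharedFlow (and none of its atomics) on the hot path.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.resume
import kotlin.coroutines.suspendCoroutine

// Hypothetical sketch: suspend by storing the Continuation in a field, then
// resume it manually when the size becomes available.
class SizeAwaiter {
    private var continuation: Continuation<Unit>? = null
    var latestSize: Int = 0  // stands in for latestConstraints
        private set

    suspend fun awaitSize(): Int {
        if (latestSize == 0) {
            suspendCoroutine<Unit> { continuation = it }
        }
        return latestSize  // read the freshest value after resuming
    }

    fun onSizeChanged(size: Int) {
        latestSize = size
        val old = continuation
        continuation = null
        old?.resume(Unit)  // resuming with Unit avoids boxing the size itself
    }
}
```

This matches the pattern visible in the snippets quoted in this thread: the resumed caller reads the latest stored value rather than receiving it through the continuation.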

    // we are creating painter during the modifier creation. as a result we reuse the same
    // painter object when the modifier is being reused as part of lazy layouts reuse flow.
    // it allows us to save on quite a lot of allocations during the reuse.
    val painter = AsyncImagePainter(input)
Member:

Neat idea!

        suspendCoroutine { continuation = it }
        oldContinuation?.resume(Unit)
    }
    return latestConstraints.toSize()
Comment:

This can potentially report different constraints depending on the particular order of invocations / updates to latestConstraints.
I don't think it is a problem here, but ideally you would pass the constraints to Continuation.resume. In general, I feel it is safe to avoid that here to minimize boxing; I just wanted to flag this usage.
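The alternative flagged here can be sketched as follows. ValueAwaiter is a hypothetical name, not a Coil type: the Continuation is resumed with the value itself, so the suspended caller sees exactly the value that woke it up rather than whatever is latest by the time it runs, at the cost of boxing that value.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.resume
import kotlin.coroutines.suspendCoroutine

// Hypothetical sketch: resume the continuation with the value, instead of
// resuming with Unit and re-reading a shared field afterwards.
class ValueAwaiter {
    private var continuation: Continuation<Int>? = null

    suspend fun await(): Int = suspendCoroutine { continuation = it }

    fun publish(value: Int) {
        val old = continuation
        continuation = null
        old?.resume(value)  // the caller receives this exact value
    }
}
```

The trade-off matches the comment above: this variant is race-free with respect to later updates, but boxes the value on each resume, which the Unit-resume variant avoids.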

andkulikov (Contributor Author):

Yeah, I think for this logic we always want to use the latest constraints regardless, so it should be fine.

colinrtwhite (Member) left a comment:

Huge thanks for doing this! These changes will be part of 3.1, which should be released soon.

@colinrtwhite colinrtwhite enabled auto-merge (squash) January 23, 2025 22:50
@colinrtwhite colinrtwhite merged commit d3694d3 into coil-kt:main Jan 23, 2025
10 checks passed