
Performance optimizations for AsyncImage #2795

Merged — 7 commits merged into coil-kt:main on Jan 23, 2025

Conversation

andkulikov (Contributor):

Hello, Coil maintainers!

I am from the Compose team at Google, and as part of a recent internal hackathon I looked for opportunities to improve the performance of this library. I managed to make noticeable improvements, and I hope we can work together on upstreaming them.

I wrote a simple benchmark to verify the improvements, but I am not contributing it as part of this pull request because it relies on tooling we use internally that, unfortunately, hasn't been released as a public library.

I was benchmarking this composable:

    AsyncImage(
        modifier = Modifier.fillMaxSize(),
        model = androidx.benchmark.R.drawable.logo,
        contentDescription = null,
        contentScale = ContentScale.Crop
    )

and wrote two benchmarks:
• timeToFirstPixel. Combines the initial composition, measure and draw, using this util: https://cs.android.com/androidx/platform/frameworks/support/+/androidx-main:compose/benchmark-utils/src/main/java/androidx/compose/testutils/benchmark/BenchmarkFirstExtensions.kt;l=35;drc=dcaa116fbfda77e64a319e1668056ce3b032469f
• reuse. Measures the work that needs to be performed when we reuse composables of the same structure during lazy list scrolling, using this util: https://cs.android.com/androidx/platform/frameworks/support/+/androidx-main:compose/benchmark-utils/src/main/java/androidx/compose/testutils/benchmark/BenchmarksExtensions.kt;l=435;drc=86a86e1870faa9f1b475d54e37199f0fdb35f115

The numbers I got with this change:
• timeToFirstPixel improved by 25% in timeNs and 35% in allocations.
• reuse improved by 40% in timeNs and 48% in allocations.

Main changes are:
• The coroutines logic in AsyncImagePainter is simplified. The previous implementation added a lot of unnecessary overhead, since every coroutine launch costs extra work and allocations. I reimplemented it without subscribing to two flows in order to restart the job when the input changes or a restart is manually requested. From my testing, the observed behavior didn't change and we still do the same work in the same order. I hope that holds for all possible corner cases; please let me know if you think I missed something important.
• The flow observation in ConstraintsSizeResolver was replaced with a simpler manual coroutine resumption once we have the size. Even though it is conceptually natural to use a flow for such a use case, this change also showed a noticeable improvement in the benchmark. I wish common coroutine constructs like this one didn't carry so much performance overhead.
• The code is now better optimized for the reuse case. The key observation is that Modifier.Node objects are reused, but everything remembered in composition is recreated. To save a lot of allocations, I restructured the code so that instead of creating the AsyncImagePainter in composition, we create it once during Modifier.Node creation and don't re-allocate it on reuse. Unfortunately, I wasn't able to apply the same optimization to SubcomposeAsyncImage, because there the painter is exposed in the public API used in composition.
• All the modifiers are combined into one, which is more efficient and enables smarter invalidation. We no longer add ConstraintsSizeResolver as a separate modifier; instead we call it right from our main modifier. The semantics modifier is folded into the main modifier as well. We also no longer add a clipToBounds() modifier, since clipping manually as part of drawing is more efficient than introducing an extra GraphicsLayer (RenderNode) for it.
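The reuse-oriented restructuring above can be illustrated with a minimal, dependency-free sketch. Painter and ImageNode are hypothetical stand-ins (not Coil's real types) for AsyncImagePainter and the combined Modifier.Node: because the painter is a constructor-time field of the node, a reused node keeps the same instance and only updates its input, so reuse allocates nothing.

```kotlin
// Hypothetical sketch of allocating state at node creation instead of in
// composition. A reused node keeps its painter and only updates the input.
var painterAllocations = 0

class Painter(var input: String) {
    init { painterAllocations++ }
}

class ImageNode(input: String) {
    // Allocated once, when the node is created -- not on every composition.
    val painter = Painter(input)

    // On lazy-list reuse we only push the new input into the existing painter.
    fun update(newInput: String) {
        painter.input = newInput
    }
}
```

Under this sketch's assumptions, creating one node and calling update() repeatedly performs exactly one painter allocation, which is the effect the change aims for during lazy-list reuse.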

Full benchmark results:

Before
CoilBenchmark.reuse
timeNs            min 3,519,301.2,   median 3,566,884.4,   max 4,548,294.8
instructions      min   709,144.6,   median   710,586.8,   max   831,324.2
l1DMisses         min    22,150.8,   median    22,255.0,   max    26,290.9
branchMisses      min    20,426.8,   median    20,503.2,   max    35,141.8
allocationCount   min       420.0,   median       420.0,   max       420.5
CoilBenchmark.timeToFirstPixel
timeNs            min 3,776,971.3,   median 3,924,422.8,   max 5,283,567.1
instructions      min   733,797.9,   median   739,403.5,   max   906,632.5
l1DMisses         min    22,722.0,   median    23,084.3,   max    29,304.1
branchMisses      min    21,959.9,   median    22,170.1,   max    50,751.5
allocationCount   min       609.6,   median       609.6,   max       617.8

After
CoilBenchmark.reuse
timeNs            min 2,089,110.8,   median 2,131,318.5,   max 2,545,147.8
instructions      min   315,215.8,   median   315,790.5,   max   376,257.2
l1DMisses         min    13,759.1,   median    13,830.8,   max    15,850.1
branchMisses      min    11,412.5,   median    11,447.6,   max    19,016.3
allocationCount   min       220.0,   median       220.0,   max       221.4
CoilBenchmark.timeToFirstPixel
timeNs            min 2,868,448.7,   median 3,091,769.0,   max 4,091,858.8
instructions      min   396,472.7,   median   400,086.9,   max   603,784.3
l1DMisses         min    16,288.2,   median    16,459.6,   max    23,505.5
branchMisses      min    14,539.8,   median    14,677.9,   max    45,211.0
allocationCount   min       402.0,   median       402.2,   max       406.2

colinrtwhite (Member) left a comment:


This is awesome! Thanks so much for creating the benchmarks + investigating. Left a few comments, but I'm on board with all the changes.

            }
        }
    }.collect()
    rememberJob = scope.launchUndispatched {
Member:

I think we might still need DeferredDispatchCoroutineScope here and withDeferredDispatch below. More info here, but without it we'll dispatch when we reach the first withContext in RealImageLoader and miss returning from the memory cache synchronously.

andkulikov (Contributor Author):

Yes, I saw that PR, and after reading it I thought this simpler version might work as well, because there you mention:
"To fix this I tried using CoroutineStart.UNDISPATCHED with a StateFlow. This works for initial composition, but does not resolve synchronously from the memory cache on subsequent requests (e.g. when the AsyncImage(model) changes)."
I kept the part where we launch the coroutine with CoroutineStart.UNDISPATCHED, so the initial composition should work (and it works fine in my testing, though obviously I am not testing in the Paparazzi environment).
For subsequent requests, when the model changes, it might work fine as well, because we simply cancel the previous job manually and start it from scratch the same way as initially.
What do you think? Is there a way to test that use case with the code from this PR?
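The cancel-and-relaunch pattern described here can be sketched with only the Kotlin standard library. RestartableRequest is a hypothetical stand-in, not Coil's code: startCoroutine runs the body eagerly on the caller's thread until the first real suspension, mirroring what CoroutineStart.UNDISPATCHED does in kotlinx.coroutines, so a memory-cache hit could complete synchronously.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.EmptyCoroutineContext
import kotlin.coroutines.resume
import kotlin.coroutines.startCoroutine
import kotlin.coroutines.suspendCoroutine

// Hypothetical sketch: restart() abandons the previous request and starts a
// new one that executes synchronously up to its first suspension point.
class RestartableRequest {
    val log = mutableListOf<String>()
    private var pending: Continuation<Unit>? = null

    private suspend fun execute(model: String) {
        log += "start:$model"                    // runs synchronously on the caller's thread
        suspendCoroutine<Unit> { pending = it }  // suspends, e.g. on a cache miss
        log += "done:$model"
    }

    fun restart(model: String) {
        pending = null  // abandon the previous job (a real impl would cancel it)
        // startCoroutine executes the body eagerly until the first suspension,
        // mirroring CoroutineStart.UNDISPATCHED.
        suspend { execute(model) }
            .startCoroutine(Continuation(EmptyCoroutineContext) { })
    }

    fun finish() {
        pending?.also { pending = null }?.resume(Unit)
    }
}
```

For example, restart("a") immediately logs "start:a" before restart() returns, and a second restart("b") simply supersedes the first request; the design question in this thread is whether anything between launch and the cache check could still trigger a dispatch.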

Member:

@andkulikov You're right, I think relaunching the image request whenever the input changes solves that problem; however, there are other issues with using CoroutineStart.UNDISPATCHED that I failed to mention. Overall there are many ways to accidentally trigger a dispatch when using CoroutineStart.UNDISPATCHED, which is why I opted to create DeferredDispatchCoroutineScope. Using withContext, collect, or other coroutine operators will call CoroutineDispatcher.isDispatchNeeded and trigger a dispatch.

The PreviewScreenshots.vector test will fail if dispatching occurs. It's currently passing as launchUndispatched overwrites the Compose main thread dispatcher with Dispatchers.Unconfined. This was done to avoid dispatching before hitting withDeferredDispatch. If I remove Dispatchers.Unconfined from launchUndispatched then the test fails. You can use ./gradlew updateDebugScreenshotTest to quickly rerun the preview screenshot tests.

The goal of DeferredDispatchCoroutineScope/withDeferredDispatch was to avoid dispatching until the coroutine context's CoroutineDispatcher changes. This way we can execute synchronously until we reach the memory cache check or - in the case of previews/tests - the whole image pipeline will execute synchronously as we set EmptyCoroutineContext for fetcherCoroutineContext/decoderCoroutineContext instead of the IO dispatcher.

andkulikov (Contributor Author):

I've tried updating launchUndispatched to not provide a dispatcher, like this: launch(start = CoroutineStart.UNDISPATCHED, block = block), and I wasn't able to break PreviewScreenshots.vector; it seems to pass.

Ok, I uploaded a new commit to start using DeferredDispatchCoroutineScope again in the same way as before. It is slightly slower again, but not dramatically:

Without DeferredDispatchCoroutineScope:
CoilBenchmark.timeToFirstPixel
timeNs            min 2,870,331.0,   median 3,061,765.3,   max 3,858,761.8
allocationCount   min       402.1,   median       402.9,   max       404.1
CoilBenchmark.reuse
timeNs            min 2,102,943.6,   median 2,139,085.8,   max 2,447,290.0
allocationCount   min       220.0,   median       220.0,   max       220.3

With DeferredDispatchCoroutineScope:
CoilBenchmark.timeToFirstPixel
timeNs            min 2,941,350.4,   median 3,085,572.9,   max 3,943,752.3
allocationCount   min       428.1,   median       428.8,   max       430.4
CoilBenchmark.reuse
timeNs            min 2,252,515.0,   median 2,286,389.6,   max 2,636,826.0
allocationCount   min       245.0,   median       245.0,   max       245.4

However, I still believe this code could be simplified further, since we no longer collect from flows in order to restart. It feels like withDeferredDispatch was needed to apply a new dispatcher after flow collection; as we don't collect from the flow anymore, we can start with this dispatcher straight away. And if I understand correctly, we already start with the needed dispatcher, since we apply Dispatchers.Unconfined inside launchUndispatched. I might be missing something, but it feels like withDeferredDispatch is always a no-op now because of that. And if we don't need withDeferredDispatch, then we don't need DeferredDispatchCoroutineScope either.

andkulikov (Contributor Author), Jan 23, 2025:

I think I understand your idea better now, and why we need DeferredDispatchCoroutineScope and DeferredDispatchCoroutineDispatcher. I am happy to keep them, but as an extra optimization we can indeed start with DeferredDispatchCoroutineDispatcher straight away from launchUndispatched, instead of first starting with Dispatchers.Unconfined and then immediately switching to DeferredDispatchCoroutineDispatcher. I uploaded this in a new commit, and the allocation numbers improved:

CoilBenchmark.timeToFirstPixel
timeNs            min 2,923,557.9,   median 3,067,062.5,   max 4,294,279.0
allocationCount   min       415.1,   median       415.1,   max       418.1
CoilBenchmark.reuse
timeNs            min 2,160,035.3,   median 2,200,086.6,   max 2,537,839.7
allocationCount   min       233.0,   median       233.0,   max       233.3

With that, I believe, the end behavior should be the same as before my changes, and we should be good to go.

Member:

@andkulikov Awesome thanks! I'll also write up a better test to verify the immediate dispatching behaviour. PreviewScreenshots.vector only happens to test this behaviour as a side effect. There are tests for DeferredDispatchCoroutineScope, but they're only unit tests and we need an integration test for this.

        replay = 1,
        onBufferOverflow = DROP_OLDEST,
    )
    private var latestConstraints: Constraints = Constraints(maxWidth = 0, maxHeight = 0)
Member:

Awesome! TIL you can get some extra performance by manually resuming the Continuation.

Comment:

Any usage of kotlinx.coroutines in general goes through atomics, which are very slow compared to manual management of continuations / control flow.
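The manual-continuation approach being discussed can be sketched with only the Kotlin standard library. SizeAwaiter is a hypothetical stand-in for ConstraintsSizeResolver: a caller suspends by stashing its Continuation in a plain field, and the layout pass resumes it once a real size arrives, with no SharedFlow (and none of its atomics) on the hot path.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.resume
import kotlin.coroutines.suspendCoroutine

// Hypothetical sketch: suspend by storing the Continuation in a field, then
// resume it manually when the size becomes available.
class SizeAwaiter {
    private var continuation: Continuation<Unit>? = null
    var latestSize: Int = 0  // stands in for latestConstraints
        private set

    suspend fun awaitSize(): Int {
        if (latestSize == 0) {
            suspendCoroutine<Unit> { continuation = it }
        }
        return latestSize  // read the freshest value after resuming
    }

    fun onSizeChanged(size: Int) {
        latestSize = size
        val old = continuation
        continuation = null
        old?.resume(Unit)  // resuming with Unit avoids boxing the size itself
    }
}
```

This matches the pattern visible in the snippets quoted in this thread: the resumed caller reads the latest stored value rather than receiving it through the continuation.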

    // we are creating painter during the modifier creation. as a result we reuse the same
    // painter object when the modifier is being reused as part of lazy layouts reuse flow.
    // it allows us to save on quite a lot of allocations during the reuse.
    val painter = AsyncImagePainter(input)
Member:

Neat idea!

        suspendCoroutine { continuation = it }
        oldContinuation?.resume(Unit)
    }
    return latestConstraints.toSize()
Comment:

This can potentially report different constraints depending on the particular order of invocations / updates to latestConstraints.
I don't think it is a problem here, but ideally you would pass the constraints to Continuation.resume. In general, I feel it is safe to avoid that here to minimize boxing; I just wanted to flag this usage.
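The alternative flagged here can be sketched as follows. ValueAwaiter is a hypothetical name, not a Coil type: the Continuation is resumed with the value itself, so the suspended caller sees exactly the value that woke it up rather than whatever is latest by the time it runs, at the cost of boxing that value.

```kotlin
import kotlin.coroutines.Continuation
import kotlin.coroutines.resume
import kotlin.coroutines.suspendCoroutine

// Hypothetical sketch: resume the continuation with the value, instead of
// resuming with Unit and re-reading a shared field afterwards.
class ValueAwaiter {
    private var continuation: Continuation<Int>? = null

    suspend fun await(): Int = suspendCoroutine { continuation = it }

    fun publish(value: Int) {
        val old = continuation
        continuation = null
        old?.resume(value)  // the caller receives this exact value
    }
}
```

The trade-off matches the comment above: this variant is race-free with respect to later updates, but boxes the value on each resume, which the Unit-resume variant avoids.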

andkulikov (Contributor Author):

Yeah, I think for this logic we always want to use the latest constraints regardless, so it should be fine.

colinrtwhite (Member) left a comment:

Huge thanks for doing this! These changes will be part of 3.1, which should be released soon.

@colinrtwhite colinrtwhite enabled auto-merge (squash) January 23, 2025 22:50
@colinrtwhite colinrtwhite merged commit d3694d3 into coil-kt:main Jan 23, 2025
10 checks passed