-
Notifications
You must be signed in to change notification settings - Fork 677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance optimizations for AsyncImage #2795
Conversation
Before CoilBenchmark.reuse timeNs min 3,519,301.2, median 3,566,884.4, max 4,548,294.8 instructions min 709,144.6, median 710,586.8, max 831,324.2 l1DMisses min 22,150.8, median 22,255.0, max 26,290.9 branchMisses min 20,426.8, median 20,503.2, max 35,141.8 allocationCount min 420.0, median 420.0, max 420.5 CoilBenchmark.timeToFirstPixel timeNs min 3,776,971.3, median 3,924,422.8, max 5,283,567.1 instructions min 733,797.9, median 739,403.5, max 906,632.5 l1DMisses min 22,722.0, median 23,084.3, max 29,304.1 branchMisses min 21,959.9, median 22,170.1, max 50,751.5 allocationCount min 609.6, median 609.6, max 617.8 After CoilBenchmark.reuse timeNs min 2,089,110.8, median 2,131,318.5, max 2,545,147.8 instructions min 315,215.8, median 315,790.5, max 376,257.2 l1DMisses min 13,759.1, median 13,830.8, max 15,850.1 branchMisses min 11,412.5, median 11,447.6, max 19,016.3 allocationCount min 220.0, median 220.0, max 221.4 CoilBenchmark.timeToFirstPixel timeNs min 2,868,448.7, median 3,091,769.0, max 4,091,858.8 instructions min 396,472.7, median 400,086.9, max 603,784.3 l1DMisses min 16,288.2, median 16,459.6, max 23,505.5 branchMisses min 14,539.8, median 14,677.9, max 45,211.0 allocationCount min 402.0, median 402.2, max 406.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! Thanks so much for creating the benchmarks + investigating. Left a few comments, but I'm on board with all the changes.
} | ||
} | ||
}.collect() | ||
rememberJob = scope.launchUndispatched { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might still need DeferredDispatchCoroutineScope
here and withDeferredDispatch
below. More info here, but without it we'll dispatch when we reach the first withContext
in RealImageLoader
and miss returning from the memory cache synchronously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I saw this PR and after reading it I thought that maybe this simpler version will work as well, because there you mention:
"To fix this I tried using CoroutineStart.UNDISPATCHED with a StateFlow. This works for initial composition, but does not resolve synchronously from the memory cache on subsequent requests (e.g. when the AsyncImage(model) changes)."
I did keep the part where we launch the coroutine with CoroutineStart.UNDISPATCHED, so the initial composition should work (and it works fine in my testing, but I test not in the Paparazzi environment obviously).
and for the subsequent requests when model changes it might work fine as well because we just manually cancel the previous job and start it from scratch in the same way as initially.
what do you think? is there a way to test the use case from there with the code from this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andkulikov You're right I think that relaunching the image request whenever input
changes solves that problem, however there other issues with using CoroutineStart.UNDISPATCHED
that I failed to mention. Overall there are many ways to accidentally trigger a dispatch when using CoroutineStart.UNDISPATCHED
, which is why I opted to create DeferredDispatchCoroutineScope
. Using withContext
, collect
, or other coroutine operators will call CoroutineDispatcher.isDispatchNeeded
and trigger a dispatch.
The PreviewScreenshots.vector
test will fail if dispatching occurs. It's currently passing as launchUndispatched
overwrites the Compose main thread dispatcher with Dispatchers.Unconfined
. This was done to avoid dispatching before hitting withDeferredDispatch
. If I remove Dispatchers.Unconfined
from launchUndispatched
then the test fails. You can use ./gradlew updateDebugScreenshotTest
to quickly rerun the preview screenshot tests.
The goal of DeferredDispatchCoroutineScope
/withDeferredDispatch
was to avoid dispatching until the coroutine context's CoroutineDispatcher
changes. This way we can execute synchronously until we reach the memory cache check or - in the case of previews/tests - the whole image pipeline will execute synchronously as we set EmptyCoroutineContext
for fetcherCoroutineContext
/decoderCoroutineContext
instead of the IO dispatcher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to update launchUndispatched
to not provide a dispatcher like this launch(start = CoroutineStart.UNDISPATCHED, block = block)
and I wasn't able to break PreviewScreenshots.vector
, it seem to pass.
Ok, I uploaded a new commit to start using DeferredDispatchCoroutineScope again in the same way as before. It is slightly slower again, but not dramatically:
Without DeferredDispatchCoroutineScope:
CoilBenchmark.timeToFirstPixel
timeNs min 2,870,331.0, median 3,061,765.3, max 3,858,761.8
allocationCount min 402.1, median 402.9, max 404.1
CoilBenchmark.reuse
timeNs min 2,102,943.6, median 2,139,085.8, max 2,447,290.0
allocationCount min 220.0, median 220.0, max 220.3
With DeferredDispatchCoroutineScope:
CoilBenchmark.timeToFirstPixel
timeNs min 2,941,350.4, median 3,085,572.9, max 3,943,752.3
allocationCount min 428.1, median 428.8, max 430.4
CoilBenchmark.reuse
timeNs min 2,252,515.0, median 2,286,389.6, max 2,636,826.0
allocationCount min 245.0, median 245.0, max 245.4
However, I still believe this code could be simplified further now as we don't collect from the flows in order to restart anymore. It feels like withDeferredDispatch
was needed to apply a new dispatcher after flow collection. as we don't collect from the flow anymore we can start with this dispatcher straight away. and if I understand correctly we already start with the needed dispatcher as we apply Dispatchers.Unconfined
inside of launchUndispatched
. I might be missing something, but it feels like withDeferredDispatch
is always a no-op now because of that. And if we don't need withDeferredDispatch
, then we don't need DeferredDispatchCoroutineScope
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understood your idea better now and why we need DeferredDispatchCoroutineScope and DeferredDispatchCoroutineDispatcher. I am happy to keep them, but as an extra optimization we can indeed start with DeferredDispatchCoroutineDispatcher straight away from launchUndispatched, instead of first starting with Dispatchers.Unconfined and then immediately switching to DeferredDispatchCoroutineDispatcher. I uploaded it in a new commit and the allocations numbers are improved:
CoilBenchmark.timeToFirstPixel
timeNs min 2,923,557.9, median 3,067,062.5, max 4,294,279.0
allocationCount min 415.1, median 415.1, max 418.1
CoilBenchmark.reuse
timeNs min 2,160,035.3, median 2,200,086.6, max 2,537,839.7
allocationCount min 233.0, median 233.0, max 233.3
With that, I believe, the end behavior should be the same as before my changes and we should be good to go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andkulikov Awesome thanks! I'll also write up a better test to verify the immediate dispatching behaviour. PreviewScreenshots.vector
only happens to test this behaviour as a side effect. There are tests for DeferredDispatchCoroutineScope
, but they're only unit tests and we need an integration test for this.
coil-compose-core/src/commonMain/kotlin/coil3/compose/AsyncImagePainter.kt
Show resolved
Hide resolved
coil-compose-core/src/commonMain/kotlin/coil3/compose/AsyncImagePainter.kt
Show resolved
Hide resolved
coil-compose-core/src/commonMain/kotlin/coil3/compose/AsyncImagePainter.kt
Show resolved
Hide resolved
replay = 1, | ||
onBufferOverflow = DROP_OLDEST, | ||
) | ||
private var latestConstraints: Constraints = Constraints(maxWidth = 0, maxHeight = 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! TIL you can get some extra performance by manually resuming the Continuation
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any usage of kotlinx.coroutines in general goes through atomics, which are very slow compared to manual management of continuations / control flow.
// we are creating painter during the modifier creation. as a result we reuse the same | ||
// painter object when the modifier is being reused as part of lazy layouts reuse flow. | ||
// it allows us to save on quite a lot of allocations during the reuse. | ||
val painter = AsyncImagePainter(input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat idea!
replay = 1, | ||
onBufferOverflow = DROP_OLDEST, | ||
) | ||
private var latestConstraints: Constraints = Constraints(maxWidth = 0, maxHeight = 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any usage of kotlinx.coroutines in general goes through atomics, which are very slow compared to manual management of continuations / control flow.
suspendCoroutine { continuation = it } | ||
oldContinuation?.resume(Unit) | ||
} | ||
return latestConstraints.toSize() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can potentially report different constraints with particular order of invocations / updates to latestConstraints.
I don't think it is a problem here, but ideally you would pass the constraints to Continuation.resume
. In general, I feel like it is save to avoid this here to minimize boxing, just wanted to flag this usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I think for this logic we always want to use the latest constraints regardless. so should be fine
coil-compose-core/src/commonMain/kotlin/coil3/compose/internal/ContentPainterModifier.kt
Outdated
Show resolved
Hide resolved
coil-compose-core/src/commonMain/kotlin/coil3/compose/AsyncImagePainter.kt
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huge thanks for doing this! These changes will be part of 3.1
, which should be released soon.
Hello, Coil maintainers!
I am from Compose team at Google and as part of our recent internal hackathon I tried to find opportunities to improve the performance of this library. I did manage to make noticeable improvements, and I hope we can work together on upstreaming them.
I did write a simple benchmark for verifying the improvements, however I am not contributing it as part of this pull request, as it used some tooling we use internally, but didn't release as a public library, unfortunately.
I was benchmarking this composable:
and wrote two benchmarks:
• timeToFirstPixel. The benchmark combining the initial composition, measure and draw using this util: https://cs.android.com/androidx/platform/frameworks/support/+/androidx-main:compose/benchmark-utils/src/main/java/androidx/compose/testutils/benchmark/BenchmarkFirstExtensions.kt;l=35;drc=dcaa116fbfda77e64a319e1668056ce3b032469f
• reuse. The benchmark measuring the work needed to be performed when we reuse composables of the same structure during the lazy lists scrolling using this util: https://cs.android.com/androidx/platform/frameworks/support/+/androidx-main:compose/benchmark-utils/src/main/java/androidx/compose/testutils/benchmark/BenchmarksExtensions.kt;l=435;drc=86a86e1870faa9f1b475d54e37199f0fdb35f115
The end numbers I've got with this change:
• timeToFirstPixel improved by 25% in timeNs and 35% in allocations.
• reuse improved by 40% in timeNs and 48% in allocations.
Main changes are:
• coroutines logic from AsyncImagePainter is simplified. the previous one was adding a lot of unnecessary overhead, every coroutine launch is adding a lot of extra work and allocations. I reimplemented it without subscribing for two flows in order to restart the job when the input did change or if the restart is manually requested. from my testing, it seems like the observed logic didn't change and we still do the same work in the same order. I hope it is true for all the possible corner cases, please let me know if you think I missed something important.
• flow observation from ConstraintsSizeResolver was replaced with a simpler manual coroutine restoration once we have the size. even that conceptually it is natural to use flow for such use case, this change was also showing a noticeable improvement in the benchmark. I wish common coroutine constructs like this one didn't have that much of the performance overhead.
• the code is now more optimized for reuse use case: the main important bit here is we do reuse Modifier.Node objects, but we recreate everything we remember in composition. in order to save on a lot of allocations I restructured the code so instead of doing it in composition we create AsyncImagePainter once during the Modifier.Node creation and don't re-allocate it when reuse. unfortunately, I wasn't able to apply the same optimization for SubcomposeAsyncImage as there the painter is exposed in the public api used in composition.
• all the modifiers are combined into one to be more efficient and have a smarter invalidation. we don't add ConstraintsSizeResolver as a modifier anymore, instead we call it right from our main modifier. semantic modifier is moved as part of our main modifier as well. we don't add clipToBounds() modifier anymore as clipping manually as part of the drawing is more efficient compared to introducing extra GraphicsLayer (RenderNode) for it
Full benchmark results: