Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a auto re-trying wrapper backend #835

Merged
merged 26 commits into from
Jan 15, 2025

Conversation

grote
Copy link
Collaborator

@grote grote commented Jan 3, 2025

Data to be saved to the backend will now be kept in memory and not directly written to streams. So watch out for out of memory errors!

This allows us to keep the data available for re-tries. If a network request times out or fails in a non-permanent way, the new RetryBackend will automatically retry all backend operations. Over time, we can find out more transient exceptions and add them to the list. For now, we start small with few real tr-try cases.

For ease of combined testing, this MR contains commits that fix unrelated issues:

Closes #796, #828, #830, #841

@grote grote requested a review from t-m-w January 6, 2025 18:50
@grote grote force-pushed the retry-backend branch 8 times, most recently from f22428c to f41a8df Compare January 7, 2025 21:14
@t-m-w
Copy link
Collaborator

t-m-w commented Jan 8, 2025

So #796 is what this PR itself refers to, and the others listed are the unrelated ones? I left comments on the others, which appear fixed, but I still need to test the main issue.

@grote
Copy link
Collaborator Author

grote commented Jan 9, 2025

and the others listed are the unrelated ones?

Yes.

I left comments on the others, which appear fixed, but I still need to test the main issue.

Thanks!

@t-m-w
Copy link
Collaborator

t-m-w commented Jan 9, 2025

Test case 1 (with WebDAV)

  1. Turn off mobile data and connect to Wi-Fi.
  2. Start an app backup.
  3. Stray from Wi-Fi for a minute or two.
  4. Return to Wi-Fi.

Expected result

Backup retries transient failures and ultimately completes successfully.(?)

Actual result

Backup claimed to finish successfully. However, during the process, the backup got stuck for a while on RHVoice while the network was not available. When using the Restore functionality, RHVoice is shown with "Last backup: Never (0 B)", but that is not the case for the previous backup - so I think although the backup "finished successfully", on the apps which failed, it gave up. Still working on replicating this.

Log

Full log shared privately with filename 20250109-app-backup-success-after-brief-disconnect.log:

snippet
01-09 09:52:23.854  1000  1669 24109 W PFTBT   : Error -1000 backing up com.github.olga_yakovleva.rhvoice.android
01-09 09:52:23.860 10168 11207 30649 I BackupCoordinator: Request full backup time. Returned 3600000
01-09 09:52:23.861  1000  1669 24109 I PFTBT   : Transport suggested backoff=3600000
01-09 09:52:23.861  1000  1669 24109 W PFTBT   : Transport failed; aborting backup: -1000
01-09 09:52:23.861 10168 11207 30649 I NotificationBackupObserver: Completed. Target: com.github.olga_yakovleva.rhvoice.android, status: -1
000
01-09 09:52:23.862  1000  1669 24109 W ActivityManager: Unbinding backup agent with no active backup
01-09 09:52:23.862  1000  1669 24109 W ActivityManager: Process/uid not found attempting kill of com.github.olga_yakovleva.rhvoice.android:
BackupAgent / 10267
01-09 09:52:23.862  1000  1669 24109 I PFTBT   : Full backup completed with status: -1000
01-09 09:52:23.863 10168 11207 30649 I BackupRequester: Requesting backup for 400 of 433 packages...
01-09 09:52:23.863  1000  1669 24809 E BackupManagerService: Error backing up com.github.olga_yakovleva.rhvoice.android: write failed: EPIPE (Broken pipe)
01-09 09:52:23.864  1000  1669 24809 W ActivityManager: Unbinding backup agent with no active backup
01-09 09:52:23.864  1000  1669 24809 W ActivityManager: Process/uid not found attempting kill of com.github.olga_yakovleva.rhvoice.android:BackupAgent / 10267
01-09 09:52:23.868 10168 11207 11207 I BackupStateManager: B - appBackupRunning: false, filesBackupRunning: false, appBackupWorker: ENQUEUED
01-09 09:52:23.869 10168 11207 11207 D ConfigurableBackupTransportService: Service destroyed.
01-09 09:52:23.872  1000  1669 24109 I PFTBT   : Full data backup pass finished.

@t-m-w
Copy link
Collaborator

t-m-w commented Jan 9, 2025

Re Test case 1 above: @grote and I were perplexed to see lines like E FullBackup: Error handling backup data for com.github.olga_yakovleva.rhvoice.android: with nothing following them, because it is supposed to show the IOException stack trace. It turns out that UnknownHostException is explicitly filtered from logging in multiple spots, so we think that's what the exception must have been in this case. It's not currently handled by the auto retry backend, but perhaps it should be.

@grote
Copy link
Collaborator Author

grote commented Jan 10, 2025

I pushed some changes that new will retry UnknownHostException and uses linear backoff, so we are less likely to hit tight system timeouts while retrying.

@t-m-w
Copy link
Collaborator

t-m-w commented Jan 14, 2025

  1. I see this is working for the newly-added UnknownHostException, and I can confirm that the retried package shows in Restore:

    01-14 16:31:24.262  1000  1654 31903 D BackupManagerService: Calling doFullBackup() on com.keylesspalace.tusky
    01-14 16:31:24.703 10168 18214 28582 W o.c.s.c.b.RetryBackend: [binder:18214_A      ] Retrying #1 after error and delay 2500ms: java.net.UnknownHostException: Unable to resolve host "redacted-dav.example.com": No address associated with hostname
    01-14 16:31:27.213 10168 18214 28582 W o.c.s.c.b.RetryBackend: [binder:18214_A      ] Retrying #2 after error and delay 5000ms: java.net.UnknownHostException: Unable to resolve host "redacted-dav.example.com": No address associated with hostname
    01-14 16:31:32.223 10168 18214 28582 W o.c.s.c.b.RetryBackend: [binder:18214_A      ] Retrying #3 after error and delay 7500ms: java.net.UnknownHostException: Unable to resolve host "redacted-dav.example.com": No address associated with hostname
    01-14 16:31:39.730 10168 18214 28582 W o.c.s.c.b.RetryBackend: [binder:18214_A      ] Retrying #4 after error and delay 10000ms: java.net.UnknownHostException: Unable to resolve host "redacted-dav.example.com": No address associated with hostname
    01-14 16:31:50.394 10168 18214 28582 I FullBackup: Finish full backup of com.keylesspalace.tusky. Wrote 1234567 bytes
    
  2. There is an additional unhandled case that came up in one (separate) trial, wherein the whole backup ultimately failed:

    01-14 16:21:26.022 10168 18214 19642 E FullBackup: Error handling backup data for us.spotco.fennec_dos: 
    01-14 16:21:26.022 10168 18214 19642 E FullBackup: okhttp3.internal.http2.StreamResetException: stream was reset: PROTOCOL_ERROR
    
    full snippet
    01-14 16:21:26.022 10168 18214 19642 E FullBackup: Error handling backup data for us.spotco.fennec_dos: 
    01-14 16:21:26.022 10168 18214 19642 E FullBackup: okhttp3.internal.http2.StreamResetException: stream was reset: PROTOCOL_ERROR
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.kt:148)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http2.Http2ExchangeCodec.readResponseHeaders(Http2ExchangeCodec.kt:97)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.kt:110)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:93)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.BasicDigestAuthHandler.intercept(BasicDigestAuthHandler.kt:303)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.connection.RealCall.execute(RealCall.kt:154)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource.put$lambda$18(DavResource.kt:465)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource.$r8$lambda$Vf0bD867Wnl2_qIkb4GLmFtj5hE(Unknown Source:0)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource$$ExternalSyntheticLambda5.invoke(D8$$SyntheticClass:0)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource.followRedirects$build(DavResource.kt:666)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource.put(DavResource.kt:450)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource.put$default(DavResource.kt:448)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.webdav.WebDavBackend.save(WebDavBackend.kt:137)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend$save$2.invokeSuspend(RetryBackend.kt:34)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend$save$2.invoke(Unknown Source:8)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend$save$2.invoke(Unknown Source:2)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend.retry(RetryBackend.kt:70)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend.retry(RetryBackend.kt:65)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend.save(RetryBackend.kt:32)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.repo.BlobCreator.createNewBlob(BlobCreator.kt:78)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.repo.BackupReceiver.onNewChunk(BackupReceiver.kt:119)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.repo.BackupReceiver.addBytes(BackupReceiver.kt:63)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.transport.backup.FullBackup.sendBackupData(FullBackup.kt:128)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.transport.backup.BackupCoordinator.sendBackupData(BackupCoordinator.kt:257)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.transport.ConfigurableBackupTransport$sendBackupData$1.invokeSuspend(ConfigurableBackupTransport.kt:195)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:108)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:280)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source:1)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source:1)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.transport.ConfigurableBackupTransport.sendBackupData(ConfigurableBackupTransport.kt:194)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at android.app.backup.BackupTransport$TransportImpl.sendBackupData(BackupTransport.java:900)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.android.internal.backup.IBackupTransport$Stub.onTransact(IBackupTransport.java:778)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at android.os.Binder.execTransactInternal(Binder.java:1396)
    01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at android.os.Binder.execTransact(Binder.java:1335)
    

@grote
Copy link
Collaborator Author

grote commented Jan 15, 2025

Thanks for testing! I just added the StreamResetException as well, but note that we don't need to find all transient exceptions in this PR. This can happen later as well.

@t-m-w
Copy link
Collaborator

t-m-w commented Jan 15, 2025

Understood! The StreamResetException just came up in my first test and made it appear that this change change was ineffective until looking at logs to see why.

Copy link
Collaborator

@t-m-w t-m-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related issues appear to be fixed.

Quick test of backup/restore-after-wipe worked fine.

Did not encounter any breakage in real-world use or otherwise.

@grote grote merged commit eb72d37 into seedvault-app:android15 Jan 15, 2025
2 checks passed
@grote grote deleted the retry-backend branch January 15, 2025 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement a auto re-trying wrapper backend
2 participants