Implement a auto re-trying wrapper backend #835

grote · 2025-01-03T21:37:58Z

Data to be saved to the backend will now be kept in memory and not directly written to streams. So watch out for out of memory errors!

This allows us to keep the data available for re-tries. If a network request times out or fails in a non-permanent way, the new RetryBackend will automatically retry all backend operations. Over time, we can find out more transient exceptions and add them to the list. For now, we start small with few real tr-try cases.

For ease of combined testing, this MR contains commits that fix unrelated issues:

Closes #796, #828, #830, #841

…ork with it

See: https://developer.android.com/develop/background-work/background-tasks/broadcasts#android-14 for why we don't get broadcasts while the system has us in cached state

We forgot the app size, so when restoring apps, their data size was unknown

t-m-w · 2025-01-08T22:46:11Z

So #796 is what this PR itself refers to, and the others listed are the unrelated ones? I left comments on the others, which appear fixed, but I still need to test the main issue.

grote · 2025-01-09T12:35:44Z

and the others listed are the unrelated ones?

Yes.

I left comments on the others, which appear fixed, but I still need to test the main issue.

Thanks!

t-m-w · 2025-01-09T16:12:13Z

Test case 1 (with WebDAV)

Turn off mobile data and connect to Wi-Fi.
Start an app backup.
Stray from Wi-Fi for a minute or two.
Return to Wi-Fi.

Expected result

Backup retries transient failures and ultimately completes successfully.(?)

Actual result

Backup claimed to finish successfully. However, during the process, the backup got stuck for a while on RHVoice while the network was not available. When using the Restore functionality, RHVoice is shown with "Last backup: Never (0 B)", but that is not the case for the previous backup - so I think although the backup "finished successfully", on the apps which failed, it gave up. Still working on replicating this.

Log

Full log shared privately with filename 20250109-app-backup-success-after-brief-disconnect.log:

snippet

01-09 09:52:23.854  1000  1669 24109 W PFTBT   : Error -1000 backing up com.github.olga_yakovleva.rhvoice.android
01-09 09:52:23.860 10168 11207 30649 I BackupCoordinator: Request full backup time. Returned 3600000
01-09 09:52:23.861  1000  1669 24109 I PFTBT   : Transport suggested backoff=3600000
01-09 09:52:23.861  1000  1669 24109 W PFTBT   : Transport failed; aborting backup: -1000
01-09 09:52:23.861 10168 11207 30649 I NotificationBackupObserver: Completed. Target: com.github.olga_yakovleva.rhvoice.android, status: -1
000
01-09 09:52:23.862  1000  1669 24109 W ActivityManager: Unbinding backup agent with no active backup
01-09 09:52:23.862  1000  1669 24109 W ActivityManager: Process/uid not found attempting kill of com.github.olga_yakovleva.rhvoice.android:
BackupAgent / 10267
01-09 09:52:23.862  1000  1669 24109 I PFTBT   : Full backup completed with status: -1000
01-09 09:52:23.863 10168 11207 30649 I BackupRequester: Requesting backup for 400 of 433 packages...
01-09 09:52:23.863  1000  1669 24809 E BackupManagerService: Error backing up com.github.olga_yakovleva.rhvoice.android: write failed: EPIPE (Broken pipe)
01-09 09:52:23.864  1000  1669 24809 W ActivityManager: Unbinding backup agent with no active backup
01-09 09:52:23.864  1000  1669 24809 W ActivityManager: Process/uid not found attempting kill of com.github.olga_yakovleva.rhvoice.android:BackupAgent / 10267
01-09 09:52:23.868 10168 11207 11207 I BackupStateManager: B - appBackupRunning: false, filesBackupRunning: false, appBackupWorker: ENQUEUED
01-09 09:52:23.869 10168 11207 11207 D ConfigurableBackupTransportService: Service destroyed.
01-09 09:52:23.872  1000  1669 24109 I PFTBT   : Full data backup pass finished.

t-m-w · 2025-01-09T22:21:15Z

Re Test case 1 above: @grote and I were perplexed to see lines like E FullBackup: Error handling backup data for com.github.olga_yakovleva.rhvoice.android: with nothing following them, because it is supposed to show the IOException stack trace. It turns out that UnknownHostException is explicitly filtered from logging in multiple spots, so we think that's what the exception must have been in this case. It's not currently handled by the auto retry backend, but perhaps it should be.

grote · 2025-01-10T14:36:14Z

I pushed some changes that new will retry UnknownHostException and uses linear backoff, so we are less likely to hit tight system timeouts while retrying.

…system timeout

t-m-w · 2025-01-14T21:45:53Z

I see this is working for the newly-added UnknownHostException, and I can confirm that the retried package shows in Restore:

01-14 16:31:24.262  1000  1654 31903 D BackupManagerService: Calling doFullBackup() on com.keylesspalace.tusky
01-14 16:31:24.703 10168 18214 28582 W o.c.s.c.b.RetryBackend: [binder:18214_A      ] Retrying #1 after error and delay 2500ms: java.net.UnknownHostException: Unable to resolve host "redacted-dav.example.com": No address associated with hostname
01-14 16:31:27.213 10168 18214 28582 W o.c.s.c.b.RetryBackend: [binder:18214_A      ] Retrying #2 after error and delay 5000ms: java.net.UnknownHostException: Unable to resolve host "redacted-dav.example.com": No address associated with hostname
01-14 16:31:32.223 10168 18214 28582 W o.c.s.c.b.RetryBackend: [binder:18214_A      ] Retrying #3 after error and delay 7500ms: java.net.UnknownHostException: Unable to resolve host "redacted-dav.example.com": No address associated with hostname
01-14 16:31:39.730 10168 18214 28582 W o.c.s.c.b.RetryBackend: [binder:18214_A      ] Retrying #4 after error and delay 10000ms: java.net.UnknownHostException: Unable to resolve host "redacted-dav.example.com": No address associated with hostname
01-14 16:31:50.394 10168 18214 28582 I FullBackup: Finish full backup of com.keylesspalace.tusky. Wrote 1234567 bytes

There is an additional unhandled case that came up in one (separate) trial, wherein the whole backup ultimately failed:

01-14 16:21:26.022 10168 18214 19642 E FullBackup: Error handling backup data for us.spotco.fennec_dos: 
01-14 16:21:26.022 10168 18214 19642 E FullBackup: okhttp3.internal.http2.StreamResetException: stream was reset: PROTOCOL_ERROR

full snippet

01-14 16:21:26.022 10168 18214 19642 E FullBackup: Error handling backup data for us.spotco.fennec_dos: 
01-14 16:21:26.022 10168 18214 19642 E FullBackup: okhttp3.internal.http2.StreamResetException: stream was reset: PROTOCOL_ERROR
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.kt:148)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http2.Http2ExchangeCodec.readResponseHeaders(Http2ExchangeCodec.kt:97)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.kt:110)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:93)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.BasicDigestAuthHandler.intercept(BasicDigestAuthHandler.kt:303)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:76)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:201)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at okhttp3.internal.connection.RealCall.execute(RealCall.kt:154)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource.put$lambda$18(DavResource.kt:465)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource.$r8$lambda$Vf0bD867Wnl2_qIkb4GLmFtj5hE(Unknown Source:0)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource$$ExternalSyntheticLambda5.invoke(D8$$SyntheticClass:0)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource.followRedirects$build(DavResource.kt:666)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource.put(DavResource.kt:450)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at at.bitfire.dav4jvm.DavResource.put$default(DavResource.kt:448)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.webdav.WebDavBackend.save(WebDavBackend.kt:137)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend$save$2.invokeSuspend(RetryBackend.kt:34)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend$save$2.invoke(Unknown Source:8)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend$save$2.invoke(Unknown Source:2)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend.retry(RetryBackend.kt:70)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend.retry(RetryBackend.kt:65)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at org.calyxos.seedvault.core.backends.RetryBackend.save(RetryBackend.kt:32)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.repo.BlobCreator.createNewBlob(BlobCreator.kt:78)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.repo.BackupReceiver.onNewChunk(BackupReceiver.kt:119)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.repo.BackupReceiver.addBytes(BackupReceiver.kt:63)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.transport.backup.FullBackup.sendBackupData(FullBackup.kt:128)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.transport.backup.BackupCoordinator.sendBackupData(BackupCoordinator.kt:257)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.transport.ConfigurableBackupTransport$sendBackupData$1.invokeSuspend(ConfigurableBackupTransport.kt:195)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:108)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:280)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.BlockingCoroutine.joinBlocking(Builders.kt:85)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking(Builders.kt:59)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.BuildersKt.runBlocking(Unknown Source:1)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.BuildersKt__BuildersKt.runBlocking$default(Builders.kt:38)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at kotlinx.coroutines.BuildersKt.runBlocking$default(Unknown Source:1)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.stevesoltys.seedvault.transport.ConfigurableBackupTransport.sendBackupData(ConfigurableBackupTransport.kt:194)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at android.app.backup.BackupTransport$TransportImpl.sendBackupData(BackupTransport.java:900)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at com.android.internal.backup.IBackupTransport$Stub.onTransact(IBackupTransport.java:778)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at android.os.Binder.execTransactInternal(Binder.java:1396)
01-14 16:21:26.022 10168 18214 19642 E FullBackup:      at android.os.Binder.execTransact(Binder.java:1335)

grote · 2025-01-15T13:17:18Z

Thanks for testing! I just added the StreamResetException as well, but note that we don't need to find all transient exceptions in this PR. This can happen later as well.

t-m-w · 2025-01-15T13:49:22Z

Understood! The StreamResetException just came up in my first test and made it appear that this change change was ineffective until looking at logs to see why.

t-m-w

Related issues appear to be fixed.

Quick test of backup/restore-after-wipe worked fine.

Did not encounter any breakage in real-world use or otherwise.

grote added 5 commits January 3, 2025 17:57

File BackupService: don't restart automatically and log self stopping

6ee31f4

Move MemoryLogger to core so we can use it in storage lib as well

73da56a

Implement new Backend.save() method that can be re-tried

bd58d57

Use new Backend.save() method with app backup

98df907

Use new Backend.save() method with file backup

255f1db

grote requested a review from t-m-w January 6, 2025 18:50

grote force-pushed the retry-backend branch 8 times, most recently from f22428c to f41a8df Compare January 7, 2025 21:14

grote added 16 commits January 7, 2025 18:46

Remove old Backend.save() method and hacky WebDAV pipe code with it

cac71a5

Implement new RetryBackend and wrap it around other backends

86750c7

Add logging around StaleDataException

fcde826

Introduce a BackendId, so RetryBackend doesn't mess up class comparisons

29cb6fa

Allow BlobCreator's saving to be re-tried

8edc3bc

Add test that snapshot saving can be re-tried

8ee538a

Ensure that file backup Snapshot saving can be re-tried

697c9f4

Ensure ChunkWriter saving can be re-tried

68b5d35

Don't schedule/cancel workers two times after setting new backend

40def41

Throw exception from SafHelper#findFileBlocking so RetryBackend can w…

122a5e6

…ork with it

Add TODO to StorageBackup#init

c2a678f

Use UsbMonitorService in foreground to wait for USB root

f892b8a

See: https://developer.android.com/develop/background-work/background-tasks/broadcasts#android-14 for why we don't get broadcasts while the system has us in cached state

Fix: Convert app size from snapshot to old metadata

0187f01

We forgot the app size, so when restoring apps, their data size was unknown

log logcat command at the beginning

841b9b8

log when file backup gets scheduled and add TODOs

a0c50ae

Introduce SafRetryException to make SafBackend more resilient

9a32185

Save build reports on cirrus CI

1597c9e

grote force-pushed the retry-backend branch from f41a8df to 1597c9e Compare January 7, 2025 21:46

This was referenced Jan 8, 2025

StaleDataException: Attempted to access a cursor after it has been closed. #828

Closed

Both backup processes happening at once (app backup and file backup running at the same time) #830

Closed

System apps item shows 0 bytes on restore screen #841

Closed

lucasmz-dev mentioned this pull request Jan 9, 2025

Backup doesn't trigger when USB is plugged in #735

Open

grote added 3 commits January 10, 2025 11:47

Remove keepTrying() for saving snapshots as we now have RetryBackend

a1bde2e

Use linear delay for RetryBackend to make it less likely to run into …

89c8bad

…system timeout

Treat UnknownHostException as transient, so it gets retried

ee85b99

grote force-pushed the retry-backend branch from f3c8c8a to ee85b99 Compare January 10, 2025 14:47

grote mentioned this pull request Jan 10, 2025

Tell system to reinitialize when it refuses to give us data we need #845

Merged

Classify StreamResetException transient in WebDavBackend

3dcd5e9

t-m-w approved these changes Jan 15, 2025

View reviewed changes

grote merged commit eb72d37 into seedvault-app:android15 Jan 15, 2025
2 checks passed

grote deleted the retry-backend branch January 15, 2025 16:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a auto re-trying wrapper backend #835

Implement a auto re-trying wrapper backend #835

grote commented Jan 3, 2025 •

edited

Loading

t-m-w commented Jan 8, 2025

grote commented Jan 9, 2025

t-m-w commented Jan 9, 2025 •

edited

Loading

t-m-w commented Jan 9, 2025

grote commented Jan 10, 2025

t-m-w commented Jan 14, 2025 •

edited

Loading

grote commented Jan 15, 2025

t-m-w commented Jan 15, 2025

t-m-w left a comment

Implement a auto re-trying wrapper backend #835

Implement a auto re-trying wrapper backend #835

Conversation

grote commented Jan 3, 2025 • edited Loading

t-m-w commented Jan 8, 2025

grote commented Jan 9, 2025

t-m-w commented Jan 9, 2025 • edited Loading

Test case 1 (with WebDAV)

Expected result

Actual result

Log

t-m-w commented Jan 9, 2025

grote commented Jan 10, 2025

t-m-w commented Jan 14, 2025 • edited Loading

grote commented Jan 15, 2025

t-m-w commented Jan 15, 2025

t-m-w left a comment

Choose a reason for hiding this comment

grote commented Jan 3, 2025 •

edited

Loading

t-m-w commented Jan 9, 2025 •

edited

Loading

t-m-w commented Jan 14, 2025 •

edited

Loading