rls: Synchronization fixes in CachingRlsLbClient #10999

Merged: 4 commits merged into grpc:master from rls-sync on Apr 3, 2024

Conversation

@ejona86 (Member) commented Mar 11, 2024

This started with combining handleNewRequest with asyncRlsCall, but that emphasized pre-existing synchronization issues and trying to fix those exposed others. It was hard to split this into smaller commits because they were interconnected.

handleNewRequest was combined with asyncRlsCall to use a single code flow for handling the completed future while also failing the pick immediately for throttled requests. That flow was then reused for refreshing after backoff and on data staleness. It no longer optimizes for the RPC completing immediately, because that would not happen in real life; it only happens in tests because of inprocess+directExecutor(), and we don't want tests to exercise a different code flow. This did require updating some of the tests.
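
Roughly, the combined flow looks like the sketch below. This is only an illustration: `RouteLookupResponse`, `ThrottledException`, `startRpc`, and `updateCache` are simplified stand-ins using `CompletableFuture`, not the actual grpc-java RLS types.

```java
import java.util.concurrent.CompletableFuture;

// Simplified illustration only; not the actual CachingRlsLbClient types.
final class RlsLookupSketch {
  static final class ThrottledException extends Exception {}

  static final class RouteLookupResponse {
    final String target;
    RouteLookupResponse(String target) { this.target = target; }
  }

  /**
   * One code path for every lookup: throttled requests fail immediately,
   * and everything else goes through the same completion callback with no
   * special case for a future that happens to already be done.
   */
  CompletableFuture<RouteLookupResponse> asyncRlsCall(String key, boolean throttled) {
    if (throttled) {
      // Fail the pick right away instead of issuing an RPC.
      return CompletableFuture.failedFuture(new ThrottledException());
    }
    CompletableFuture<RouteLookupResponse> response = startRpc(key);
    // Always handle completion via the callback, even if the RPC completed
    // synchronously (which only happens with inprocess + directExecutor()).
    response.whenComplete((resp, err) -> updateCache(key, resp, err));
    return response;
  }

  CompletableFuture<RouteLookupResponse> startRpc(String key) {
    return CompletableFuture.completedFuture(new RouteLookupResponse("backend:" + key));
  }

  void updateCache(String key, RouteLookupResponse resp, Throwable err) {
    // Cache update / pending pick resolution would go here.
  }
}
```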

One small behavior change from sharing the combined asyncRlsCall with backoff is that we now always invalidate an entry after the backoff completes. Previously the code could replace the entry with its new value in one operation if the asyncRlsCall future completed immediately. That only mattered to a single test, which now sees an EXPLICIT eviction.

SynchronizationContext used to provide atomic scheduling in BackoffCacheEntry, but it was not guaranteeing that the scheduledRunnable was only accessed from the sync context. The same was true for calling up the LB tree with updateBalancingState(). In particular, adding entries to the cache during a pick could evict entries without running their cleanup methods within the context, and the same could happen when the RLS channel transitioned from TRANSIENT_FAILURE to READY. This was replaced with a bare Future guarded by a lock to provide atomicity.
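
For illustration, the "bare Future with a lock" pattern amounts to something like the following sketch; the field and method names here are made up, not the actual CachingRlsLbClient members.

```java
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; not the real CachingRlsLbClient code.
final class BackoffTimerSketch {
  private final Object lock = new Object();
  private final ScheduledExecutorService scheduler;
  // Guarded by lock; a plain Future instead of a SynchronizationContext.ScheduledHandle.
  private Future<?> scheduledFuture;

  BackoffTimerSketch(ScheduledExecutorService scheduler) {
    this.scheduler = scheduler;
  }

  /** Schedules the backoff expiration; safe to call from any thread. */
  void scheduleBackoff(long delayNanos, Runnable onBackoffDone) {
    synchronized (lock) {
      scheduledFuture = scheduler.schedule(onBackoffDone, delayNanos, TimeUnit.NANOSECONDS);
    }
  }

  /** Cancels a pending backoff timer, e.g. when the entry is evicted during a pick. */
  void cancelBackoff() {
    synchronized (lock) {
      if (scheduledFuture != null) {
        scheduledFuture.cancel(false);
        scheduledFuture = null;
      }
    }
  }
}
```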

BackoffCacheEntry no longer uses the current time and instead waits for the backoff timer to actually run before considering itself expired. Previously, it could race with periodic cleanup and get evicted before the timer ran, which would cancel the timer and forget the backoffPolicy. Since the backoff timer invalidates the entry, it is likely useless to claim it ever expires, but that level of behavior was preserved since I didn't look into the LRU cache deeply.
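
In sketch form, the expiration check changes from a clock comparison to a flag flipped by the timer itself (again, simplified and hypothetical names, not the real BackoffCacheEntry fields):

```java
// Simplified sketch; the real BackoffCacheEntry carries more state.
final class BackoffEntrySketch {
  private volatile boolean backoffDone;  // set only by the backoff timer

  /** Runs when the scheduled backoff timer fires; the real code also invalidates the entry. */
  void onBackoffTimerFired() {
    backoffDone = true;
  }

  /**
   * Previously roughly "now >= expireTime"; now true only after the timer ran,
   * so periodic cleanup cannot evict the entry (cancelling the timer and
   * dropping the backoffPolicy) before the timer has had a chance to run.
   */
  boolean isExpired() {
    return backoffDone;
  }
}
```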

propagateRlsError() was moved out of asyncRlsCall because it was not guaranteed to run after the cache was updated. If something was already running on the sync context, then RPCs would hang until another update caused updateBalancingState().

Some methods were moved out of the CacheEntry classes to avoid shared-state mutation in constructors. But if we add something in a factory method, we want to remove it in a sibling method of that factory method, so additional code was moved for symmetry. Moving shared-state mutation out of constructors is important because 1) it is surprising and 2) ErrorProne doesn't validate locking within constructors. More generally, having shared-state methods on the CacheEntry classes has the problem that ErrorProne can't validate CachingRlsLbClient's calls into CacheEntry: it can't know that "lock" is already held, because the CacheEntry could have been created by a different instance of CachingRlsLbClient and there is no way to let ErrorProne prove it is the same instance of "lock".
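
A sketch of the resulting shape (hypothetical names, not the real code) shows why this helps ErrorProne:

```java
import com.google.errorprone.annotations.concurrent.GuardedBy;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the actual CachingRlsLbClient/CacheEntry code.
final class ClientSketch {
  private final Object lock = new Object();

  @GuardedBy("lock")
  private final Map<String, Entry> cache = new HashMap<>();

  static final class Entry {
    final String key;

    // The constructor no longer touches shared state: ErrorProne does not
    // check locking inside constructors, and from inside Entry it could not
    // prove that "lock" belongs to this particular ClientSketch instance.
    Entry(String key) {
      this.key = key;
    }
  }

  // Shared-state mutation lives in a factory method on the client, where
  // ErrorProne can verify the @GuardedBy("lock") access.
  @GuardedBy("lock")
  Entry createEntry(String key) {
    Entry entry = new Entry(key);
    cache.put(key, entry);
    return entry;
  }

  // Sibling of the factory method: whatever createEntry adds, this removes.
  @GuardedBy("lock")
  void removeEntry(Entry entry) {
    cache.remove(entry.key);
  }
}
```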

DataCacheEntry still mutates global state that requires the lock in its constructor, but that is a less severe problem and addressing it requires more design choices.

ejona86 requested a review from larry-safran on Mar 11, 2024
ejona86 merged commit 6e97b18 into grpc:master on Apr 3, 2024 (15 checks passed)
ejona86 deleted the rls-sync branch on Apr 3, 2024
github-actions bot locked as resolved and limited conversation to collaborators on Jul 3, 2024