rls: Synchronization fixes in CachingRlsLbClient #10999

Merged: 4 commits merged into grpc:master from rls-sync on Apr 3, 2024

Conversation

@ejona86 (Member) commented Mar 11, 2024

This started with combining handleNewRequest with asyncRlsCall, but that emphasized pre-existing synchronization issues and trying to fix those exposed others. It was hard to split this into smaller commits because they were interconnected.

handleNewRequest was combined with asyncRlsCall to use a single code flow for handling the completed future while also failing the pick immediately for throttled requests. That flow was then reused for refreshing after backoff and on data staleness. It no longer optimizes for the RPC completing immediately, because that would not happen in real life; it only happens in tests because of inprocess+directExecutor(), and we don't want tests to exercise a different code flow. This did require updating some of the tests.
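
Roughly, the combined flow looks like the sketch below. This is only an illustration: `RouteLookupResponse`, `ThrottledException`, `startRpc`, and `updateCache` are simplified stand-ins using `CompletableFuture`, not the actual grpc-java RLS types.

```java
import java.util.concurrent.CompletableFuture;

// Simplified illustration only; not the actual CachingRlsLbClient types.
final class RlsLookupSketch {
  static final class ThrottledException extends Exception {}

  static final class RouteLookupResponse {
    final String target;
    RouteLookupResponse(String target) { this.target = target; }
  }

  /**
   * One code path for every lookup: throttled requests fail immediately,
   * and everything else goes through the same completion callback with no
   * special case for a future that happens to already be done.
   */
  CompletableFuture<RouteLookupResponse> asyncRlsCall(String key, boolean throttled) {
    if (throttled) {
      // Fail the pick right away instead of issuing an RPC.
      return CompletableFuture.failedFuture(new ThrottledException());
    }
    CompletableFuture<RouteLookupResponse> response = startRpc(key);
    // Always handle completion via the callback, even if the RPC completed
    // synchronously (which only happens with inprocess + directExecutor()).
    response.whenComplete((resp, err) -> updateCache(key, resp, err));
    return response;
  }

  CompletableFuture<RouteLookupResponse> startRpc(String key) {
    return CompletableFuture.completedFuture(new RouteLookupResponse("backend:" + key));
  }

  void updateCache(String key, RouteLookupResponse resp, Throwable err) {
    // Cache update / pending pick resolution would go here.
  }
}
```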

One small behavior change from sharing the combined asyncRlsCall with backoff is that we now always invalidate an entry after the backoff completes. Previously the code could replace the entry with its new value in one operation if the asyncRlsCall future completed immediately. That only mattered to a single test, which now sees an EXPLICIT eviction.

SynchronizationContext used to provide atomic scheduling in BackoffCacheEntry, but it was not guaranteeing that the scheduledRunnable was only accessed from the sync context. The same was true for calling up the LB tree with updateBalancingState(). In particular, adding entries to the cache during a pick could evict entries without running their cleanup methods within the context, and the same could happen when the RLS channel transitioned from TRANSIENT_FAILURE to READY. This was replaced with a bare Future guarded by a lock to provide atomicity.
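
For illustration, the "bare Future with a lock" pattern amounts to something like the following sketch; the field and method names here are made up, not the actual CachingRlsLbClient members.

```java
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; not the real CachingRlsLbClient code.
final class BackoffTimerSketch {
  private final Object lock = new Object();
  private final ScheduledExecutorService scheduler;
  // Guarded by lock; a plain Future instead of a SynchronizationContext.ScheduledHandle.
  private Future<?> scheduledFuture;

  BackoffTimerSketch(ScheduledExecutorService scheduler) {
    this.scheduler = scheduler;
  }

  /** Schedules the backoff expiration; safe to call from any thread. */
  void scheduleBackoff(long delayNanos, Runnable onBackoffDone) {
    synchronized (lock) {
      scheduledFuture = scheduler.schedule(onBackoffDone, delayNanos, TimeUnit.NANOSECONDS);
    }
  }

  /** Cancels a pending backoff timer, e.g. when the entry is evicted during a pick. */
  void cancelBackoff() {
    synchronized (lock) {
      if (scheduledFuture != null) {
        scheduledFuture.cancel(false);
        scheduledFuture = null;
      }
    }
  }
}
```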

BackoffCacheEntry no longer uses the current time and instead waits for the backoff timer to actually run before considering itself expired. Previously, it could race with periodic cleanup and get evicted before the timer ran, which would cancel the timer and forget the backoffPolicy. Since the backoff timer invalidates the entry, it is likely useless to claim it ever expires, but that level of behavior was preserved since I didn't look into the LRU cache deeply.
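
In sketch form, the expiration check changes from a clock comparison to a flag flipped by the timer itself (again, simplified and hypothetical names, not the real BackoffCacheEntry fields):

```java
// Simplified sketch; the real BackoffCacheEntry carries more state.
final class BackoffEntrySketch {
  private volatile boolean backoffDone;  // set only by the backoff timer

  /** Runs when the scheduled backoff timer fires; the real code also invalidates the entry. */
  void onBackoffTimerFired() {
    backoffDone = true;
  }

  /**
   * Previously roughly "now >= expireTime"; now true only after the timer ran,
   * so periodic cleanup cannot evict the entry (cancelling the timer and
   * dropping the backoffPolicy) before the timer has had a chance to run.
   */
  boolean isExpired() {
    return backoffDone;
  }
}
```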

propagateRlsError() was moved out of asyncRlsCall because it was not guaranteed to run after the cache was updated. If something was already running on the sync context, then RPCs would hang until another update caused updateBalancingState().

Some methods were moved out of the CacheEntry classes to avoid shared-state mutation in constructors. But if we add something in a factory method, we want to remove it in a sibling method of that factory method, so additional code was moved for symmetry. Moving shared-state mutation out of constructors is important because 1) it is surprising and 2) ErrorProne doesn't validate locking within constructors. More generally, having shared-state methods on the CacheEntry classes has the problem that ErrorProne can't validate CachingRlsLbClient's calls into CacheEntry: it can't know that "lock" is already held, because the CacheEntry could have been created by a different instance of CachingRlsLbClient and there is no way to let ErrorProne prove it is the same instance of "lock".
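
A sketch of the resulting shape (hypothetical names, not the real code) shows why this helps ErrorProne:

```java
import com.google.errorprone.annotations.concurrent.GuardedBy;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the actual CachingRlsLbClient/CacheEntry code.
final class ClientSketch {
  private final Object lock = new Object();

  @GuardedBy("lock")
  private final Map<String, Entry> cache = new HashMap<>();

  static final class Entry {
    final String key;

    // The constructor no longer touches shared state: ErrorProne does not
    // check locking inside constructors, and from inside Entry it could not
    // prove that "lock" belongs to this particular ClientSketch instance.
    Entry(String key) {
      this.key = key;
    }
  }

  // Shared-state mutation lives in a factory method on the client, where
  // ErrorProne can verify the @GuardedBy("lock") access.
  @GuardedBy("lock")
  Entry createEntry(String key) {
    Entry entry = new Entry(key);
    cache.put(key, entry);
    return entry;
  }

  // Sibling of the factory method: whatever createEntry adds, this removes.
  @GuardedBy("lock")
  void removeEntry(Entry entry) {
    cache.remove(entry.key);
  }
}
```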

DataCacheEntry still mutates global state that requires the lock in its constructor, but that is a less severe problem and addressing it requires more design choices.

ejona86 requested a review from larry-safran on Mar 11, 2024
ejona86 merged commit 6e97b18 into grpc:master on Apr 3, 2024 (15 checks passed)
ejona86 deleted the rls-sync branch on Apr 3, 2024
github-actions bot locked as resolved and limited conversation to collaborators on Jul 3, 2024