rls: Synchronization fixes in CachingRlsLbClient
This started with combining handleNewRequest with asyncRlsCall, but that emphasized pre-existing synchronization issues, and trying to fix those exposed others. It was hard to split this into smaller commits because the changes were interconnected.

handleNewRequest was combined with asyncRlsCall to use a single code flow for handling the completed future while also failing the pick immediately for throttled requests. That flow was then reused for refreshing after backoff and after data becomes stale. It no longer optimizes for the RPC completing immediately, because that does not happen in real life; it only happens in tests because of inprocess+directExecutor(), and we don't want tests to exercise a different code flow than production. This did require updating some of the tests.

One small behavior change needed to share the combined asyncRlsCall with backoff is that we now always invalidate an entry after the backoff. Previously the code could replace the entry with its new value in one operation if the asyncRlsCall future completed immediately. That only mattered to a single test, which now sees an EXPLICIT eviction.

SynchronizationContext used to provide atomic scheduling in BackoffCacheEntry, but it did not guarantee that scheduledRunnable was only accessed from the sync context. The same was true for calling up the LB tree with `updateBalancingState()`. In particular, adding entries to the cache during a pick could evict entries without running the cleanup methods within the context, and the same could happen when the RLS channel transitioned from TRANSIENT_FAILURE to READY. This was replaced with a bare Future guarded by a lock to provide atomicity (sketched below).

BackoffCacheEntry no longer uses the current time and instead waits for the backoff timer to actually run before considering itself expired. Previously, it could race with periodic cleanup and get evicted before the timer ran, which would cancel the timer and forget the backoffPolicy. Since the backoff timer invalidates the entry, it is probably useless for the entry to ever claim to be expired, but that level of behavior was preserved since I didn't look into the LRU cache deeply.

propagateRlsError() was moved out of asyncRlsCall because it was not guaranteed to run after the cache was updated. If something was already running on the sync context, RPCs would hang until another update caused updateBalancingState().

Some methods were moved out of the CacheEntry classes to avoid shared-state mutation in constructors. But if we add something in a factory method, we want to remove it in a sibling of that factory method, so additional code was moved for symmetry. Moving shared-state mutation out of constructors is important because 1) it is surprising and 2) ErrorProne doesn't validate locking within constructors. More generally, having shared-state methods on the CacheEntry classes means ErrorProne can't validate CachingRlsLbClient's calls to CacheEntry: it can't know that "lock" is already held, because the CacheEntry could have been created by a _different instance_ of CachingRlsLbClient, and there is no way to let ErrorProne prove it is the same instance of "lock" (see the factory-method sketch below).

DataCacheEntry still mutates global state that requires a lock in its constructor, but that is a less severe problem and addressing it requires more design choices.
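A minimal sketch of the bare-Future-plus-lock and timer-driven-expiration ideas above, under the assumption of a plain ScheduledExecutorService; the class, field, and method names are illustrative and are not the actual BackoffCacheEntry code:

```java
import com.google.errorprone.annotations.concurrent.GuardedBy;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative only: a bare Future guarded by a lock replaces SynchronizationContext for
 * atomic scheduling/cancellation, and expiration is driven by the timer actually firing
 * rather than by comparing against the current time.
 */
final class BackoffEntrySketch {
  private final Object lock = new Object();
  private final ScheduledExecutorService scheduler;

  @GuardedBy("lock")
  private Future<?> backoffTimer;

  @GuardedBy("lock")
  private boolean expired;

  BackoffEntrySketch(ScheduledExecutorService scheduler) {
    this.scheduler = scheduler;
  }

  /** Safe to call from any thread; no sync-context scheduling involved. */
  void scheduleBackoff(long delayNanos) {
    synchronized (lock) {
      backoffTimer = scheduler.schedule(this::onBackoffDone, delayNanos, TimeUnit.NANOSECONDS);
    }
  }

  private void onBackoffDone() {
    synchronized (lock) {
      expired = true;
      backoffTimer = null;
    }
    // ...invalidate the cache entry here so the next pick triggers a fresh lookup...
  }

  /** Periodic cleanup can no longer consider the entry expired before the timer has run. */
  boolean isExpired() {
    synchronized (lock) {
      return expired;
    }
  }

  /** Cancelling under the lock cannot race with scheduleBackoff(). */
  void shutdown() {
    synchronized (lock) {
      if (backoffTimer != null) {
        backoffTimer.cancel(false);
        backoffTimer = null;
      }
    }
  }
}
```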
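And a sketch of the factory-method point: moving the shared-state mutation out of the entry constructor into a @GuardedBy-annotated factory method (with a sibling removal method for symmetry) lets ErrorProne check that callers hold the lock, which it cannot do for code inside a constructor. Again, all names here are hypothetical stand-ins, not the real CachingRlsLbClient/CacheEntry types.

```java
import com.google.errorprone.annotations.concurrent.GuardedBy;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: shared-state mutation in checkable factory/removal methods. */
final class ClientSketch {
  private final Object lock = new Object();

  @GuardedBy("lock")
  private final Map<String, Entry> cache = new LinkedHashMap<>();

  /** ErrorProne verifies every caller holds "lock" before the shared map is mutated. */
  @GuardedBy("lock")
  Entry createEntry(String key) {
    Entry entry = new Entry(key);  // the constructor no longer touches shared state
    cache.put(key, entry);
    return entry;
  }

  /** Sibling of the factory method, so insertion and removal live side by side. */
  @GuardedBy("lock")
  void removeEntry(String key) {
    cache.remove(key);
  }

  static final class Entry {
    final String key;

    Entry(String key) {
      this.key = key;
    }
  }
}
```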