Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: remember last pick status in no real stream #11851

Merged
merged 9 commits into from
Feb 14, 2025

Conversation

shivaspeaks
Copy link
Member

@shivaspeaks shivaspeaks commented Jan 23, 2025

fixes #10749

kannanjgithub
kannanjgithub previously approved these changes Jan 23, 2025
@shivaspeaks shivaspeaks requested a review from ejona86 January 25, 2025 16:04
@@ -347,6 +350,7 @@ private class PendingStream extends DelayedStream {
private final PickSubchannelArgs args;
private final Context context = Context.current();
private final ClientStreamTracer[] tracers;
private Status lastPickStatus;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is accessed from multiple threads (pick attempts are single-threaded, but appendTimeoutInsight is a separate thread). I see the parent class DelayedStream is using this as a lock, and PendingStream could do the same for this variable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It still has synchronization problems. One assignment uses DelayedClientTransport.lock, another uses no lock, and a read uses PendingStream.this. To be useful, you have to use the same lock in all reads/writes.

Copy link
Member Author

@shivaspeaks shivaspeaks Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exploring ways that we can do:

I definitely cannot synchronize on local variable having references. That doesn't work. I cannot define a new lock at the top because then that will be blocking every stream (If we want this, we can definitely lock on DelayedClientTransport anyways). We need lock per pending stream.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ejona86 I believe AtomicReference to store Status is viable here. I made a commit, take a look.

Copy link
Contributor

@kannanjgithub kannanjgithub Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before worrying about the synchronization problem, there is a bigger problem. The PR doesn't help with the case of transport not established at all, and control will not go beyond GrpcUtil.getTransportFromPickResult != null.
Even if we didn't gate on the condition on transport != null ,pick result won't capture the transport status returned to the listener so (last) pick result will just be OK even when the picked sub channel fails to connect.

Copy link
Contributor

@kannanjgithub kannanjgithub Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to set up the pending stream as a listener to InternalSubChannel of the picker result instead of using the picker status itself, in newStream and reprocess(). InternalSubChannel will maintain a list of listeners and update them all in transportShutdown with the status. Whenever the picker result changes, reprocess should make each pending stream un-listen to the previous InternalSubChannel and listen to the new InternalSubChannel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current rough approach seems fine to me. What you may not notice is getTransportFromPickResult() is wait-for-ready-aware and queues if the RPC is wait-for-ready. What we want to do here is see the error that the RPC would have failed with if it wasn't wait-for-ready.

Since we're dealing with RPCs here, there must be no involvement with subchannel listeners. The LBs are responsible for that, and we don't know which subchannel is relevant. We need the LB to tell us if there is an error, which it does via the pick result.

When reviewing this earlier I saw that the flows can be simplified now that getTransportFromPickResult() is only used in one class (prior to 8844cf7 ManagedChannelImpl also used it). But I wanted to wait until this PR was merged before making such changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it now, the pick result is in fact the status of the failed transport. With multiple updates to pick result it seems I only noticed the first status update of OK when connecting but not the later one during transport shutdown. This is the same status that I was trying to surface it in other ways but that is redundant and is already available.

About lock for the pending stream's pick result status, is there any issue if appendTimeoutInsight accesses this value outside any lock at all? Like, Status cannot have some garbage value just because the read is interleaved with a locked write. There is the other issue of a non locked variable cached in the thread's register but here the pick result status accessed through object references can't be cached like that.

@kannanjgithub kannanjgithub dismissed their stale review January 30, 2025 11:26

Wait for review comments from Eric to be done.

@shivaspeaks shivaspeaks requested a review from ejona86 January 30, 2025 19:29
}
// User code provided authority takes precedence over the LB provided one.
if (callOptions.getAuthority() == null && pickResult.getAuthorityOverride() != null) {
stream.setAuthority(pickResult.getAuthorityOverride());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did these lines come about (looks like an incorrect merge)? This undoes the change done in https://github.com/grpc/grpc-java/pull/11862/files.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you're right. Somehow merge branch master into this branch caused it to happen.

@shivaspeaks shivaspeaks requested a review from ejona86 February 12, 2025 09:22
@shivaspeaks shivaspeaks merged commit 7585b16 into grpc:master Feb 14, 2025
16 checks passed
@shivaspeaks shivaspeaks deleted the stub-with-wait-for-ready branch February 14, 2025 06:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stub.withWaitForReady()'s DEADLINE_EXECEEDED lacks connection failure information
3 participants