fix!: explicitly close rpc connections when changing base node #6662
Conversation
Compare: 91582eb to b25e245
Test Results (Integration tests): 2 files, 1 error, 9 suites, 1h 34m 39s ⏱️. For more details on these parsing errors and failures, see this check. Results for commit e96337b. ♻️ This comment has been updated with latest results.
Test Results (CI): 3 files, 126 suites, 9m 47s ⏱️. Results for commit e96337b. ♻️ This comment has been updated with latest results.
It is an anti-pattern to keep an idle RPC session open. This commit periodically closes idle sessions. RPC establishment was optimised to not require a full RTT, allowing us to establish new sessions as needed.
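As a rough sketch of the idea only (hypothetical types, names, and tokio plumbing; not the actual Tari connectivity code), an idle-session reaper could look something like this:

```rust
use std::time::{Duration, Instant};

use tokio::{sync::Mutex, time};

// Hypothetical handle to an established RPC session; dropping it is assumed
// to close the underlying substream.
struct RpcSession {
    last_used: Instant,
}

impl RpcSession {
    fn is_idle(&self, max_idle: Duration) -> bool {
        self.last_used.elapsed() >= max_idle
    }
}

// Periodically drop sessions that have been idle for longer than `max_idle`.
// Re-establishing a session later is cheap because the handshake no longer
// needs a full round trip.
async fn reap_idle_sessions(sessions: &Mutex<Vec<RpcSession>>, max_idle: Duration) {
    let mut check_interval = time::interval(max_idle);
    loop {
        check_interval.tick().await;
        let mut sessions = sessions.lock().await;
        sessions.retain(|session| !session.is_idle(max_idle));
    }
}
```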
Hi there @sdbondi, looking good. I'm also running this branch for my testing. Just some comments.
base_layer/wallet/src/connectivity_service/base_node_peer_manager.rs (outdated; conversation resolved)
chore: reduce logs
test: cucumber logs
Looks good, a few comments and questions
propagation_source,
// Some other error occurred, we will not propagate, but we'll also not penalise the peer
// for a failure not related to the message contents.
gossipsub::MessageAcceptance::Ignore,
Following the above logic, should we also not reject on a ban reason?
Fair point, though I'm trying to avoid any issues with future gossiping. We handle the validation failure above, so this is not a result of validation failing; we're presumably banning the peer for some other reason which may not be related to the block in the gossip message, e.g. an IO error when requesting the block.
I would say that reducing their gossip score (which may result in them being removed from the mesh) because we could not connect to their RPC is not correct. However, since we were not able to validate the message itself, we cannot Accept it either. Open to thoughts.
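For readers following along, here is a minimal hedged sketch of the acceptance mapping being discussed. The `BlockHandlingOutcome` enum and `acceptance_for` helper are hypothetical; only `gossipsub::MessageAcceptance` is the real libp2p type.

```rust
use libp2p::gossipsub::MessageAcceptance;

// Hypothetical outcome of handling a gossiped block; the real handler in this
// PR distinguishes these cases inline.
enum BlockHandlingOutcome {
    Valid,
    ValidationFailed,
    // e.g. an IO error while requesting the full block over RPC
    UnrelatedError,
}

// Accept: propagate and improve the sender's gossip score.
// Reject: do not propagate and penalise the sender.
// Ignore: do not propagate, but do not penalise either.
fn acceptance_for(outcome: BlockHandlingOutcome) -> MessageAcceptance {
    match outcome {
        BlockHandlingOutcome::Valid => MessageAcceptance::Accept,
        BlockHandlingOutcome::ValidationFailed => MessageAcceptance::Reject,
        BlockHandlingOutcome::UnrelatedError => MessageAcceptance::Ignore,
    }
}
```

The key design choice is that Ignore leaves the sender's gossip score untouched, while Reject actively penalises it.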
*smt = blockchain_db.db_write_access()?.calculate_tip_smt()?;
warn!(target: LOG_TARGET, "Finished loading SMT into memory from stored db");
warn!(target: LOG_TARGET, "Finished loading SMT into memory from stored db in {:.2?}", timer.elapsed());
haha yeah, we should go back to caching it.
Doing it every block becomes too intensive, but we should have a checkpoint every 100 blocks or so.
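A minimal sketch of that checkpointing idea, assuming a hypothetical `SMT_CHECKPOINT_INTERVAL` constant and `persist_smt` callback (neither exists in the codebase):

```rust
// Hypothetical interval taken from the suggestion above; not an actual
// constant in the codebase.
const SMT_CHECKPOINT_INTERVAL: u64 = 100;

// Persist the in-memory SMT only on checkpoint heights so that a restarting
// node replays at most SMT_CHECKPOINT_INTERVAL blocks instead of rebuilding
// the whole tree from the stored db.
fn maybe_checkpoint_smt(height: u64, persist_smt: impl FnOnce()) {
    if height % SMT_CHECKPOINT_INTERVAL == 0 {
        persist_smt();
    }
}
```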
lol yeah, I was just curious whether loading the SMT was the reason for the base node taking forever to start up. An improvement is definitely needed here.
The long startup is probably due to waiting for 5 pings from nodes to decide on a sync peer.
No, it's before that; it takes 20s on our test esme network (from the log above).
Is it the SMT that takes that long?
target: LOG_TARGET,
"Failed to handle incoming transaction message: {:?}", e
);
// NOTE: We assume that all errors are due to a "bad" transaction.
I feel like this might be a bit broad...
But we can use this for now; we just have to monitor it.
Yeah, it could be a problem; let's see.
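One possible refinement, sketched here with a hypothetical error enum rather than the actual error types in the handler, would be to penalise the sender only when the transaction itself is provably at fault:

```rust
// Hypothetical error type for incoming transaction handling; the PR currently
// treats every error as a "bad" transaction.
enum TxHandlingError {
    // The transaction itself is invalid (bad signature, overspend, ...).
    InvalidTransaction,
    // A local or transient failure (db error, timeout) unrelated to the sender.
    Internal,
}

// Only treat the sender as misbehaving when the transaction itself is at fault.
fn should_penalise_sender(err: &TxHandlingError) -> bool {
    matches!(err, TxHandlingError::InvalidTransaction)
}
```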
@@ -41,7 +41,7 @@ impl Default for BaseNodeServiceConfig {
     fn default() -> Self {
         Self {
             base_node_monitor_max_refresh_interval: Duration::from_secs(30),
-            base_node_rpc_pool_size: 10,
+            base_node_rpc_pool_size: 3,
             event_channel_size: 250,
Isn't this perhaps too low?
I think a wallet can have the following running at the same time:
- TMS validation
- OMS validation
- base node status check
- base node tip
- UTXO scanning
I am not sure if I missed any here.
RPC is designed to share clients; basically you'll just have to wait for the previous request to complete before yours is started. Since then I've changed the behaviour to drop RPC sessions once they are not being used, so perhaps a good number is the number of services using it (by your count, 6).
Isn't "base node tip" using the messaging protocol through the liveness service and not RPC? I'll audit this and come up with a minimal number.
It seems it's used in 8 places; it's difficult to know how many are required at the same time. In my rather basic testing it never got above 3, and if it is ever a problem, any of the services can wait some bounded time for responses. To be safe I'll set it to 8.
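To illustrate the sharing behaviour described above, here is a hedged sketch of a bounded client pool built on a tokio semaphore. `RpcClientPool` is hypothetical; the real wallet pool is more involved. Callers queue for one of `pool_size` slots instead of erroring when the pool is exhausted.

```rust
use std::{future::Future, sync::Arc};

use tokio::sync::Semaphore;

// Hypothetical pool: at most `pool_size` RPC clients are in use at once and
// callers wait for a free slot rather than failing when the pool is busy.
struct RpcClientPool {
    permits: Arc<Semaphore>,
}

impl RpcClientPool {
    fn new(pool_size: usize) -> Self {
        Self {
            permits: Arc::new(Semaphore::new(pool_size)),
        }
    }

    // Waits for a free client slot, then runs the request; the wait is bounded
    // only by whatever timeout the caller applies.
    async fn with_client<T>(&self, request: impl Future<Output = T>) -> T {
        let _permit = self.permits.acquire().await.expect("pool semaphore closed");
        request.await
    }
}
```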
more logging
persist changes to peer_manager in the watch
updates to config files
feat: wallet connectivity improvements
I conducted many system-level stress tests with this branch and feel that the libp2p implementation is pretty solid for almost all use cases, and am happy to approve this PR. Outstanding issues are logged in https://github.com/orgs/tari-project/projects/25/views/1.
Nice.
ACK
Description
fix!: explicitly close rpc connections when changing base node
establish relay quickly
fix wallet connectivity logic
Motivation and Context
How Has This Been Tested?
What process can a PR reviewer use to test or verify this change?
Breaking Changes
BREAKING CHANGE: the RPC protocol adds "check bytes" that are periodically sent by the server. These bytes will cause errors in older clients.
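For context only, a heavily hedged sketch of the general "periodic check bytes" idea. The `CHECK_BYTE` value and raw `TcpStream` transport are hypothetical; the real framing and transport are defined by the Tari RPC implementation.

```rust
use std::time::Duration;

use tokio::{io::AsyncWriteExt, net::TcpStream, time};

// Hypothetical marker value; the real wire format is defined by the Tari RPC
// implementation. Older clients break because they receive bytes they do not
// expect and cannot decode the stream.
const CHECK_BYTE: u8 = 0x7E;

// Periodically write a check byte so the server can detect dead connections
// and the client can verify the session is still alive.
async fn send_periodic_check_bytes(
    stream: &mut TcpStream,
    every: Duration,
) -> std::io::Result<()> {
    let mut ticker = time::interval(every);
    loop {
        ticker.tick().await;
        stream.write_all(&[CHECK_BYTE]).await?;
        stream.flush().await?;
    }
}
```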