
fix!: explicitly close rpc connections when changing base node #6662

Merged: 32 commits from libp2p-fixes into tari-project:feat-libp2p on Nov 8, 2024

Conversation

@sdbondi sdbondi (Member) commented Nov 1, 2024

Description

  • fix!: explicitly close rpc connections when changing base node
  • establish relay quickly
  • fix wallet connectivity logic

Motivation and Context

How Has This Been Tested?

What process can a PR reviewer use to test or verify this change?

Breaking Changes

  • None
  • Requires data directory on base node to be deleted
  • Requires hard fork
  • Other - Please specify

BREAKING CHANGE: the RPC protocol now includes "check bytes" that the server sends periodically. These bytes will cause errors in older clients.
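The PR text does not spell out the wire format; purely as an illustration of the idea, a server-side task that periodically writes a check byte might look like the sketch below. The sentinel value, the interval, and the use of a raw TcpStream are assumptions, not the actual Tari RPC framing.

```rust
// Illustrative only: the sentinel value, interval and framing are assumptions,
// not the actual Tari RPC protocol.
use std::time::Duration;
use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

const CHECK_BYTE: u8 = 0xA5; // hypothetical sentinel value
const CHECK_INTERVAL: Duration = Duration::from_secs(30); // hypothetical period

// Periodically write a "check byte": a failed write tells the server the
// connection is gone, while clients that predate the change error out on the
// unexpected byte instead of hanging on a stale session.
async fn send_check_bytes(mut stream: TcpStream) -> std::io::Result<()> {
    let mut ticker = tokio::time::interval(CHECK_INTERVAL);
    loop {
        ticker.tick().await;
        stream.write_all(&[CHECK_BYTE]).await?;
        stream.flush().await?;
    }
}
```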

@sdbondi sdbondi force-pushed the libp2p-fixes branch 3 times, most recently from 91582eb to b25e245 Compare November 1, 2024 07:21

github-actions bot commented Nov 1, 2024

Test Results (Integration tests)

2 files, 1 error, 9 suites, 1h 34m 39s ⏱️
18 tests: 14 ✅, 0 💤, 4 ❌
26 runs: 14 ✅, 0 💤, 12 ❌

For more details on these parsing errors and failures, see this check.

Results for commit e96337b.

♻️ This comment has been updated with latest results.


github-actions bot commented Nov 1, 2024

Test Results (CI)

3 files, 126 suites, 9m 47s ⏱️
1 160 tests: 1 160 ✅, 0 💤, 0 ❌
3 480 runs: 3 480 ✅, 0 💤, 0 ❌

Results for commit e96337b.

♻️ This comment has been updated with latest results.

@sdbondi sdbondi changed the title from "fix: explicitly close rpc connections when changing base node" to "fix!: explicitly close rpc connections when changing base node" Nov 1, 2024
@sdbondi sdbondi marked this pull request as ready for review November 1, 2024 15:02
@sdbondi sdbondi requested a review from a team as a code owner November 1, 2024 15:02
It is an anti-pattern to keep an idle RPC session open. This commit periodically closes idle sessions. RPC establishment was optimised to not require a full RTT, allowing us to establish new sessions as needed.
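The commit message above only summarises the mechanism; as a minimal sketch of an idle-session reaper, assuming a hypothetical RpcSession type and a made-up 60 s idle timeout:

```rust
use std::time::{Duration, Instant};

// Hypothetical session handle; the real pool tracks RPC clients.
struct RpcSession {
    last_used: Instant,
}

const IDLE_TIMEOUT: Duration = Duration::from_secs(60); // assumed value

// Drop sessions that have been idle for too long. Because establishing a
// session no longer costs a full round trip, recreating one on demand is cheap.
fn reap_idle(sessions: &mut Vec<RpcSession>) {
    sessions.retain(|s| s.last_used.elapsed() < IDLE_TIMEOUT);
}

fn main() {
    let mut sessions = vec![RpcSession { last_used: Instant::now() }];
    reap_idle(&mut sessions); // in practice this would run on a timer
    println!("{} session(s) still open", sessions.len());
}
```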
@hansieodendaal hansieodendaal (Contributor) left a comment

Hi there @sdbondi, looking good; I'm also running this branch for my testing. Just some comments.

@SWvheerden SWvheerden (Collaborator) left a comment

Looks good, a few comments and questions

propagation_source,
// Some other error occurred, we will not propagate, but we'll also not penalise the peer
// for a failure not related to the message contents.
gossipsub::MessageAcceptance::Ignore,
Collaborator

Following the above logic, should we also not reject on a ban reason?

@sdbondi sdbondi (Member, Author) commented Nov 7, 2024

Fair point, though I'm trying to avoid any issues with future gossiping. We handle the validation failure above, so this is not a result of validation failing; we're presumably banning the peer for some other reason which may not be related to the block in the gossip message, e.g. an IO error when requesting the block.

I would say that reducing their gossip score (which may result in them being removed from the mesh) because we could not connect to their RPC is not correct. However, since we were not able to validate the message itself, we cannot Accept it either. Open to thoughts.
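For context, the trade-off discussed here maps onto gossipsub's three validation outcomes. A simplified sketch of that mapping follows; the enum mirrors libp2p's gossipsub::MessageAcceptance, while BlockError and its variants are hypothetical:

```rust
// Stand-in for libp2p's gossipsub::MessageAcceptance; BlockError is hypothetical.
#[derive(Debug)]
enum MessageAcceptance {
    Accept, // validated: propagate and improve the sender's gossip score
    Reject, // invalid content: don't propagate, penalise the sender
    Ignore, // couldn't validate (local failure): don't propagate, no penalty
}

enum BlockError {
    InvalidBlock,   // the gossiped block itself failed validation
    RpcUnreachable, // we failed to fetch the data needed to validate it
}

fn acceptance_for(result: Result<(), BlockError>) -> MessageAcceptance {
    match result {
        Ok(()) => MessageAcceptance::Accept,
        Err(BlockError::InvalidBlock) => MessageAcceptance::Reject,
        // A local failure says nothing about the peer's message, so don't
        // hurt their gossip score, but don't propagate it either.
        Err(BlockError::RpcUnreachable) => MessageAcceptance::Ignore,
    }
}

fn main() {
    println!("{:?}", acceptance_for(Ok(())));                          // Accept
    println!("{:?}", acceptance_for(Err(BlockError::InvalidBlock)));   // Reject
    println!("{:?}", acceptance_for(Err(BlockError::RpcUnreachable))); // Ignore
}
```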

*smt = blockchain_db.db_write_access()?.calculate_tip_smt()?;
warn!(target: LOG_TARGET, "Finished loading SMT into memory from stored db");
warn!(target: LOG_TARGET, "Finished loading SMT into memory from stored db in {:.2?}", timer.elapsed());
Collaborator

Haha, yeah, we should go back to caching it. Doing it every block becomes too intensive, but we should have a checkpoint, like every 100 blocks or so.
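A rough sketch of that checkpoint idea, assuming a made-up interval constant and a placeholder persistence function (neither is the actual blockchain_db API):

```rust
const SMT_CHECKPOINT_INTERVAL: u64 = 100; // assumed: persist every 100 blocks

// Persist the sparse Merkle tree every N blocks so startup only replays the
// blocks since the last checkpoint instead of recalculating the tip SMT.
fn maybe_checkpoint_smt(height: u64) {
    if height % SMT_CHECKPOINT_INTERVAL == 0 {
        save_smt_checkpoint(height);
    }
}

// Placeholder for serialising the SMT to the database.
fn save_smt_checkpoint(height: u64) {
    println!("checkpointing SMT at height {height}");
}

fn main() {
    for height in 1..=300u64 {
        maybe_checkpoint_smt(height); // would run as part of block insertion
    }
}
```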

Member Author

lol yeah, I was just curious whether loading the SMT was the reason for the base node taking forever to start up. Improvement definitely needed here.

Collaborator

The long startup is probably from waiting for 5 pings from nodes to decide on a sync peer.

Member Author

No, it's before that; it takes 20s on our test esme network (from the log above).

Collaborator

Is it the SMT that takes that long?

target: LOG_TARGET,
"Failed to handle incoming transaction message: {:?}", e
);
// NOTE: We assume that all errors are due to a "bad" transaction.
Collaborator

I feel like this might be a bit broad... but we can use it for now; we just have to monitor it.

Member Author

Yeah, it could be a problem - let's see.
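To illustrate how the blanket "all errors mean a bad transaction" assumption could be narrowed later, here is a minimal sketch under hypothetical error variants; the real handler's error type is not shown in this thread:

```rust
// Hypothetical error classification; the real handler currently treats every
// error as a "bad" transaction, which is the broadness being discussed above.
enum TxHandleError {
    InvalidTransaction, // the transaction itself is malformed or invalid
    Internal,           // e.g. a local DB or IO failure while handling it
}

// Only penalise the sender when the failure actually implicates the message.
fn should_penalise_sender(err: &TxHandleError) -> bool {
    matches!(err, TxHandleError::InvalidTransaction)
}

fn main() {
    assert!(should_penalise_sender(&TxHandleError::InvalidTransaction));
    assert!(!should_penalise_sender(&TxHandleError::Internal));
    println!("only validation failures penalise the sender");
}
```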

@@ -41,7 +41,7 @@ impl Default for BaseNodeServiceConfig {
     fn default() -> Self {
         Self {
             base_node_monitor_max_refresh_interval: Duration::from_secs(30),
-            base_node_rpc_pool_size: 10,
+            base_node_rpc_pool_size: 3,
             event_channel_size: 250,
Collaborator

Is this perhaps not too low?
I think a wallet can have the following running at the same time:

  • TMS validation
  • OMS validation
  • base node status check
  • base node tip
  • UTXO scanning

I am not sure if I missed any here.

@sdbondi sdbondi (Member, Author) commented Nov 7, 2024

RPC is designed to share clients: basically you'll just have to wait for the previous request to complete before yours is started. Since then I've changed the behaviour to drop RPC sessions once they are not being used, so perhaps a good number is the number of services using it; by your count, 6.
Isn't "base node tip" using the messaging protocol through the liveness service and not RPC? Will audit this and come up with a minimal number.

Member Author

Seems it's used in 8 places; it's difficult to know how many are required at the same time. In my rather basic testing it never got above 3, and even if the pool is exhausted it is not really a problem for any of the services to wait some bounded time for responses. To be safe I'll set it to 8.
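To illustrate the "callers simply wait for a free client" behaviour described above, here is a minimal sketch that bounds concurrent RPC use with a tokio semaphore; the pool size and structure are assumptions, not the wallet's actual RPC client pool:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;

// Assumed pool bound; mirrors base_node_rpc_pool_size in spirit only.
const POOL_SIZE: usize = 8;

#[tokio::main]
async fn main() {
    let pool = Arc::new(Semaphore::new(POOL_SIZE));

    // Each "service" grabs a permit (an RPC client slot), does its work and
    // releases it; when the pool is exhausted, callers simply wait their turn.
    let mut tasks = Vec::new();
    for service in 0..10u32 {
        let pool = pool.clone();
        tasks.push(tokio::spawn(async move {
            let _permit = pool.acquire_owned().await.expect("semaphore closed");
            // Placeholder for an actual RPC call.
            tokio::time::sleep(Duration::from_millis(50)).await;
            println!("service {service} finished its RPC call");
        }));
    }
    for t in tasks {
        t.await.expect("task panicked");
    }
}
```

With 10 callers and 8 permits, the two extra callers queue until a slot frees up rather than failing.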

@hansieodendaal hansieodendaal (Contributor) commented

I conducted many system-level stress tests with this branch and feel that the libp2p implementation is pretty solid for almost all use cases, and am happy to approve this PR. Outstanding issues are logged in https://github.com/orgs/tari-project/projects/25/views/1.

@hansieodendaal hansieodendaal (Contributor) left a comment

Nice.

ACK

@SWvheerden SWvheerden merged commit 50bd9ae into tari-project:feat-libp2p Nov 8, 2024
15 of 17 checks passed