-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Failsafes to prevent a consensus round from taking too long #5277
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #5277 +/- ##
=======================================
Coverage 78.1% 78.1%
=======================================
Files 790 790
Lines 67911 67988 +77
Branches 8234 8252 +18
=======================================
+ Hits 53025 53093 +68
- Misses 14886 14895 +9
🚀 New features to boost your workflow:
|
f07992d
to
76c27a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be cool to have a unit test for the last part of DisputedTx.h
; not sure how realistic that request is. Approved in any case.
6e513d9
to
26ab221
Compare
8dcbf91
to
a6d3cea
Compare
- Stable state means that neither we, nor any of our peers has changed a vote on a disputed transaction in a while. This is undesirable if an 80% consensus has not otherwise been reached. It will cause a validation to be sent, which will help get other (trusting) validators back on track using preferred ledger logic.
a6d3cea
to
197356b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current version fails to build on MacOS because Mac's version of libstdc++ is dropping the assignment operator for std::map
pairs. There are two viable fixes:
-
Add an assignment operator to
ConsensusParms
:
We could add a custom assignment operator forConsensusParms
that, instead of assigning theavalancheCutoffs
map directly (which triggers the error), manually copies its contents into the target map. -
Make
avalancheCutoffs
const and remove the assignment:
By declaring the map as const, you avoid any assignment after its construction. I would prefer this option, but it means that we must update the unit tests that does
peer->consensusParms = parms;
so that it no longer tries to perform such an assignment.
* upstream/develop: chore: Rename missing-commits job, and combine nix job files (5268)
* upstream/develop: fix: Replace charge() by fee_.update() in OnMessage functions (5269) docs: ensure build_type and CMAKE_BUILD_TYPE match (5274) chore: Fix small typos in protocol files (5279)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two suggestions for now; I admire the unit test coverage of this but am not yet finished.
src/xrpld/consensus/ConsensusParms.h
Outdated
// See if enough time has passed to move on to the next. | ||
XRPL_ASSERT( | ||
nextCutoff.consensusTime >= currentCutoff.consensusTime, | ||
"ripple::DisputedTx::updateVote next state valid"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ripple::DisputedTx::updateVote next state valid"); | |
"ripple::getNeededWeight : next state valid"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. Probably a cut/paste error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good change. I like the unis tests coverage (but an open question re. commented out section), I like how the new Expired
state only kicks in when we are not stalled
on all transactions (nice decoupling on both states).
* upstream/develop: Set version to 2.4.0 Set version to 2.4.0-rc4 chore: Update XRPL Foundation Validator List URL (5326)
* upstream/develop: refactor: Remove unused and add missing includes (5293)
- Document "stalled" in checkConsensusReached. Also return early. - Log stalled consensus at higher level than regular. - const correctness (stalled in haveConsensus) - Change type of AvalancheCutoff::consensusTime. - Fix XRPL_ASSERT label - Log tx ID on unchanged dispute vote
src/xrpld/consensus/ConsensusParms.h
Outdated
// See if enough time has passed to move on to the next. | ||
XRPL_ASSERT( | ||
nextCutoff.consensusTime >= currentCutoff.consensusTime, | ||
"ripple::getNeededWeight next state valid"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put colon :
between scope name and description
"ripple::getNeededWeight next state valid"); | |
"ripple::getNeededWeight : next state valid"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please put colon
:
between scope name and description
Fixed
Would the better name for the state be a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. One minor suggestion for PR description:
This led to a deadlock-like situation where every node was waiting for some other node to make a change, while none of the nodes were willing to change.
Would the better name for the state be a livelock
rather than a deadlock
?
From wiki : "A livelock is similar to a deadlock, except that the states of the processes involved in the livelock constantly change with regard to one another, none progressing."
Proposed commit message:
|
High Level Overview of Change
This PR, if merged, will introduce two fail safes into the consensus logic to prevent a consensus round from remaining in the establish phase indefinitely.
Context of Change
At about 9:54pm UTC on 2/4/2025, the network successfully validated ledger 93927173, and started the consensus round for 93927174. That round did not end for over an hour.
The current evidence indicates that two things happened.
This led to a livelock situation where every node was waiting for some other node to make a change, while none of the nodes were willing to change.
This decision algorithm has been in place for at least 8 years, and possibly since the first release of
rippled
. The odds of it happening were thought to be 0, but it turns out they're just very very small.Type of Change
This change is fully backward and forward compatible, and does not require an amendment.
Before / After
This is an outline of the changes in this PR.
NetworkOPsImp::processTrustedProposal
, if the proposal is from us, it don't process it. This should be impossible, but this will help confirm if not.checkConsensusReached
's other restrictions, such as minimum proposers, still apply)avMIN_ROUNDS
) "inner rounds" ofphaseEstablish
,avSTALLED_ROUNDS
) "inner rounds",ledgerMAX_CONSENSUS
) to 120s (ledgerABANDON_CONSENSUS
).phaseEstablish
), thenConsensusState::Expired
is treated asConsensusState::No
.ConsensusParms.h
has been rewritten as more of an explicit state machine. It's basically a map of states to their time cutoff, percentage cutoff, and next state, and agetNeededWeight
function that will evaluate whether to move to the next state.avMIN_ROUNDS
) "inner rounds" in each state. Close time consensus does not have this restriction (but it could).