-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: migration finalization process #4576
Conversation
479031d
to
dc22dc9
Compare
[&](const auto& flow) { return flow->GetLastAttempt() == attempt; }))) { | ||
// if data was sent after LSN, WaitFor() always returns false so to reduce wait time | ||
// we check current state and if WaitFor false but GetLastAttempt() == attempt | ||
// the Join is failed and we can return false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really true? I enter the body of the while(true)
for the first time and I call
wait_res = bc_->WaitFor(wait_time)
and it returns false because it got notified. Now I check with std::_all_of
and it returns true
. Why did the migration fail here ? Isn't that the happy path ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No because waitFor returns false if data was sent after LSN or if attempt isn't correct
if ((bc_->WaitFor(absl::ToInt64Milliseconds(timeout - passed) * 1ms)) && | ||
(std::all_of(shard_flows_.begin(), shard_flows_.end(), | ||
[&](const auto& flow) { return flow->GetLastAttempt() == attempt; }))) { | ||
// if data was sent after LSN, WaitFor() always returns false so to reduce wait time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe instead of
bc_->Add(); // the flow isn't finished so we lock it again
in line 95 you should do bc_->Cancle() ? wouldnt this fix the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, we need to change our migration finalize algorithm. The current fix should work OK. let's merge it and then we can discuss changes, because we want to add the force finish if Pause failed
f08c722
to
872e6cb
Compare
872e6cb
to
b96665e
Compare
state_.store(MigrationState::C_FINISHED); | ||
keys_number_ = cluster::GetKeyCount(slots_); | ||
} | ||
return wait_res; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will say this flow is a bit confusing
In this flow we might return false here but not print warning and call ReportError as we have in the condition (passed >= timeout)
Does this makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point, to my mind it's not really needed because we send the error to the source node in anycase. But it can help a little to debug so I will add them
fixes: #4551
problem: Pause() failed and data was sent to the target node after LSN, so migration can't be finished