Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: migration finalization process #4576

Merged
merged 3 commits into from
Feb 10, 2025

Conversation

BorysTheDev
Copy link
Contributor

@BorysTheDev BorysTheDev commented Feb 7, 2025

fixes: #4551
problem: Pause() failed and data was sent to the target node after LSN, so migration can't be finished

@BorysTheDev BorysTheDev changed the title fix: migration finalization process fix: migration finalization process NOT READY FOR REVIEW Feb 7, 2025
@BorysTheDev BorysTheDev force-pushed the fix_migration_finalization_process branch from 479031d to dc22dc9 Compare February 7, 2025 10:55
@BorysTheDev BorysTheDev changed the title fix: migration finalization process NOT READY FOR REVIEW fix: migration finalization process Feb 7, 2025
src/server/cluster/incoming_slot_migration.cc Outdated Show resolved Hide resolved
src/server/cluster/outgoing_slot_migration.cc Show resolved Hide resolved
[&](const auto& flow) { return flow->GetLastAttempt() == attempt; }))) {
// if data was sent after LSN, WaitFor() always returns false so to reduce wait time
// we check current state and if WaitFor false but GetLastAttempt() == attempt
// the Join is failed and we can return false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really true? I enter the body of the while(true) for the first time and I call

wait_res = bc_->WaitFor(wait_time) and it returns false because it got notified. Now I check with std::_all_of and it returns true. Why did the migration fail here ? Isn't that the happy path ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No because waitFor returns false if data was sent after LSN or if attempt isn't correct

if ((bc_->WaitFor(absl::ToInt64Milliseconds(timeout - passed) * 1ms)) &&
(std::all_of(shard_flows_.begin(), shard_flows_.end(),
[&](const auto& flow) { return flow->GetLastAttempt() == attempt; }))) {
// if data was sent after LSN, WaitFor() always returns false so to reduce wait time
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of
bc_->Add(); // the flow isn't finished so we lock it again
in line 95 you should do bc_->Cancle() ? wouldnt this fix the issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, we need to change our migration finalize algorithm. The current fix should work OK. let's merge it and then we can discuss changes, because we want to add the force finish if Pause failed

@BorysTheDev BorysTheDev force-pushed the fix_migration_finalization_process branch from f08c722 to 872e6cb Compare February 10, 2025 10:07
@BorysTheDev BorysTheDev force-pushed the fix_migration_finalization_process branch from 872e6cb to b96665e Compare February 10, 2025 10:08
state_.store(MigrationState::C_FINISHED);
keys_number_ = cluster::GetKeyCount(slots_);
}
return wait_res;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will say this flow is a bit confusing
In this flow we might return false here but not print warning and call ReportError as we have in the condition (passed >= timeout)
Does this makes sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, to my mind it's not really needed because we send the error to the source node in anycase. But it can help a little to debug so I will add them

@BorysTheDev BorysTheDev merged commit d0087aa into main Feb 10, 2025
10 checks passed
@BorysTheDev BorysTheDev deleted the fix_migration_finalization_process branch February 10, 2025 13:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cluster_test.py::test_migration_rebalance_node
3 participants