
[specs] Using 10 schedules for random exploration can be insufficient for correctness checking/fuzzing #10

Open
AD1024 opened this issue Feb 28, 2025 · 4 comments

@AD1024

AD1024 commented Feb 28, 2025

As shown in the referenced script:

[Int] $NumIters = 10,

Testing with PChecker runs only 10 schedules, which can be insufficient. By default, PChecker (the fuzzer for P models) uses random exploration: it configures the model instances with non-deterministic inputs and explores different orderings of event interleavings. Ten schedules may therefore not expose enough behaviors of the P model, which gives low confidence in its correctness. We would suggest using more schedules, e.g., 10k.
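
To illustrate what "non-deterministic inputs" means here, a generic P test driver might look like the sketch below. TestDriver and Client are hypothetical names, not machines from the 3FS specs. Each schedule resolves every choose() and $ differently, so 10 schedules sample only a tiny slice of the combined input and interleaving space.

// Hypothetical sketch (TestDriver and Client are placeholder names, not from
// the 3FS specs): a P test driver that configures the model with
// non-deterministic inputs. Every schedule resolves choose() and $ differently,
// so 10 schedules sample only a tiny fraction of the behavior space.
machine TestDriver {
    start state Init {
        entry {
            var i: int;
            var numClients: int;
            // non-deterministically pick 1..3 client machines for this schedule
            numClients = choose(3) + 1;
            i = 0;
            while (i < numClients) {
                new Client();   // Client stands in for a machine of the model under test
                i = i + 1;
            }
        }
    }
}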

We assume the reason for using 10 rather than more is that PChecker usually takes too long to finish checking these models, especially DataStorage. This is usually due to instantiating multiple Timers in the P model. Timers can drastically slow down PChecker, since eTimeOut is issued through a non-deterministic check, as shown below:

state WaitForTimerRequests {
    on eStartTimer goto TimerStarted;
    ignore eCancelTimer, eDelayedTimeOut;
}
state TimerStarted {
    defer eStartTimer;
    entry {
        // non-deterministically decide whether the timeout fires on this entry
        if ($)
        {
            send client, eTimeOut;
            goto WaitForTimerRequests;
        }
        else
        {
            // otherwise re-enter this state via a self-addressed event and retry
            send this, eDelayedTimeOut;
        }
    }
    on eDelayedTimeOut goto TimerStarted;
    on eCancelTimer goto WaitForTimerRequests;
}

If the non-deterministic check fails, the machine cycles back to the same state. During random exploration this produces a large number of redundant eDelayedTimeOut events (especially when there are multiple timers), which can easily exhaust the maximum number of steps. A potential workaround is:
instead of assigning each component its own timer to signal timeouts, use a "ghost" view server that controls timeouts centrally. An example is here: https://github.com/AD1024/P-Tutorial-Raft/blob/44d2e349a804109d1c1dcc2658dcf89d441d8880/PSrc/View.p#L128-L184
The idea is that when an eTimeOut occurs, the view server randomly chooses some components and informs them of the timeout. This way, PChecker is not overwhelmed by eDelayedTimeOut events. A minimal sketch of this pattern is given below.
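
The following is only a rough sketch of the idea, not the actual View.p code linked above. eNodeTimeout is a hypothetical event, and CreateTimer/StartTimer are assumed to be the helpers from the P tutorial Timer module.

// Minimal sketch of the central-timeout idea (an assumed shape, not the actual
// View.p implementation). A single "ghost" server owns the only Timer in the
// system; on each eTimeOut it non-deterministically decides which nodes observe
// a timeout, so PChecker is not flooded with per-node eDelayedTimeOut events.
// eNodeTimeout is a hypothetical event; CreateTimer/StartTimer are assumed to
// come from the P tutorial Timer module.
event eNodeTimeout;

machine GhostTimeoutServer {
    var nodes: seq[machine];
    var timer: Timer;

    start state Init {
        entry (participants: seq[machine]) {
            nodes = participants;
            timer = CreateTimer(this);
            StartTimer(timer);
            goto WaitForTimeout;
        }
    }

    state WaitForTimeout {
        on eTimeOut do {
            var i: int;
            i = 0;
            while (i < sizeof(nodes)) {
                // non-deterministically decide whether this node times out now
                if ($) {
                    send nodes[i], eNodeTimeout;
                }
                i = i + 1;
            }
            // re-arm the timer for the next round of timeouts
            StartTimer(timer);
        }
    }
}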

Hope this helps with the P modeling.

@wlxiong wlxiong self-assigned this Feb 28, 2025
@ankushdesai

The number 10 was added here just to make the test scripts run faster. Internally, the team ran the models for more schedules.

We also noticed that there are some invariants from the CRAQ paper that are not checked.

@AD1024
Author

AD1024 commented Feb 28, 2025

The number 10 was added here just to make the test scripts run faster. Internally, the team ran the models for more schedules.

We also noticed that there are some invariants from the CRAQ paper that are not checked.

I see. That makes more sense. Still, I think we should either have a Timer primitive (given how frequently timers are used to model faults) or a note on the proper way to handle timeouts.

@wlxiong
Contributor

wlxiong commented Mar 1, 2025

The P model gives the dev team a clear view of the CRAQ variant implemented in 3FS. Besides being a formal spec for verifying the correctness of the system, it functions more like pseudocode for the chunk storage system.

During development, we follow an iterative process: after making small changes, we run all tests for 10 schedules to catch easy bugs. If everything looks good, we start multiple P checkers in parallel, each running 1000 schedules. At the end of the day, we submit jobs that start multiple P checkers, each running 10000 schedules, and check the results the next morning.

A potential workaround is: instead of assigning each component a timer to inform about the timeout, use a "ghost" view server to have a central control of timeout.

Yes, using a global clock is more efficient in this case. Actually, we had a similar implementation in a previous version: the heartbeat check loop in MgmtService served as the global clock. But implementing it in an independent actor seems to be a better approach.

We also noticed that there are some invariants from the CRAQ paper that are not checked.

Can you be more specific? To this day, we still cannot persuade ourselves that we have fully grasped the intricacies of CRAQ. A working implementation needs to fill in lots of missing pieces.

@ankushdesai

This is very exciting, thanks for sharing! I will get back to you on the spec with details in a couple of days.
