Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use nproc for clarity and add a unit test to verify that the message causing the crash is dropped #181

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mapogolions
Copy link
Contributor

@mapogolions mapogolions commented Jan 25, 2025

Use nproc for clarity and add a unit test to verify that the message causing the crash is dropped

@anthdm
Copy link
Owner

anthdm commented Jan 25, 2025

Interesting. Thanks will look into this today. Thanks!!

@anthdm
Copy link
Owner

anthdm commented Jan 29, 2025

I see what you mean here. But is dropping the message that caused the crash the correct behaviour? Maybe its an internal state that is corrupt that has nothing to do with the message that has just been processed and caused the crash. Maybe a restart and restoring the actor back from its state in the DB, could not crash anymore on the message that crashed it.

What are your thoughts?

@tprifti

@anthdm anthdm requested a review from perbu January 29, 2025 18:56
@mapogolions
Copy link
Contributor Author

It is important to understand that the algorithm has not changed at all. I have merely reordered the variables to explain why the increment of nproc at the end of the loop and its usage in recovery caused some tests to freeze. The added test will successfully pass on the code from the master branch. The current implementation (master branch) follows a minimalist paradigm, which assumes that if a message is read but cannot be processed, we simply move on to the next one.

// Processed 'nproc' messages successfully, then encountered a crash on the next message.
// After restart, processing begins with the message following the one that caused the crash.
// 'nrecv' represents the total number of successfully received messages, including the one that caused the crash.
nrecv := nproc + 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this line of code is problematic. If the actor receives a message that will cause a crash, we should not skip that message.
Example: Actor receives a new message -> store data in DB.
If database connection fails, panic -> recover() -> retry

We do not want to skip that message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation from the master branch already does this. Skips a message that failed to process.

@koola koola mentioned this pull request Feb 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants