Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry message failures #189

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Retry message failures #189

wants to merge 1 commit into from

Conversation

koola
Copy link

@koola koola commented Feb 18, 2025

This PR is a continuation of #181 . It introduces two new options WithRetries and WithMaxRetries and aims to retry message failures on actor panics.

The retry logic is as follows;

pid := e.SpawnFunc(func(c *Context) {}, "foo", WithRetries(2), WithMaxRetries(2), WithMaxRestarts(1))

  1. Each message that panics is retried up to two times per message WithRetries(2).
  2. On the second retry, an ActorUnprocessableMessageEvent is broadcast to the eventstream containing the message and the message within the buffer is dropped.
  3. Following two consecutive message retries (4 retries), max retries are reached `WithMaxRetries(2).
  4. The actor is restarted to restore actor state and the messages left in buffer are resumed except those that were unprocessable.
  5. Retries are reset, but assume another two consecutive retries and restart. Max restarts are reached and the actor terminates and cleans up.

Using without retry;

pid := e.SpawnFunc(func(c *Context) {}, "foo", WithRetries(0), WithMaxRestarts(1))

  1. Each message that panics is dropped and actor is restarted as per original logic.

Included in this PR is a test that covers the full retry logic as well as amended tests.

All other tests pass.

@mapogolions
Copy link
Contributor

mapogolions commented Feb 23, 2025

@koola
Thank you for revisiting this topic—it's quite interesting. In my view, a minimalist approach, like Erlang’s, is best. Once a message is retrieved from the mailbox, it’s up to the user to handle its processing and any errors. If the handler panics, it should fail, and the system should continue with the next message. Users should be free to decide their fault-tolerance strategies, as the causes of panics can vary. A one-size-fits-all approach seems unnecessary.

In the most basic implementation, it could look like this.

e.SpawnFunc(func(*Context) {
	defer func() {
		// recover logic here
	}()
})

Of course, this is just one perspective

@koola
Copy link
Author

koola commented Feb 25, 2025

Once a message is retrieved from the mailbox, it’s up to the user to handle its processing and any errors. If the handler panics, it should fail, and the system should continue with the next message. Users should be free to decide their fault-tolerance strategies, as the causes of panics can vary. A one-size-fits-all approach seems unnecessary.

This is still the case with this pr as any messages that are unprocessable are up to the user to handle. What this pr adds is a more fault tolerant solution to the default which is to silently drop messages and move on, the very definition of a one size fits all approach. One I wasn't happy with tbh.

@koola
Copy link
Author

koola commented Feb 25, 2025

Also after thinking, your implementation adds more abstraction and further complexity. The actor already has an event stream that broadcasts events, which imo should include unprocessable messages so that there is only one place to subscribe and source of truth per actor. I think this opens up more options than the basic implementation given.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants