-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add a page to expose non-deterministic errors details #185
Conversation
|
||
 | ||
|
||
Workflow A then will be picked up by Host B and continues its execution. This process is called <b>change of workflow ownership</b>. However, after Host B gains ownership of the Workflow A, it does not have any information about its historical executions. For example, Workflow A may have executed many activities and it fails. Host B needs to redo all its history until the moment of failure. The process of reconstructing history of a workflow is called <b>history replay</b>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This process is called change of workflow ownership.
hmm. do we call it that literally anywhere else? it's also not really "ownership", it can move around randomly constantly and that's also normal and fine.
I'd be very cautious of making new terms in a doc like this, particularly when they are used in other workflow systems for sometimes very different purposes.
```go | ||
err = workflow.ExecuteActivity(ctx, ActivityC, a).Get(ctx, nil) | ||
``` | ||
And restart worker, then it will run into this error. Because in the history, the workflow has scheduled ActivityB with input a, but during replay, it schedules ActivityC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pretty sure this also occurs when e.g. changing an activity to a timer, not just when the same high-level type contains different values that we check. probably worth covering that too.
``` | ||
For source code click [here](https://github.com/uber-go/cadence-client/blob/e5081b085b0333bac23f198e57959681e0aee987/internal/internal_decision_state_machine.go#L693) | ||
|
||
This usually means workflow history is corrupted due to some bug. For example, the same activity can be scheduled and differentiated by activityID. So ActivityIDs for different activities are supposed to be unique in workflow history. If however we have an ActivityID collision, replay will run into this error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably check this more closely, but it might be true. I think it's currently thrown for other scenarios though, including "it replayed blindly too far before validating, and did something irrational instead of detecting the non-determinism earlier"
These are some common changes that can be categorized by these previous 4 types mentioned above. | ||
|
||
1. Changing the order of executing Cadence defined operations, such as activities, timer, child workflows, signals, cancelRequest. | ||
2. Change the duration of a timer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changing duration is fine, adding or removing is not.
(possible exception: I'm not sure if using a timer with duration 0 is the same as "no timer" or not...)
## Root cause of non-deterministic errors | ||
Cadence workflows are designed as long-running operations, and therefore the workflow code you write must be deterministic so that no matter how many time it is executed it alwasy produce the same results. | ||
|
||
In production environment, your workflow code will run on a distributed system orchestrated by clusters of machines. However, machine failures are inevitable and can happen anytime to your workflow host. If you have a workflow running for long period of time, maybe months even years, and it fails due to loss of a host, it will be resumed on another machine and continue the rest of its execution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for this and the diagram: tbh I think this might sound too close to "it runs on cadence's machines" instead of "running in your workers".
|
||
### What are some changes that will NOT trigger non-deterministic errors? | ||
|
||
Code changes that are free of non-deterministic erorrs normally do not involve decision tasks in Cadence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code changes that are free of non-deterministic erorrs normally do not involve decision tasks in Cadence. | |
Code changes that are free of non-deterministic errors normally do not involve decision tasks in Cadence. |
though tbh I'm not quite sure what this is trying to teach. from above, everything involves decision tasks, because that's what creates everything....... and you can't change decision tasks themselves at all, so....
Code changes that are free of non-deterministic erorrs normally do not involve decision tasks in Cadence. | ||
|
||
1. Activity input and output are structs but changing content of this struct does not produce non-deterministic errors. However, it may produce serialization errors based on your data converter implementation. Cadence uses `json.Marshal` and `json.Marshal` function by default. | ||
2. Code changes that does not modify history events are safe to be checked in. For example, logging or metrics implementations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. Code changes that does not modify history events are safe to be checked in. For example, logging or metrics implementations. | |
2. Code changes that do not modify history events are safe. For example, logging or metrics implementations. |
"checked in" isn't particularly relevant and isn't mentioned anywhere else anyway.
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
Co-authored-by: Steven L <imgroxx@gmail.com>
This page provides a detailed overview of Cadence non-deterministic errors. In most cases developers see a non-deterministic error but they don't know where to start to debug. The concept of decision tasks unfortunately got leaked to the public so to better use Cadence developers need to at least know that they are and how to construct the decision tasks stack.