Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: capture errors that shouldnt cause the whole plan to be discarded during network resolution #444

Open
wants to merge 6 commits into
base: transition-to-runkit
Choose a base branch
from

Conversation

jhonasiv
Copy link
Contributor

@jhonasiv jhonasiv commented Dec 6, 2024

Depends on:

We introduced a ResolutionError to mark errors during the new plan
network resolution shouldnt be raised, which causes the whole transation
to fail. Instead, we capture them and fail the deployment of the
specific tasks that caused them.

@jhonasiv jhonasiv requested review from doudou and wvmcastro December 6, 2024 13:34
@jhonasiv jhonasiv self-assigned this Dec 6, 2024
@jhonasiv jhonasiv force-pushed the capture-errors branch 3 times, most recently from d0b83eb to 63070d9 Compare December 6, 2024 17:17
@jhonasiv jhonasiv force-pushed the capture-errors branch 2 times, most recently from 72722c1 to d070e92 Compare December 23, 2024 21:12
@jhonasiv jhonasiv changed the title [WIP] feat: capture errors that shouldnt cause the whole plan to be discarded during network resolution feat: capture errors that shouldnt cause the whole plan to be discarded during network resolution Jan 29, 2025
Copy link

@wvmcastro wvmcastro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that is not clear to me, when trying to deploy the below composition there is an error with second deployment and on_error: commit. The deployer will still deploy first and second child?

class Parent < Compositions
  add Model1, as: "first"
  add Model2, as "second"
  add Model3, as "third"
  ...
end

@jhonasiv
Copy link
Contributor Author

jhonasiv commented Feb 3, 2025

One thing that is not clear to me, when trying to deploy the below composition there is an error with second deployment and on_error: commit. The deployer will still deploy first and second child?

class Parent < Compositions
  add Model1, as: "first"
  add Model2, as "second"
  add Model3, as "third"
  ...
end

If a child of a parent fails the deployment validation, itself and all of its parents also fail, therefore the other children will also fail. PS: on_error: commit is only used by profile_assertions tests, so in normal circumstances the deployer will do what I just said, but when on_error: commit the whole transaction would be commited, which would lead to all present errors being propagated at the end of the network generation process.

This adapts the exceptions that previously refered with multiple tasks
to the error capture system, that requires each exception to refer to a
single task. The error messages were adapted for a single task context,
but unfortunately they can still a bit too verbose.
feature flag

This creates the structures that should be used to capture errors. One
can retain the previous behavior by using the RaiseErrorHandler, or use
the error capture via the ResolutionErrorHandler. The feature flag
is meant to be used to switch between the two options.
… them

Through the resolution error handler, capture the errors during the
network generation. Then, they are propagated as failed events for each
individual task that caused the error.

The previous behavior is enabled when the
capture_errors_during_network_resolution feature flag is off. That way,
any exception is immediately raised. 

Some behaviors were slightly changed to work for both the old and new features, 
specially for unit tests.
errors

Whenever resolution errors are reported, it will emit failed for each
individual task AND raise PartialNetworkResolution since the profile
assertions require an exception to fail
The error capture introduces a slight change in the engine 
interface, so we need to adapt these so they keep working as before.
@jhonasiv jhonasiv force-pushed the capture-errors branch 2 times, most recently from 729b796 to aa31a18 Compare February 25, 2025 14:08
Now that we can capture errors, we have to add more information on the
result of the application of the generated network to the plan. This is
done by changing the return of #syskit_apply_resolution_results to a
struct that has information of the errors and the planned instances.
Comment on lines 83 to +84
core early_deploy: true
core capture_errors_during_network_resolution: true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should run with both activated as well ... no ?

validate_deployed_network: (true if Syskit.conf.early_deploy?),
early_deploy: Syskit.conf.early_deploy?
validate_deployed_network: Syskit.conf.early_deploy?,
early_deploy: false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change from Syskit.conf.early_deploy?, especially given that you kept it for validate_deployed_network

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better to make early_deploy the default for validate_deployed_network

early_deploy: Syskit.conf.early_deploy?,
validate_deployed_network: early_deploy


if on_error == :save
errors.each do |e|
handle_resolution_exception(e, on_error: on_error)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It really does not make sense to call this more than once.

Also, the (very little) difference in logic between handle_resolution_errors and handle_resolution_exception looks suspicious.

)

unless resolution_errors.empty? || cleanup_resolution_errors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find that unless(complex logic) is very hard to read ... I'd keep it for simple logic (only one boolean)

Here, if !resolution_errors.empty? && !cleanup_resolution_errors

elsif on_error == :commit
work_plan.commit_transaction
else
errors.each { |err| real_plan.execution_engine.add_error(err) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not very happy that you modify real_plan from here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conclusion: should be removed.

Comment on lines +170 to +175
resolution_errors.each do |error|
t = error.planned_task
expect_execution do
t.failed_event.emit(error.original_exceptions)
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better to invert, that is run the loop inside the expect_execution

And replace expect_execution by execute. expect_execution without a to block actually does nothing

@@ -227,31 +238,32 @@ def syskit_deploy_normalized_placeholder_tasks(
.to { emit(*not_running.map(&:start_event)) }

resolve_options = Hash[on_error: :commit].merge(resolve_options)
begin
resolution_errors = execute do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the need for execute here ?

Comment on lines +285 to +287
# formatted = formatted.gsub(/<id:\d+>/, "<id:X>")
# .gsub(/Class:0x[0-9a-f]+/,
# "Class:0xXXXXXX")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover

Comment on lines +340 to +343
# assert_equal expected,
# formatted.gsub(/<id:\d+>/, "<id:X>")
# .gsub(/Class:0x[0-9a-f]+>:0x[0-9a-f]+>/,
# "Class:0xXXXXXX>:0xXXXXXX>")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leftover

t.success_event.emit
end
nil
resolution_apply_result.errors.group_by(&:planning_task).each do |task, e|
task.failed_event.emit e.flat_map(&:original_exception)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

task is not guaranteed to be running here. Might be from a previous deploy.

ensure
syskit_current_resolution_keepalive.discard_transaction
@syskit_current_resolution = nil
end

running_requirement_tasks.each do |t|
resolution_apply_result.instances.each_key do |t|
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of finishing InstanceRequirementsTask when the resolution finishes, my proposition would be to

  • have an intermediate event resolution_success, which gets emitted here instead of success
  • change the logic that finds what needs to be deployed to only pick the running InstanceRequirementsTask
  • change the logic that finds whether there are new requirements to check if there are running InstanceRequirementsTask without the resolution_success event emitted (resolution_success_event.emitted?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logic in runtime/apply_deployment_tasks_from_plan.rb

rescue ::Exception => e # rubocop:disable Lint/RescueException
running_requirement_tasks.each do |t|
t.failed_event.emit(e)
end
e
NetworkGeneration::SystemNetworkPlanApplyResult.new(
fulfilled: false, errors: [e]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be returning a SystemNetworkPlanApplyResult that has the same instances as before the resolution, and a proper list of errors attached to the new ones.

tasks_to_instanciate.map(&:planning_task),
validate_generated_network: false,
early_deploy: false
)
unless resolution_errors.empty?
resolution_errors.each do |error|
t = error.planned_task
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

planning_task ?

elsif on_error == :commit
work_plan.commit_transaction
else
errors.each { |err| real_plan.execution_engine.add_error(err) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conclusion: should be removed.

Comment on lines +6 to +7
Syskit.conf.early_deploy = true
Syskit.conf.capture_errors_during_network_resolution = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decouple the two flags, and run 4 test suites (plain core, with early deploy, with capture errors and with both)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants