Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug when calling wf in eager task with thread #3125

Merged
merged 6 commits into from
Feb 12, 2025
Merged

Conversation

wild-endeavor
Copy link
Contributor

@wild-endeavor wild-endeavor commented Feb 11, 2025

Why are the changes needed?

Unable to run subworkflows, as part of eager workflows, via to_thread. This PR only fixes the ability to run subworkflows on the backend. Local execution still needs additional fixes.

What changes were proposed in this pull request?

  • A return was missing from the call handler, leading to local runs of subworkflows locally run by eager to run multiple things.
  • The compile step has been taken out of the workflow call handler.
  • In local testing found that the behavior of the reconcile function in the worker doesn't handle the re-run case where the execution is already found. The reconcile function now will always just sync the execution, and then handle state management appropriately.
    • Additional terminal states are mapped to failed in the update object.
    • asserts were hard to debug - these have been changed to explicit assertion errors.

The issues encountered while debugging this...

  • Local execute in my test case (not included) was only succeeding maybe 1 out of every 6-7 times.
  • The compile step that was moved into the conditional was causing failures all the time.

My running theory for this is because the context manager is not thread safe. Need to investigate further how to_thread's contextvar copy works. Context copying is needed because we need to accurately represent the state upon entering the thread, but removing the context frames seems to be polluting each other's context.

How was this patch tested?

Tested locally and on Union internal clusters.

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

Summary by Bito

This PR enhances subworkflow execution in eager workflows via to_thread by implementing improved error handling, state management, and thread safety. Key changes include fixing missing return statements in async calls, improving error messages for worker queue assertions, and enhancing context manager thread safety. The implementation removes unused abort metadata and includes documentation improvements along with comprehensive test additions for workflow execution phases.

Unit tests added: True

Estimated effort to review (1-5, lower is better): 2

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
Copy link

codecov bot commented Feb 12, 2025

Codecov Report

Attention: Patch coverage is 7.89474% with 35 lines in your changes missing coverage. Please review.

Project coverage is 46.59%. Comparing base (74bde49) to head (9f4a4d9).
Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
flytekit/core/worker_queue.py 0.00% 18 Missing ⚠️
flytekit/core/python_function_task.py 6.25% 15 Missing ⚠️
flytekit/core/promise.py 0.00% 1 Missing ⚠️
flytekit/core/workflow.py 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #3125       +/-   ##
===========================================
- Coverage   76.83%   46.59%   -30.25%     
===========================================
  Files         204      204               
  Lines       21665    21622       -43     
  Branches     2799     2806        +7     
===========================================
- Hits        16646    10074     -6572     
- Misses       4248    11053     +6805     
+ Partials      771      495      -276     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
@wild-endeavor wild-endeavor changed the title Eager wfs Bug when calling wf in eager task with thread Feb 12, 2025
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
@flyte-bot
Copy link
Contributor

flyte-bot commented Feb 12, 2025

Code Review Agent Run #81ea53

Actionable Suggestions - 1
  • flytekit/core/python_function_task.py - 1
Additional Suggestions - 2
  • flytekit/core/worker_queue.py - 1
    • Consider consolidating failed status conditions · Line 237-240
  • tests/flytekit/unit/core/test_worker_queue.py - 1
    • Consider improving test parametrization readability · Line 265-270
Review Details
  • Files reviewed - 5 · Commit Range: 244df7d..1c853cf
    • flytekit/core/promise.py
    • flytekit/core/python_function_task.py
    • flytekit/core/worker_queue.py
    • flytekit/core/workflow.py
    • tests/flytekit/unit/core/test_worker_queue.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito Logo

@flyte-bot
Copy link
Contributor

flyte-bot commented Feb 12, 2025

Changelist by Bito

This pull request implements the following key changes.

Key Change Files Impacted
Bug Fix - Eager Workflow Execution and State Management Fixes

promise.py - Added missing return statement in async call handler

python_function_task.py - Added run method for local testing of eager parent tasks

worker_queue.py - Improved state management and error handling in worker queue reconciliation

workflow.py - Made workflow compilation conditional for eager execution

Testing - Enhanced Worker Queue Testing

test_worker_queue.py - Added comprehensive tests for workflow execution phases and worker queue behavior

@flyte-bot
Copy link
Contributor

flyte-bot commented Feb 12, 2025

Code Review Agent Run #507894

Actionable Suggestions - 0
Review Details
  • Files reviewed - 5 · Commit Range: 8b2d54c..1c853cf
    • flytekit/core/python_function_task.py
    • flytekit/core/worker_queue.py
    • flytekit/core/workflow.py
    • flytekit/image_spec/default_builder.py
    • tests/flytekit/unit/core/test_worker_queue.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito Logo

flytekit/core/promise.py Show resolved Hide resolved
tests/flytekit/unit/core/test_worker_queue.py Outdated Show resolved Hide resolved
flytekit/core/python_function_task.py Show resolved Hide resolved
flytekit/core/python_function_task.py Show resolved Hide resolved
Signed-off-by: Yee Hing Tong <wild-endeavor@users.noreply.github.com>
@wild-endeavor wild-endeavor merged commit 464493e into master Feb 12, 2025
109 of 113 checks passed
@flyte-bot
Copy link
Contributor

flyte-bot commented Feb 12, 2025

Code Review Agent Run #636e20

Actionable Suggestions - 0
Review Details
  • Files reviewed - 1 · Commit Range: 1c853cf..9f4a4d9
    • tests/flytekit/unit/core/test_worker_queue.py
  • Files skipped - 0
  • Tools
    • Whispers (Secret Scanner) - ✔︎ Successful
    • Detect-secrets (Secret Scanner) - ✔︎ Successful
    • MyPy (Static Code Analysis) - ✔︎ Successful
    • Astral Ruff (Static Code Analysis) - ✔︎ Successful

AI Code Review powered by Bito Logo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants