
Design option: wrap datalad inside CWL #9

Closed
mih opened this issue May 2, 2024 · 1 comment

mih commented May 2, 2024

The original thinking was to create a new API layer around a new run-record with a new executor (special remote). It is worth taking a step back and reevaluating that encapsulation now that CWL has entered the picture. Rationale: if we adopt CWL, we might as well make it maximally useful, rather than just an internal tool.

A main attraction of CWL is that it is its own ecosystem, and connecting to it rather than reimplementing it is good. Having compute instructions defined as CWL "steps" that can be linked into larger workflows and executed (outside datalad) via standard batch systems would be great. In such a scenario, we would need to make sure that the versioning precision and data-provisioning capabilities of datalad remain available.

One way to achieve this would be a dedicated provisioning workflow step. It would use a dedicated tool to create a suitable working environment for a subsequent payload computation step. This datalad tool could obtain/checkout/pre-populate a dataset from any supported source/identifier, and then hand over to the next step, in which standard CWL input types like File make sense and are sufficient.
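A rough sketch of that handoff in plain shell (all names here are hypothetical; no such datalad provisioning tool exists yet, and the file contents merely stand in for real dataset content):

```shell
# Hypothetical two-step handoff (sketch; not an existing datalad API).
# Step 1: a provisioning tool materializes a dataset worktree at a precise
# version and records that version, preserving datalad's versioning precision.
work=$(mktemp -d)
printf 'sample input\n' > "$work/input.txt"   # stand-in for obtain/checkout/get
version="0123abc"                             # stand-in for the dataset commit

# Step 2: the payload step only needs standard CWL File semantics from here on;
# it sees a plain file in a plain directory.
payload_out=$(grep -c '' "$work/input.txt")
echo "processed $payload_out line(s) from dataset@$version"
```

The point of the split is that everything after step 1 is ordinary file handling, so the payload step can be authored and reused as a standard CWL tool with no datalad awareness.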

This would also have the advantage that the provisioning provenance is captured automatically.

It would also be a "loose" CWL integration, because any of these tools can be used inside or outside CWL, without making either system aware of the other.

There is also no need to have an exclusive datalad tool as the provisioning solution. It would be perfectly fine to use a series of git-clone/annex-init/annex-get commands.
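A minimal sketch of such a command series, using a throwaway local repository as a stand-in for a real dataset URL (the git-annex steps are shown as comments, since they require git-annex and annexed content to be present):

```shell
# Provisioning without any dedicated datalad tool: plain git/git-annex commands.
src=$(mktemp -d)                      # stand-in for a real dataset URL
git init -q "$src"
git -C "$src" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'dataset state to compute against'

clone=$(mktemp -d)/ds
git clone -q "$src" "$clone"          # git-clone: obtain the dataset
# git -C "$clone" annex init          # annex-init (requires git-annex)
# git -C "$clone" annex get .         # annex-get: fetch content the payload needs

provisioned=$(git -C "$clone" log --format=%s -1)
echo "$provisioned"
```

Because these are stock commands, any CWL runner (or a human at a terminal) can perform the provisioning with no datalad installation at all.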

A remake special remote could then make smart decisions. It could

  • run cwltool directly on the worktree of a dataset (whenever it has the right version and all needed content present)
  • auto-generate a workflow that uses a provisioning helper to build an adequate worktree for a CWL payload-workflow to run
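The decision between those two options could look roughly like this (a sketch only; the checks and the commented cwltool invocations are assumptions about a future implementation, demonstrated against a throwaway repository):

```shell
# Sketch of the remake special remote's decision logic (hypothetical).
ds=$(mktemp -d)
git init -q "$ds"
git -C "$ds" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m init
wanted=$(git -C "$ds" rev-parse HEAD)   # version a run-record would ask for

if [ "$(git -C "$ds" rev-parse HEAD)" = "$wanted" ] && git -C "$ds" diff --quiet
then
    # right version checked out, worktree clean: run the payload directly, e.g.
    #   cwltool payload.cwl job.yml
    decision="direct"
else
    # otherwise auto-generate provisioning step + payload step and run those
    decision="provision-first"
fi
echo "$decision"
```

(A real implementation would additionally check that all annexed content needed by the payload is locally present before choosing the direct path.)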
mih converted this from a draft issue May 2, 2024
mih commented May 15, 2024

Closing. Continued in #14

mih closed this as completed May 15, 2024
github-project-automation bot moved this from "discussion needed" to "done" in DataLad remake May 15, 2024