
Design option: wrap datalad inside CWL #9

Closed
mih opened this issue May 2, 2024 · 1 comment

mih commented May 2, 2024

The original thinking was to create a new API layer around a new run-record with a new executor (special remote). It is worth taking a step back and reevaluating that encapsulation now that CWL has entered the picture. Rationale: if we adopt CWL, we might as well make it maximally useful, rather than just an internal tool.

A main attraction of CWL is that it is its own ecosystem, and connecting to it rather than reimplementing it is good. Having compute instructions defined as CWL "steps" that can be linked into larger workflows and executed (outside datalad) via standard batch systems would be great. In such a scenario, we would need to make sure that the versioning precision and data-provisioning capabilities of datalad remain available.

One way to achieve this would be a dedicated provisioning workflow step. It would use a dedicated tool to create a suitable working environment for a subsequent payload computation step. This datalad tool could obtain/checkout/pre-populate a dataset from any supported source/identifier, and then hand over to the next step, in which standard CWL input types like File make sense and are sufficient.
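A rough sketch of that handoff in plain shell (all names here are hypothetical; no such datalad provisioning tool exists yet, and the file contents merely stand in for real dataset content):

```shell
# Hypothetical two-step handoff (sketch; not an existing datalad API).
# Step 1: a provisioning tool materializes a dataset worktree at a precise
# version and records that version, preserving datalad's versioning precision.
work=$(mktemp -d)
printf 'sample input\n' > "$work/input.txt"   # stand-in for obtain/checkout/get
version="0123abc"                             # stand-in for the dataset commit

# Step 2: the payload step only needs standard CWL File semantics from here on;
# it sees a plain file in a plain directory.
payload_out=$(grep -c '' "$work/input.txt")
echo "processed $payload_out line(s) from dataset@$version"
```

The point of the split is that everything after step 1 is ordinary file handling, so the payload step can be authored and reused as a standard CWL tool with no datalad awareness.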

This would also have the advantage that the provisioning provenance is captured automatically.

It would also be a "loose" CWL integration, because any of these tools can be used inside or outside CWL, without making either system aware of the other.

There is also no need to have an exclusive datalad tool as the provisioning solution. It would be perfectly fine to use a series of git-clone/annex-init/annex-get commands.
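A minimal sketch of such a command series, using a throwaway local repository as a stand-in for a real dataset URL (the git-annex steps are shown as comments, since they require git-annex and annexed content to be present):

```shell
# Provisioning without any dedicated datalad tool: plain git/git-annex commands.
src=$(mktemp -d)                      # stand-in for a real dataset URL
git init -q "$src"
git -C "$src" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'dataset state to compute against'

clone=$(mktemp -d)/ds
git clone -q "$src" "$clone"          # git-clone: obtain the dataset
# git -C "$clone" annex init          # annex-init (requires git-annex)
# git -C "$clone" annex get .         # annex-get: fetch content the payload needs

provisioned=$(git -C "$clone" log --format=%s -1)
echo "$provisioned"
```

Because these are stock commands, any CWL runner (or a human at a terminal) can perform the provisioning with no datalad installation at all.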

A remake special remote could then make smart decisions. It could

  • run cwltool directly on the worktree of a dataset (whenever it has the right version and all needed content present)
  • auto-generate a workflow that uses a provisioning helper to build an adequate worktree for a CWL payload-workflow to run
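The decision between those two options could look roughly like this (a sketch only; the checks and the commented cwltool invocations are assumptions about a future implementation, demonstrated against a throwaway repository):

```shell
# Sketch of the remake special remote's decision logic (hypothetical).
ds=$(mktemp -d)
git init -q "$ds"
git -C "$ds" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m init
wanted=$(git -C "$ds" rev-parse HEAD)   # version a run-record would ask for

if [ "$(git -C "$ds" rev-parse HEAD)" = "$wanted" ] && git -C "$ds" diff --quiet
then
    # right version checked out, worktree clean: run the payload directly, e.g.
    #   cwltool payload.cwl job.yml
    decision="direct"
else
    # otherwise auto-generate provisioning step + payload step and run those
    decision="provision-first"
fi
echo "$decision"
```

(A real implementation would additionally check that all annexed content needed by the payload is locally present before choosing the direct path.)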
mih converted this from a draft issue May 2, 2024
mih commented May 15, 2024

Closing. Continued in #14

mih closed this as completed May 15, 2024
github-project-automation bot moved this from "discussion needed" to "done" in DataLad remake May 15, 2024