Commit
First draft convert repronim-containers tutorial for website
Showing 1 changed file with 204 additions and 0 deletions.
@@ -0,0 +1,204 @@
---
title: ReproNim Containers and Yoda
type: docs
weight: 5
---

**ReproPrinciples**:
- 2a: Use **standard** data formats and extend them to meet your needs
- 2b: Use **version control** from start to finish
- 2c: **Annotate** data using standard, reproducible procedures
- 3a: Use released versions of open source software tools
- 3b: Use **version control** from start to finish
- 3c: Automate the installation of your code and its dependencies
- 3d: Automate the execution of your data analysis
- 3e: **Annotate** your code and workflows using standard, reproducible procedures
- 3f: Use **containers** where reasonable

**Actions**: Standards, Annotation, Containers, Version Control
**Standards**: BIDS
**Tools**: ReproNim Containers, Singularity, DataLad

# Challenge

- file layout
- annotation of procedures using version control
- annotation of software versions
- automation and replicability of procedures

Using version control and automation to execute procedures can produce re-executable, provenance-rich results, but the task can appear daunting.
Following best practices for file layout (DataLad + YODA principles) provides clear connections, via subdatasets, between the source data and the derivative data produced from it.
Additionally, using `datalad run` with `repronim-containers` preserves a record of exactly which software versions were used and how, leaving a detailed trail for future work.

# Exercise

Let's assume that our goal is to perform quality control (QC) of an MRI dataset,
which is available as the DataLad dataset ds000003. We will create a new
dataset holding the QC results produced by the mriqc BIDS-App. To do so, we will:

- create a new dataset that will contain the results and everything needed
  to obtain them
- install/add subdatasets (code, other datasets, containers)
- perform the analysis using **only** materials available within the reach of this dataset.

This helps guarantee reproducibility in the future because all the
materials remain *reachable* from within that dataset.

Note: This exercise is based on the [ReproNim/containers README](https://github.com/ReproNim/containers/), which should be consulted for more information.

# Step by step guide

#### Step 1: Installing the Necessary Tools

The following tools should be installed:

- [DataLad](https://handbook.datalad.org/en/latest/intro/installation.html)
- [Singularity/Apptainer](https://apptainer.org/docs/admin/main/installation.html)

The `datalad-container` extension should also be installed:

```bash
pip install datalad-container
```
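
Before moving on, it can help to confirm that the tools are actually on your `PATH`. The following is a minimal sketch rather than an official check: the helper name `have_tool` is made up for illustration, and the tool list assumes that DataLad relies on `git` and `git-annex` and that either `singularity` or `apptainer` serves as the container runtime.

```bash
#!/usr/bin/env bash
# Report whether each tool used in this tutorial can be found on PATH.
# `have_tool` is a hypothetical helper name, not part of DataLad.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

missing=0
for tool in datalad git git-annex; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing=$((missing + 1))
    fi
done

# Either container runtime will do, so only one of the two has to be present.
if have_tool singularity || have_tool apptainer; then
    echo "found:   singularity/apptainer"
else
    echo "missing: singularity or apptainer"
    missing=$((missing + 1))
fi

echo "$missing required tool(s) missing"
```

If anything is reported as missing, the installation guides linked above are the place to start.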

#### Step 2: Start a DataLad dataset

Following YODA, our dataset for the results is **the** dataset that will contain everything needed to produce those results.

```bash
mkdir -p ~/my-experiments
cd ~/my-experiments
# text2git keeps text files in Git itself rather than in git-annex
datalad create -d ds000003-qc -c text2git
cd ds000003-qc
```

#### Step 3: Install source data

Next, we install our source data as a subdataset.

```bash
datalad install -d . -s https://github.com/ReproNim/ds000003-demo sourcedata
```

#### Step 4: Install ReproNim/containers

Next, we install the `ReproNim/containers` collection.

```bash
datalad install -d . -s ///repronim/containers code/containers
```

Now let's take a look at what we have.

```
/ds000003-qc        # The root dataset contains everything
|--/sourcedata      # we call it source, but it is actually ds000003-demo
|--/code/containers # repronim/containers, this is where our non-custom code lives
```
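
To double-check that both subdatasets were registered, one option is to read their paths from `.gitmodules`, the file in which Git (and therefore DataLad) records them. This is only a sketch: `list_subdataset_paths` is a hypothetical helper name, and `datalad subdatasets` reports similar information.

```bash
#!/usr/bin/env bash
# Print the path of every subdataset registered in a .gitmodules file.
# `list_subdataset_paths` is a hypothetical helper, not part of DataLad.
list_subdataset_paths() {
    local gitmodules=${1:-.gitmodules}
    # Each entry is reported as "submodule.<name>.path <path>"; keep the path.
    git config --file "$gitmodules" --get-regexp '^submodule\..*\.path$' \
        | cut -d' ' -f2-
}

# Only report when run from the top of a dataset that has subdatasets.
if [ -f .gitmodules ]; then
    list_subdataset_paths .gitmodules
fi
```

Run from `~/my-experiments/ds000003-qc` after the two installs above, this should list `sourcedata` and `code/containers`.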

#### Step 5: Freezing Container Image Versions

`freeze_versions` is an optional step that records and "freezes" the
version of the container used. Even if the `///repronim/containers` dataset is
later upgraded with a newer version of our container, we stay "pinned" to the
version we explicitly chose. Note: To switch to a different version of the container
(e.g., to upgrade to a newer one), rerun the `freeze_versions` script with the desired
version specified.

The container version can be "frozen" into the clone of the `///repronim/containers`
dataset, **or** into the top-level dataset.

**Option 1: Top-level dataset (recommended)**

```bash
# Run from ~/my-experiments/ds000003-qc
datalad run -m "Downgrade/Freeze mriqc container version" \
    code/containers/scripts/freeze_versions --save-dataset=. bids-mriqc=0.16.0
```

**Option 2: ///repronim/containers**

```bash
# Run from ~/my-experiments/ds000003-qc/
datalad run -m "Downgrade/Freeze mriqc container version" \
    code/containers/scripts/freeze_versions bids-mriqc=0.16.0
```

Note: It is recommended to freeze a container image version into the
top-level dataset to simplify reuse. If `///repronim/containers` is
modified in any way, the author must ensure that their altered fork of
`///repronim/containers` is publicly available and that its URL is
specified in `.gitmodules`. By freezing into the top-level dataset
instead, authors do not need to host a modified version of
`///repronim/containers`.

#### Step 6: Running the Containers

When we run the bids-mriqc container, it will need a working directory
for intermediate files. These are not helpful to commit, so we will
tell `git` (and `datalad`) to ignore the whole directory.

```bash
echo "workdir/" > .gitignore && datalad save -m "Ignore workdir" .gitignore
```
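
Note that `>` overwrites any existing `.gitignore`, which is harmless in a freshly created dataset. When reusing this step in a dataset that already has ignore rules, an append-if-missing variant may be safer. The sketch below assumes nothing beyond standard shell tools; `ignore_once` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Add a pattern to an ignore file unless that exact line is already present.
# `ignore_once` is a hypothetical helper name used only in this sketch.
ignore_once() {
    local pattern=$1
    local file=${2:-.gitignore}
    touch "$file"
    # -x: match whole lines only, -F: treat the pattern as a fixed string.
    if ! grep -qxF -e "$pattern" "$file"; then
        printf '%s\n' "$pattern" >> "$file"
    fi
}
```

With the helper defined, `ignore_once workdir/ && datalad save -m "Ignore workdir" .gitignore` could stand in for the one-liner above.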

Now we use `datalad containers-run` to perform the analysis.

```bash
datalad containers-run \
    -n bids-mriqc \
    --input sourcedata \
    --output . \
    '{inputs}' '{outputs}' participant group -w workdir
```

If everything worked as expected, we will now see our new analysis
results and a commit message recording how they were obtained! All of this is contained within
a single (nested) dataset with a complete record of how all the data was
obtained.

```shell
(git) .../ds000003-qc[master] $ git show --quiet
Author: Austin <austin@dartmouth.edu>
Date: Wed Jun 5 15:41:59 2024 -0400

[DATALAD RUNCMD] ./code/containers/scripts/singularity_cm...

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "./code/containers/scripts/singularity_cmd run code/containers/images/bids/bids-mriqc--0.16.0.sing '{inputs}' '{outputs}' participant group -w workdir",
 "dsid": "c9c96ab9-f803-43ba-83e2-2eaec7ab4725",
 "exit": 0,
 "extra_inputs": [
  "code/containers/images/bids/bids-mriqc--0.16.0.sing"
 ],
 "inputs": [
  "sourcedata"
 ],
 "outputs": [
  "."
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
```
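
Because the run record is plain JSON placed between two marker lines, it can be pulled out of a commit message for inspection or further processing. The following is a minimal sketch: `extract_run_record` is a hypothetical helper, not a DataLad command, and it assumes the marker lines appear exactly as shown above, without leading indentation (as in the output of `git log --format=%B`).

```bash
#!/usr/bin/env bash
# Print the JSON run record embedded in a commit message read from stdin.
# `extract_run_record` is a hypothetical helper, not a DataLad command.
extract_run_record() {
    awk '
        $0 == "=== Do not change lines below ===" { inside = 1; next }
        $0 == "^^^ Do not change lines above ^^^" { inside = 0 }
        inside { print }
    '
}

# Show the record of the latest commit when run inside a Git repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
    git log -1 --format=%B | extract_run_record
fi
```

The output can be piped into any JSON-aware tool to look up, for example, the exact container image listed under `extra_inputs`.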

This record can later be reused (by anyone) with [datalad rerun] to rerun
this computation using exactly the same version(s) of the input data and the
Singularity container. You can even [datalad uninstall] the sourcedata and containers
subdatasets now to save space: they will remain retrievable at those exact versions later
on if you need to extend or redo your analysis.

#### Notes:

- the example above requires DataLad >= 0.11.5 and datalad-container >= 0.4.0;
- for a more elaborate example that uses [reproman] to parallelize execution on
  remote resources, see [ReproNim/reproman PR#438](https://github.com/ReproNim/reproman/pull/438);
- a copy of the dataset is available from [`///repronim/ds000003-qc`](http://datasets.datalad.org/?dir=/repronim/ds000003-qc)
  and [https://github.com/ReproNim/ds000003-qc](https://github.com/ReproNim/ds000003-qc).

[reproman]: http://reproman.repronim.org
[datalad rerun]: http://docs.datalad.org/projects/container/en/latest/generated/man/datalad-rerun.html
[datalad uninstall]: http://docs.datalad.org/projects/container/en/latest/generated/man/datalad-uninstall.html