First draft convert repronim-containers tutorial for website
asmacdo committed Dec 20, 2024
1 parent b42abc4 commit 9818c19
Showing 1 changed file with 204 additions and 0 deletions.
204 changes: 204 additions & 0 deletions content/resources/tutorials/repronim-containers.md
@@ -0,0 +1,204 @@
---
title: ReproNim Containers and Yoda
type: docs
weight: 5
---

**ReproPrinciples**:
- 2a: Use **standard** data formats and extend them to meet your needs.
- 2b: Use **version control** from start to finish
- 2c: **Annotate** data using standard, reproducible procedures
- 3a: Use released versions of open source software tools.
- 3b: Use **version control** from start to finish
- 3c: Automate the installation of your code and its dependencies
- 3d: Automate the execution of your data analysis
- 3e: **Annotate** your code and workflows using standard, reproducible procedures
- 3f: Use **containers** where reasonable

**Actions**: Standards, Annotation, Containers, Version Control
**Standards**: BIDS
**Tools**: ReproNim Containers, Singularity, Datalad

# Challenge

- layout
- annotation of procedure using version control
- annotation of software versions
- automation and replicability of procedures

Using version control and automation to execute procedures can produce re-executable, provenance-rich results, but the task can appear daunting.
Following best practices for file layout (DataLad + YODA principles) provides clear connections (via subdatasets) between the source data and the derivative data produced from it.
Additionally, using `datalad run` with `repronim-containers` preserves a record of exactly which software versions were used and how, leaving a detailed trail for future work.

# Exercise:

Let's assume that our goal is to do Quality Control of an MRI dataset
(which is available as DataLad dataset ds000003). We will create a new
dataset with the output of the QC results (as analyzed by mriqc
BIDS-App).

- create a new dataset that will contain the results and everything needed to obtain them
- install/add subdatasets (code, other datasets, containers)
- perform the analysis using **only** materials available within the reach of this dataset.

This would help to guarantee reproducibility in the future because all the
materials would be *reachable* within that dataset.

Note: This exercise is based on the [ReproNim/containers README](https://github.com/ReproNim/containers/), which should be referenced for more information.

# Step by step guide

#### Step 1: Installing the Necessary Tools

The following tools should be installed:

- [Datalad](https://handbook.datalad.org/en/latest/intro/installation.html)
- [Singularity/Apptainer](https://apptainer.org/docs/admin/main/installation.html)

The `datalad-container` extension should also be installed:

```bash
pip install datalad-container
```
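Before continuing, it can help to confirm that the tools are actually on `PATH`. The following is a minimal sketch, not part of the original tutorial: `require_any` is a hypothetical helper name, and the sketch assumes the container runtime is exposed as either `singularity` or `apptainer`.

```shell
#!/bin/sh
# require_any: succeed if at least one of the named commands is on PATH.
# (Hypothetical helper for illustration; not part of DataLad.)
require_any() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
            return 0
        fi
    done
    echo "missing: $*" >&2
    return 1
}

# Report on each prerequisite without stopping at the first failure.
all_present=1
require_any datalad || all_present=0
require_any git-annex || all_present=0              # DataLad relies on git-annex
require_any singularity apptainer || all_present=0  # either runtime name is accepted
[ "$all_present" -eq 1 ] || echo "install the missing tools before continuing" >&2
```

The check only looks for executables; it does not verify versions or that the `datalad-container` extension is importable.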

#### Step 2: Start a Datalad dataset

Following YODA, our dataset for the results is **the** dataset that will contain everything needed to produce those results.

```bash
mkdir ~/my-experiments
cd ~/my-experiments
datalad create -d ds000003-qc -c text2git
cd ds000003-qc
```

#### Step 3: Install source data

Next we install our source data as a subdataset.

```bash
datalad install -d . -s https://github.com/ReproNim/ds000003-demo sourcedata
```

#### Step 4: Install ReproNim/containers

Next we install the `ReproNim/containers` collection.

```bash
datalad install -d . -s ///repronim/containers code/containers
```

Now let's take a look at what we have.

```
/ds000003-qc # The root dataset contains everything
|--/sourcedata # we call it source, but it is actually ds000003-demo
|--/code/containers # repronim/containers, this is where our non-custom code lives
```
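Subdatasets are registered as git submodules, so the layout above can be checked by reading `.gitmodules` in the root dataset. A minimal sketch, assuming the two installs above succeeded; `list_subdataset_paths` is a hypothetical helper name, and `datalad subdatasets` reports the same information.

```shell
#!/bin/sh
# list_subdataset_paths [FILE]: print the path of every submodule (subdataset)
# registered in a .gitmodules file (default: ./.gitmodules).
# Hypothetical helper for illustration; it only reads the file.
list_subdataset_paths() {
    gitmodules=${1:-.gitmodules}
    [ -f "$gitmodules" ] || { echo "no $gitmodules here" >&2; return 1; }
    # Keep only the value of each "path = ..." line.
    sed -n 's/^[[:space:]]*path[[:space:]]*=[[:space:]]*//p' "$gitmodules"
}

# Run from ~/my-experiments/ds000003-qc; prints a hint if .gitmodules is absent.
list_subdataset_paths || echo "run this from the root dataset" >&2
```

For the dataset built so far, the listing should contain `sourcedata` and `code/containers`.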

#### Step 4b: Freezing Container Image Versions

`freeze_versions` is an optional step that records and "freezes" the
version of the container used. Even if the `///repronim/containers` dataset is
later upgraded with a newer version of the container, we stay "pinned" to the
version we explicitly chose. Note: To switch to a different version of the container
(e.g., to upgrade to a newer one), rerun the `freeze_versions` script with that
version specified.

The container version can be "frozen" either into the clone of the `///repronim/containers`
dataset **or** into the top-level dataset.


**Option 1: Top level dataset (recommended)**

```bash
# Run from ~/my-experiments/ds000003-qc
datalad run -m "Downgrade/Freeze mriqc container version" \
code/containers/scripts/freeze_versions --save-dataset=. bids-mriqc=0.16.0
```

**Option 2: ///repronim/containers**

```bash
# Run from ~/my-experiments/ds000003-qc/
datalad run -m "Downgrade/Freeze mriqc container version" \
code/containers/scripts/freeze_versions bids-mriqc=0.16.0
```

Note: It is recommended to freeze a container image version into the
top-level dataset to simplify reuse. If `///repronim/containers` is
modified in any way, the author must ensure that their altered fork of
`///repronim/containers` is publicly available and that its URL is
specified in `.gitmodules`. By freezing into the top-level dataset
instead, authors do not need to host a modified version of
`///repronim/containers`.
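To confirm what was pinned, one can look for the container name in the dataset configuration. This is a sketch under an assumption not stated in this tutorial: that `freeze_versions` records the chosen image in `.datalad/config` (of the top-level dataset for Option 1, of `code/containers` for Option 2). `show_pinned` is a hypothetical helper name.

```shell
#!/bin/sh
# show_pinned NAME [CONFIG]: print the lines of CONFIG (default .datalad/config)
# that mention container NAME. Hypothetical helper; it only reads the file.
# Assumption: the frozen image path is stored in that config file.
show_pinned() {
    name=$1
    config=${2:-.datalad/config}
    [ -f "$config" ] || { echo "no $config here" >&2; return 1; }
    grep -F -- "$name" "$config"
}

# Usage after Option 1, from ~/my-experiments/ds000003-qc:
#   show_pinned bids-mriqc
# Usage after Option 2:
#   show_pinned bids-mriqc code/containers/.datalad/config
```

If the assumption holds, the matching lines should name an image file carrying the requested version, such as one ending in `bids-mriqc--0.16.0.sing`.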

#### Step 5: Running the Containers

When we run the bids-mriqc container, it will need a working directory
for intermediate files. These are not helpful to commit, so we will
tell `git` (and `datalad`) to ignore the whole directory.

```bash
# Append (>>) rather than overwrite (>) so any existing ignore rules are kept.
echo "workdir/" >> .gitignore && datalad save -m "Ignore workdir" .gitignore
```
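If `.gitignore` may already exist, or this step may be repeated, an append-if-missing variant avoids overwriting or duplicating entries. A minimal sketch; `ignore_once` is a hypothetical helper name, not a git or DataLad command.

```shell
#!/bin/sh
# ignore_once PATTERN [FILE]: append PATTERN to FILE (default .gitignore)
# only if that exact line is not already present.
ignore_once() {
    pattern=$1
    file=${2:-.gitignore}
    touch "$file"
    # If the file does not end with a newline, add one so the new
    # pattern starts on its own line.
    if [ -n "$(tail -c 1 "$file")" ]; then
        echo >> "$file"
    fi
    # -x: match whole lines, -F: treat the pattern as a fixed string.
    grep -qxF -- "$pattern" "$file" || printf '%s\n' "$pattern" >> "$file"
}

# Usage, from ~/my-experiments/ds000003-qc:
#   ignore_once workdir/ && datalad save -m "Ignore workdir" .gitignore
```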

Now we use `datalad containers-run` to perform the analysis.

```bash
datalad containers-run \
-n bids-mriqc \
--input sourcedata \
--output . \
'{inputs}' '{outputs}' participant group -w workdir
```

If everything worked as expected, we will now see our new analysis, and
a commit message of how it was obtained! All of this is contained within
a single (nested) dataset with a complete record of how all the data was
obtained.

```shell
(git) .../ds000003-qc[master] $ git show --quiet
Author: Austin <austin@dartmouth.edu>
Date: Wed Jun 5 15:41:59 2024 -0400

[DATALAD RUNCMD] ./code/containers/scripts/singularity_cm...

=== Do not change lines below ===
{
"chain": [],
"cmd": "./code/containers/scripts/singularity_cmd run code/containers/images/bids/bids-mriqc--0.16.0.sing '{inputs}' '{outputs}' participant group -w workdir",
"dsid": "c9c96ab9-f803-43ba-83e2-2eaec7ab4725",
"exit": 0,
"extra_inputs": [
"code/containers/images/bids/bids-mriqc--0.16.0.sing"
],
"inputs": [
"sourcedata"
],
"outputs": [
"."
],
"pwd": "."
}
^^^ Do not change lines above ^^^
```
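The run record is plain JSON delimited by two marker lines, so it can be pulled out of a commit message with standard tools. A minimal sketch, assuming the marker lines appear exactly as shown in the output above; `extract_run_record` is a hypothetical helper name.

```shell
#!/bin/sh
# extract_run_record: read a commit message on stdin and print only the JSON
# run record found between the two "Do not change lines" markers.
# Prints nothing if the message contains no run record.
extract_run_record() {
    # First sed keeps the marker lines and everything between them;
    # second sed drops the two marker lines themselves.
    sed -n '/^=== Do not change lines below ===$/,/^\^\^\^ Do not change lines above \^\^\^$/p' \
        | sed '1d;$d'
}

# Usage, from inside the dataset:
#   git log -1 --format=%B | extract_run_record
```

The extracted JSON can then be passed to any JSON-aware tool, for example to read the `cmd` or `inputs` fields.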

This record can later be reused (by anyone) with [datalad rerun] to repeat
this computation using exactly the same version(s) of the input data and the
Singularity container. You can even [datalad uninstall] the `sourcedata` and `code/containers`
subdatasets now to save space; they remain retrievable at those exact versions
if you later need to extend or redo your analysis.
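The sequence described above can be sketched as two small wrappers. The function names are hypothetical, and the explicit `datalad get -n` step (reinstall the subdatasets without fetching file content) is an assumption added for clarity; `datalad rerun` may well retrieve what it needs on its own.

```shell
#!/bin/sh
# Sketch only: thin wrappers that make the order of operations explicit.
# Function names are hypothetical; the datalad subcommands are real.

# free_subdatasets PATH...: uninstall the given subdatasets to save space.
# They stay registered in the superdataset at their recorded versions.
free_subdatasets() {
    datalad uninstall "$@"
}

# redo_analysis [REVISION]: reinstall the subdatasets (no file content yet)
# and re-execute the command recorded in REVISION (default: HEAD).
redo_analysis() {
    revision=${1:-HEAD}
    datalad get -n sourcedata code/containers && datalad rerun "$revision"
}

# Usage, from ~/my-experiments/ds000003-qc:
#   free_subdatasets sourcedata code/containers
#   redo_analysis
```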

#### Notes:

- the example above requires DataLad >= 0.11.5 and datalad-container >= 0.4.0;
- for a more elaborate example that uses [reproman] to parallelize execution on
remote resources, see [ReproNim/reproman PR#438](https://github.com/ReproNim/reproman/pull/438);
- a copy of the dataset is available from [`///repronim/ds000003-qc`](http://datasets.datalad.org/?dir=/repronim/ds000003-qc)
and [https://github.com/ReproNim/ds000003-qc](https://github.com/ReproNim/ds000003-qc).

[reproman]: http://reproman.repronim.org
[datalad rerun]: http://docs.datalad.org/projects/container/en/latest/generated/man/datalad-rerun.html
[datalad uninstall]: http://docs.datalad.org/projects/container/en/latest/generated/man/datalad-uninstall.html
