First draft convert repronim-containers tutorial for website

ReproNim · Dec 20, 2024 · 9818c19
1 parent b42abc4
commit 9818c19
Showing 1 changed file with 204 additions and 0 deletions.
diff --git a/content/resources/tutorials/repronim-containers.md b/content/resources/tutorials/repronim-containers.md
@@ -0,0 +1,204 @@
+---
+title: ReproNim Containers and YODA
+type: docs
+weight: 5 
+---
+
+**ReproPrinciples**: 
+ - 2a: Use **standard** data formats and extend them to meet your needs
+ - 2b: Use **version control** from start to finish
+ - 2c: **Annotate** data using standard, reproducible procedures
+ - 3a: Use released versions of open source software tools
+ - 3b: Use **version control** from start to finish
+ - 3c: Automate the installation of your code and its dependencies
+ - 3d: Automate the execution of your data analysis
+ - 3e: **Annotate** your code and workflows using standard, reproducible procedures
+ - 3f: Use **containers** where reasonable
+
+**Actions**:  Standards, Annotation, Containers, Version Control
+**Standards**:  BIDS 
+**Tools**: ReproNim Containers, Singularity, DataLad
+
+# Challenge
+
+- layout
+- annotation of procedure using version control
+- annotation of software versions 
+- automation and replicability of procedures
+
+Using version control and automation to execute procedures can produce re-executable and provenance-rich results, but the task can appear daunting.
+Following best practices for file layout (DataLad + YODA principles) provides clear connections (via subdatasets) between the source data and the derivative data produced from it.
+Additionally, using `datalad run` with `repronim-containers` preserves the provenance of exactly which software versions were used and how, leaving a detailed trail for future work.
+
+# Exercise
+
+Let's assume that our goal is to do quality control (QC) of an MRI dataset
+(available as the DataLad dataset ds000003). We will create a new
+dataset containing the QC results produced by the mriqc
+BIDS-App. The plan is to:
+
+- create a new dataset that will contain the results and everything needed
+    to obtain them
+- install/add subdatasets (code, other datasets, containers)
+- perform the analysis using **only** materials available within the reach of this dataset.
+
+This helps ensure reproducibility in the future because all the
+materials remain *reachable* within that dataset.
+
+Note: This exercise is based on the [ReproNim/containers README](https://github.com/ReproNim/containers/), which should be referenced for more information.
+
+# Step-by-step guide
+
+#### Step 1: Installing the Necessary Tools
+
+The following tools should be installed:
+
+- [DataLad](https://handbook.datalad.org/en/latest/intro/installation.html)
+- [Singularity/Apptainer](https://apptainer.org/docs/admin/main/installation.html)
+
+The `datalad-container` extension should also be installed.
+
+```bash
+pip install datalad-container
+```
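+
+Before moving on, it can help to confirm that the tools are actually on the `PATH`. The following is a minimal sketch, not part of the original tutorial: `check_tool` is a hypothetical helper name, and the extension check assumes that `datalad containers-run --help` succeeds only when `datalad-container` is installed.
+
+```bash
+# Hypothetical helper: report whether a command is available on PATH
+check_tool() {
+    if command -v "$1" >/dev/null 2>&1; then
+        echo "found: $1"
+    else
+        echo "missing: $1"
+        return 1
+    fi
+}
+
+check_tool datalad || true
+# Either container runtime is acceptable; apptainer is only checked if singularity is absent
+check_tool singularity || check_tool apptainer || true
+
+# Assumption: the containers-run subcommand exists only with the extension installed
+if datalad containers-run --help >/dev/null 2>&1; then
+    echo "found: datalad-container extension"
+else
+    echo "missing: datalad-container extension"
+fi
+```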
+
+#### Step 2: Start a DataLad dataset
+
+Following YODA, our dataset for the results is **the** dataset that will contain everything needed to produce those results. 
+
+```bash
+mkdir ~/my-experiments
+cd ~/my-experiments
+datalad create -d ds000003-qc -c text2git
+cd ds000003-qc
+```
+
+#### Step 3: Install source data
+
+Next we install our source data as a subdataset.
+
+```bash
+datalad install -d . -s https://github.com/ReproNim/ds000003-demo sourcedata
+```
+
+#### Step 4: Install ReproNim/containers
+
+Next we install the `ReproNim/containers` collection.
+
+```bash
+datalad install -d . -s ///repronim/containers code/containers
+```
+
+Now let's take a look at what we have.
+
+```
+/ds000003-qc # The root dataset contains everything
+ |--/sourcedata # we call it source, but it is actually ds000003-demo
+ |--/code/containers # repronim/containers, this is where our non-custom code lives
+```
+
+#### Step 5: Freezing Container Image Versions
+
+`freeze_versions` is an optional step that records and "freezes" the
+version of the container used. Even if the `///repronim/containers` dataset is
+later upgraded with a newer version of the container, we stay "pinned" to the
+version we explicitly chose. Note: To switch to a different version of the container
+(e.g., to upgrade to a newer one), rerun the `freeze_versions` script with that version
+specified.
+
+The container version can be "frozen" either into the clone of the `///repronim/containers`
+dataset **or** into the top-level dataset.
+
+**Option 1: Top level dataset (recommended)**
+
+```bash
+# Run from ~/my-experiments/ds000003-qc
+datalad run -m "Downgrade/Freeze mriqc container version" \
+    code/containers/scripts/freeze_versions --save-dataset=. bids-mriqc=0.16.0
+```
+
+**Option 2: ///repronim/containers**
+
+```bash
+# Run from ~/my-experiments/ds000003-qc/
+datalad run -m "Downgrade/Freeze mriqc container version" \
+    code/containers/scripts/freeze_versions bids-mriqc=0.16.0
+```
+
+Note: It is recommended to freeze a container image version into the
+top-level dataset to simplify reuse. If `///repronim/containers` is
+modified in any way, the author must ensure that their altered fork of
+`///repronim/containers` is publicly available and that its URL is
+specified in `.gitmodules`. By freezing into the top-level dataset
+instead, authors do not need to host a modified version of
+`///repronim/containers`.
+
+#### Step 6: Running the Containers
+
+When we run the bids-mriqc container, it will need a working directory
+for intermediate files. These are not helpful to commit, so we will
+tell `git` (and `datalad`) to ignore the whole directory.
+
+```bash
+echo "workdir/" > .gitignore && datalad save -m "Ignore workdir" .gitignore
+```
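+
+To see what this ignore rule does without touching the dataset, the sketch below reproduces it in a throwaway Git repository. The scratch directory and the file name `intermediate.tmp` are made up for illustration; inside the real dataset, the same `git check-ignore` call can be pointed at any path under `workdir/`.
+
+```bash
+# Build a scratch repository with the same ignore rule as the dataset
+scratch=$(mktemp -d)
+git init -q "$scratch"
+echo "workdir/" > "$scratch/.gitignore"
+mkdir -p "$scratch/workdir"
+touch "$scratch/workdir/intermediate.tmp"
+
+# check-ignore prints the path (and exits 0) only when an ignore rule matches it
+ignored=$(git -C "$scratch" check-ignore workdir/intermediate.tmp)
+echo "ignored: $ignored"
+
+# Only .gitignore shows up as untracked; nothing under workdir/ is listed
+untracked=$(git -C "$scratch" status --porcelain)
+echo "$untracked"
+
+rm -rf "$scratch"
+```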
+
+Now we use `datalad containers-run` to perform the analysis.
+
+```bash
+datalad containers-run \
+        -n bids-mriqc \
+        --input sourcedata \
+        --output . \
+        '{inputs}' '{outputs}' participant group -w workdir
+```
+
+If everything worked as expected, we will now see our new analysis results, along with
+a commit message recording how they were obtained. All of this is contained within
+a single (nested) dataset with a complete record of how all the data was
+obtained.
+
+```shell
+(git) .../ds000003-qc[master] $ git show --quiet
+Author: Austin <austin@dartmouth.edu>
+Date:   Wed Jun 5 15:41:59 2024 -0400
+
+    [DATALAD RUNCMD] ./code/containers/scripts/singularity_cm...
+
+    === Do not change lines below ===
+    {
+     "chain": [],
+     "cmd": "./code/containers/scripts/singularity_cmd run code/containers/images/bids/bids-mriqc--0.16.0.sing '{inputs}' '{outputs}' participant group -w workdir",
+     "dsid": "c9c96ab9-f803-43ba-83e2-2eaec7ab4725",
+     "exit": 0,
+     "extra_inputs": [
+      "code/containers/images/bids/bids-mriqc--0.16.0.sing"
+     ],
+     "inputs": [
+      "sourcedata"
+     ],
+     "outputs": [
+      "."
+     ],
+     "pwd": "."
+    }
+    ^^^ Do not change lines above ^^^
+```
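+
+The run record in that commit message is plain JSON between two marker lines, so it can be pulled out with standard tools. The sketch below is one possible approach, not a DataLad feature: `extract_run_record` is a hypothetical helper, and the sample message is a shortened, made-up record used only to show the mechanics.
+
+```bash
+# Hypothetical helper: print only the JSON between DataLad's marker lines
+extract_run_record() {
+    sed -n '/=== Do not change lines below ===/,/\^\^\^ Do not change lines above \^\^\^/p' \
+        | sed '1d;$d'   # drop the two marker lines themselves
+}
+
+# Made-up commit message with the same layout as the record shown above
+sample_message='[DATALAD RUNCMD] demo command
+
+=== Do not change lines below ===
+{
+ "cmd": "echo hello",
+ "exit": 0
+}
+^^^ Do not change lines above ^^^'
+
+printf '%s\n' "$sample_message" | extract_run_record
+```
+
+In the dataset itself, the same helper could be fed from `git log -1 --format=%B`.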
+
+This record can later be reused (by anyone) with [datalad rerun] to rerun
+the computation using exactly the same version(s) of the input data and the
+Singularity container. You can even [datalad uninstall] the sourcedata and containers
+subdatasets to save space; they remain retrievable at those exact versions
+if you later need to extend or redo your analysis.
+
+#### Notes
+
+- the example above requires DataLad >= 0.11.5 and datalad-container >= 0.4.0;
+- for a more elaborate example that uses [reproman] to parallelize execution on
+  remote resources, see [ReproNim/reproman PR#438](https://github.com/ReproNim/reproman/pull/438);
+- a copy of the dataset is available from [`///repronim/ds000003-qc`](http://datasets.datalad.org/?dir=/repronim/ds000003-qc)
+  and [https://github.com/ReproNim/ds000003-qc](https://github.com/ReproNim/ds000003-qc).
+
+[reproman]: http://reproman.repronim.org
+[datalad rerun]: http://docs.datalad.org/projects/container/en/latest/generated/man/datalad-rerun.html
+[datalad uninstall]: http://docs.datalad.org/projects/container/en/latest/generated/man/datalad-uninstall.html