diff --git a/docs/basics/101-108-run.rst b/docs/basics/101-108-run.rst
index 1d9fde343..1550479f8 100644
--- a/docs/basics/101-108-run.rst
+++ b/docs/basics/101-108-run.rst
@@ -39,7 +39,7 @@
 to generate a text file that lists speaker and title name instead.
 
 To do this, we're following a best practice that will reappear in the
-later section on YODA principles (todo: link): Collecting all
+later section on `YODA principles <101-123-yoda.html>`_: Collecting all
 additional scripts that work with content of a subdataset *outside*
 of this subdataset, in a dedicated ``code/`` directory,
 and collating the output of the execution of these scripts
diff --git a/docs/basics/101-122-intro.rst b/docs/basics/101-122-intro.rst
new file mode 100644
index 000000000..053b128f2
--- /dev/null
+++ b/docs/basics/101-122-intro.rst
@@ -0,0 +1,23 @@
+.. _intromidterm:
+
+A Data Analysis Project with DataLad
+------------------------------------
+
+
+Time flies and the semester rapidly approaches the midterms.
+In DataLad-101, students are not given an exam -- instead, they are
+asked to complete and submit a data analysis project with DataLad.
+
+The lecturer hands out the requirements: The project needs to
+
+- be prepared in the form of a DataLad dataset,
+- contain a data analysis performed with Python tools,
+- incorporate DataLad whenever possible (data retrieval, publication,
+  script execution, general version control), and
+- comply with the YODA principles.
+
+Luckily, the midterms are only in a couple of weeks, and much of what
+the project requires will be taught in the upcoming sessions.
+Therefore, there's little you can do to prepare for the midterm
+other than to be extra attentive in the upcoming lectures on the YODA
+principles and DataLad's Python API.
\ No newline at end of file
diff --git a/docs/basics/101-123-yoda.rst b/docs/basics/101-123-yoda.rst
new file mode 100644
index 000000000..edca98b04
--- /dev/null
+++ b/docs/basics/101-123-yoda.rst
@@ -0,0 +1,461 @@
+.. _yoda:
+
+
+YODA: Best practices for data analyses in a dataset
+---------------------------------------------------
+
+
+
+The last requirement for the midterm projects reads "comply with the
+YODA principles".
+"What are the YODA principles?" you ask, as you have never heard of this
+before.
+"The topic of today's lecture: Organizational principles of data
+analyses in DataLad datasets. This lecture will show you the basic
+principles behind creating, sharing, and publishing reproducible,
+understandable, and open data analysis projects with DataLad."
+
+The starting point...
+^^^^^^^^^^^^^^^^^^^^^
+
+Data analysis projects are very common, both in science and industry.
+But it can be very difficult to produce a reproducible, let alone
+*comprehensible* data analysis project.
+Many data analysis projects do not start out with
+a stringent organization, or fail to keep the structural organization of a
+directory intact as the project develops. Often, this is due to a lack of
+version control. In these cases, a project will quickly end up
+with many
+`almost-identical scripts suffixed with "_version_xyz" `_,
+or a chaotic results structure split between various directories with names
+such as ``results/``, ``results_August19/``, ``results_revision/`` and
+``now_with_nicer_plots/``. A directory tree like the following, for example,
+is a very common shape for a data science project to take after a while:
+
+.. code-block:: bash
+
+   ├── code/
+   │   ├── code_final/
+   │   │   ├── final_2/
+   │   │   │   ├── main_script_fixed.py
+   │   │   │   └── takethisscriptformostthingsnow.py
+   │   │   ├── utils_new.py
+   │   │   ├── main_script.py
+   │   │   ├── utils_new.py
+   │   │   ├── utils_2.py
+   │   │   └── main_analysis_newparameters.py
+   │   └── main_script_DONTUSE.py
+   ├── data/
+   │   ├── data_updated/
+   │   │   └── dataset1/
+   │   │       ├── datafile_a
+   │   ├── dataset1/
+   │   │   ├── datafile_a
+   │   ├── outputs/
+   │   │   ├── figures/
+   │   │   │   ├── figures_new.py
+   │   │   │   └── figures_final_forreal.py
+   │   │   ├── important_results/
+   │   │   │   ├── random_results_file.tsv
+   │   │   │   ├── results_for_paper/
+   │   │   │   ├── results_for_paper_revised/
+   │   │   │   └── results_new_data/
+   │   ├── random_results_file.tsv
+   │   ├── random_results_file_v2.tsv
+
+   [...]
+
+All data analysis endeavours in directories like this *can* work, for a while,
+if there is a person who knows the project well, and works on it all the time.
+But it inevitably will get messy once anyone tries to collaborate on a project
+like this, or simply goes on a two-week vacation and forgets whether
+the function in ``main_analysis_newparameters.py`` or the one in
+``takethisscriptformostthingsnow.py`` was the one that created a particular figure.
+
+But even if a project has an intuitive structure, and *is* version
+controlled, in many cases an analysis script will stop working, or, maybe worse,
+will produce different results, because the software and tools used to
+conduct the analysis in the first place got an update. This update may have
+come with software changes that made functions stop working, or work differently
+than before.
+In the same vein, recomputing an analysis project on a different machine than
+the one the analysis was developed on can fail if the necessary
+software in the required versions is not installed or available on this new machine.
+The analysis might depend on software that runs on a Linux machine, but the project
+was shared with a Windows user. The environment during analysis development used
+Python 2, but the new system has only Python 3 installed. Or one of the dependent
+libraries needs to be in version X, but is installed as version Y.
+
+The YODA principles are a clear set of organizational standards for
+datasets used for data analysis projects that aim to overcome issues like the
+ones outlined above. The name stands for
+"YODAs Organigram on Data Analysis" [#f1]_. The principles outlined
+in YODA set simple rules for directory names and structures, define
+best practices for version-controlling dataset elements and analyses,
+facilitate the use of tools to improve the reproducibility and
+accountability of data analysis projects, and make collaboration easier.
+They are summarized in three basic principles that translate to both
+dataset structures and best practices regarding the analysis:
+
+- `P1: One thing, one dataset `_
+
+- `P2: Record where you got it from, and where it is now `_
+
+- `P3: Record what you did to it, and with what `_
+
+As you will see, complying with these principles is easy if you
+use DataLad. Let's go through them one by one:
+
+P1: One thing, one dataset
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Whenever a particular collection of files could be useful in more
+than one context, make them a standalone, modular component.
+In the broadest sense, this means to structure your study elements (data, code,
+computational environments, results, ...) in dedicated directories. For example:
+
+
+- Store **input data** for an analysis in a dedicated ``inputs/`` directory.
+  Keep different formats or processing stages of your input data as individual,
+  modular components: Don't mix raw data, data that is already structured
+  following community guidelines of the given field, and preprocessed data, but
+  create one data component for each of them. And if your analysis
+  relies on two or more data collections, these collections should each be an
+  individual component, not combined into one.
+
+- Store scripts or **code** used for the analysis of data in a dedicated ``code/``
+  directory, outside of the data component of the dataset.
+
+- Collect **results** of an analysis in a dedicated ``outputs/`` directory, and
+  leave the input data of an analysis untouched by your computations.
+
+- Include a place for complete **execution environments**, for example
+  `Singularity images `_ or
+  `Docker containers `_ [#f2]_, in
+  the form of an ``envs/`` directory, if relevant for your analysis.
+
+- And if you conduct multiple different analyses, create a dedicated
+  project for each analysis, instead of conflating them.
+
+This, for example, would be a directory structure from the root of a
+superdataset of a very comprehensive [#f3]_
+data analysis project complying with the YODA principles:
+
+.. code-block:: bash
+
+   ├── ci/                   # continuous integration configuration
+   │   └── .travis.yml
+   ├── code/                 # your code
+   │   ├── tests/            # unit tests to test your code
+   │   │   └── test_myscript.py
+   │   └── myscript.py
+   ├── docs/                 # documentation about the project
+   │   ├── build/
+   │   └── source/
+   ├── envs/                 # computational environments
+   │   └── Singularity
+   ├── inputs/               # dedicated inputs/, will not be changed by an analysis
+   │   └── data/
+   │       ├── dataset1/     # one stand-alone data component
+   │       │   └── datafile_a
+   │       └── dataset2/
+   │           └── datafile_a
+   ├── outputs/              # outputs away from the input data
+   │   └── important_results/
+   │       └── figures/
+   ├── CHANGELOG.md          # notes for fellow humans about your project
+   ├── HOWTO.md
+   └── README.md
+
+
+There are many advantages to this modular way of organizing contents.
+
+Having input data as independent components that are not altered (only
+consumed) by an analysis does not conflate the data for
+an analysis with the results or the code, and thus helps anyone
+unfamiliar with the project to understand it.
+But more than just structure, this organization aids modular reuse or
+publication of the individual components, for example data. In a
+YODA-compliant dataset, any processing stage of a data component can
+be reused in a new project or published and shared. The same is true
+for a whole analysis dataset. At one point you might also write a
+scientific paper about your analysis, and the
+whole analysis project can easily become a modular component in a paper
+project, to make sharing paper, code, data, and results easy.
+The use case :ref:`remodnav` contains step-by-step instructions on
+how to build and share such a reproducible paper, if you want to learn
+more.
+
+.. figure:: ../img/dataset_modules.svg
+   :figwidth: 100%
+   :alt: Modular structure of a data analysis project
+
+   Data are modular components that can be re-used easily.
+
+The directory tree above and Figure 3 highlight different aspects
+of this principle. The directory tree illustrates the structure of
+the individual pieces on the file system from the point of view of
+a single top-level dataset with a particular purpose. For example, it
+could be an analysis dataset created by a statistician for a scientific
+project, and it could be shared between collaborators or
+with others during development of the project. In this
+superdataset, code is created that operates on input data to
+compute outputs, and the code and outputs are captured,
+version-controlled, and linked to the input data. Each input dataset
+in turn is a (potentially nested) subdataset, but this is not visible
+in the directory hierarchy.
+
+Figure 3, in comparison, emphasizes a process view on a project and
+the nested structure of input subdatasets:
+You can see how the preprocessed data that serves as an input for
+the analysis datasets evolves from raw data, to a
+standardized data organization, to its preprocessed state. Within
+the ``data/`` directory of the file system hierarchy displayed
+above one would find data datasets with their previous version as
+a subdataset, and this is repeated recursively until one reaches
+the raw data as it was originally collected at one point. A finished
+analysis project in turn can be used as a component (subdataset) in
+a paper project, such that the paper is a fully reproducible research
+object that shares code, analysis results, and data, as well as the
+history of all of these components.
+
+Principle 1, therefore, encourages you to structure data analysis
+projects in a clear and modular fashion that makes use of nested
+DataLad datasets, yielding comprehensible structures and re-usable
+components. Having each component version-controlled --
+regardless of size -- will help keep directories clean and
+organized, instead of piling up different versions of code, data,
+or results.
+
+P2: Record where you got it from, and where it is now
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+It's good to have data, but it's even better if you and anyone you
+collaborate or share the project or its components with can find
+out where the data came from, or how it
+is dependent on or linked to other data. Therefore, this principle
+aims to attach this information to the components of
+your data analysis project.
+
+Luckily, this is a no-brainer with DataLad, because the core data structure
+of DataLad, the dataset, and many of the DataLad commands already covered
+up to now fulfill this principle.
+
+If data components of a project are DataLad datasets, they can
+be included in an analysis superdataset as subdatasets. Thanks to
+:command:`datalad install`, information on the source of these subdatasets
+is stored in the history of the analysis superdataset, and they can even be
+updated from those sources if the original data dataset gets extended or changed.
+If you are including a file, for example code from GitHub,
+the :command:`datalad download-url` command (introduced in section :ref:`sibling`)
+will safely record its source in the dataset's history. And if you add anything to your dataset,
+from simple incremental coding progress in your analysis scripts up to
+files that a colleague sent you via email, a plain :command:`datalad save`
+with a helpful commit message already goes a very long way toward
+fulfilling this principle on its own.
+
+One core aspect of this principle is *linking* between re-usable data
+resource units (i.e., DataLad subdatasets containing pure data). You will
+be happy to hear that this is achieved by simply installing datasets
+as subdatasets.
+This part of the principle will therefore be absolutely obvious to you,
+because you already know how to install and nest datasets within datasets.
+"I might just overcome my impostor syndrome if I experience such advanced
+reproducible analysis concepts as being obvious", you think with a grin.
+
+
+.. figure:: ../img/data_origin.svg
+   :figwidth: 50%
+   :alt: Datasets are installed as subdatasets
+
+   Schematic illustration of two standalone data datasets installed as subdatasets
+   into an analysis project.
+
+But more than linking datasets in a superdataset, linkage also needs to
+be established between components of your dataset. Scripts inside
+your ``code/`` directory should point to data not via :term:`absolute path`\s
+that would only work on your system, but instead via :term:`relative path`\s
+that will work in any shared copy of your dataset. The next section
+on DataLad's Python API will show concrete examples of this.
+
+Lastly, this principle also includes *moving*, *sharing*, and *publishing* your
+dataset or its components.
+It is usually costly to collect data, and economically unfeasible [#f4]_ to keep
+it locked in a drawer (or similarly out of reach behind complexities of
+data retrieval or difficulties in understanding the data structure).
+But conducting several projects on the same dataset yourself, sharing it with
+collaborators, or publishing it is easy if the project is a DataLad dataset
+that can be installed and retrieved on demand, and is kept clean from
+everything that is not part of the data according to principle 1.
+Conducting transparent open science is easier if you can link code, data,
+and results within a dataset, and share everything together. In conjunction
+with principle 1, this means that you can distribute your analysis projects
+(or parts of them) in a comprehensible form.
+
+.. figure:: ../img/decentralized_publishing.svg
+   :figwidth: 100%
+   :alt: A full data analysis workflow complying with YODA principles
+
+   In a dataset that complies with the YODA principles, modular components
+   (data, analysis results, papers) can be shared or published easily.
+
+Principle 2, therefore, facilitates transparent linkage of datasets and their
+components to other components, their original sources, or shared copies.
+With the DataLad tools you learned to master up to this point,
+you already have all the necessary skills to comply with it.
+
+P3: Record what you did to it, and with what
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This last principle is about capturing, for every file that was not
+obtained from elsewhere, *how exactly its content came to be*. For example,
+this relates to results generated from inputs by scripts or commands.
+The section :ref:`run` already outlined the problem of associating
+a result with an input and a script. It can be difficult to link a
+figure from your data analysis project with an input data file or a
+script, even if you created this figure yourself.
+The :command:`datalad run` command, however, mitigates these difficulties,
+and captures the provenance of any output generated with a
+``datalad run`` call in the history of the dataset. Thus, by using
+:command:`datalad run` in analysis projects, your dataset knows
+which result was generated when, by which author, from which inputs,
+and by means of which command.
+
+With another DataLad command one can even go one step further:
+The command :command:`datalad containers-run` (it will be introduced in
+a later part of the book) performs a command execution within
+a configured containerized environment. Thus, not only inputs,
+outputs, command, time, and author, but also the *software environment*
+are captured as provenance of a dataset component such as a results file,
+and, importantly, can be shared together with the dataset in the
+form of a software container.
+
+
+
+.. figure:: ../img/yoda.svg
+   :figwidth: 30%
+   :alt: A very cute YODA
+
+   “Feel the force!”
+
+With this last principle, your dataset collects and stores provenance
+of all the contents you created in the wake of your analysis project.
+This establishes trust in your results, and enables others to understand
+where files derive from.
+
+The YODA procedure
+^^^^^^^^^^^^^^^^^^
+
+There is one tool that can make starting a YODA-compliant data analysis
+easier: DataLad's ``yoda`` procedure. Just like the ``text2git`` procedure
+from section :ref:`createds`, the ``yoda`` procedure can be included in a
+:command:`datalad create` command and will apply useful configurations
+to your dataset:
+
+.. code-block:: bash
+
+   $ datalad create -c yoda "my_analysis"
+
+   [INFO ] Creating a new annex repo at /home/adina/repos/testing/my_analysis
+   create(ok): /home/adina/repos/testing/my_analysis (dataset)
+   [INFO ] Running procedure cfg_yoda
+   [INFO ] == Command start (output follows) =====
+   [INFO ] == Command exit (modification check follows) =====
+
+Let's take a look at what configurations and changes come with this procedure:
+
+.. code-block:: bash
+
+   $ tree -a
+
+   .
+   ├── .gitattributes
+   ├── CHANGELOG.md
+   ├── code
+   │   ├── .gitattributes
+   │   └── README.md
+   └── README.md
+
+Let's take a closer look into the ``.gitattributes`` files:
+
+.. code-block:: bash
+
+   $ less .gitattributes
+
+   **/.git* annex.largefiles=nothing
+   CHANGELOG.md annex.largefiles=nothing
+   README.md annex.largefiles=nothing
+
+   $ less code/.gitattributes
+
+   * annex.largefiles=nothing
+
+Summarizing these two glimpses into the dataset, this configuration has
+
+#. included a code directory in your dataset,
+#. included three files for human consumption (two ``README.md`` files and a
+   ``CHANGELOG.md``),
+#. configured everything in the ``code/`` directory to be tracked by Git, not
+   git-annex [#f5]_, and
+#. configured ``README.md`` and ``CHANGELOG.md`` in the root of the dataset to be
+   tracked by Git.
+
+Your next data analysis project can thus get a head start with useful configurations
+and the beginning of a comprehensible directory structure by applying the ``yoda`` procedure.
+
+
+
+
+
+
+
+
+Sources
+^^^^^^^
+This section is based on this comprehensive
+`poster `_ and these publicly
+available `slides `_ about the
+YODA principles.
+
+
+.. rubric:: Footnotes
+
+.. [#f1] "Why does the acronym contain itself?" you ask, confused.
+   "That's because it's a `recursive acronym `_,
+   where the first letter stands recursively for the whole acronym," you get in response.
+   "This is a reference to the recursiveness within a DataLad dataset -- all principles
+   apply recursively to all the subdatasets a dataset has."
+   "And what does all of this have to do with Yoda?" you ask, mildly amused.
+   "Oh, well. That's just because the DataLad team is full of geeks."
+
+.. [#f2] If you want to learn more about Docker and Singularity, or want general information
+   about containerized computational environments for reproducible data science,
+   check out `this section `_
+   in the wonderful book `The Turing Way `_,
+   a comprehensive guide to reproducible data science.
+
+.. [#f3] This directory structure is very comprehensive, and displays many best practices for
+   reproducible data science. For example,
+
+   #. Within ``code/``, it is best practice to add **tests** for the code.
+      These tests can be run to check whether the code still works.
+
+   #. It is even better to further use automated computing, for example
+      `continuous integration (CI) systems `_,
+      to test the functionality of your functions and scripts automatically.
+      If relevant, the setup for continuous integration frameworks (such as
+      `Travis `_) lives outside of ``code/``,
+      in a dedicated ``ci/`` directory.
+
+   #. Include **documents for fellow humans**: Notes in a README.md or a HOWTO.md,
+      or even proper documentation (for example in a dedicated ``docs/`` directory).
+      Within these documents, include all relevant metadata for your analysis. If you are
+      conducting a scientific study, this might be authorship, funding,
+      a change log, etc.
+
+   If writing tests for analysis scripts or using continuous integration
+   is a new idea for you, but you want to learn more, check out
+   `this excellent chapter on testing `_
+   in the book `The Turing Way `_.
+
+.. [#f4] Substitute "unfeasible" with *wasteful*, *impractical*, or simply *stupid* if preferred.
+
+.. [#f5] To re-read how ``.gitattributes`` works, go back to section :ref:`config`, and to remind yourself
+   about how this worked for the ``text2git`` configuration, go back to section :ref:`text2git`.
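To make the testing idea from footnote [#f3] concrete, here is a sketch of what a tiny analysis helper in ``code/`` and its matching test in ``code/tests/`` could look like (the names and the function are hypothetical illustrations, not part of the ``yoda`` procedure):

```python
# Hypothetical contents of code/myscript.py: a small analysis helper ...
def zscore(values):
    """Standardize a list of numbers to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]


# ... and of code/tests/test_myscript.py: a check that the helper behaves
# as expected, runnable with a test runner such as pytest.
def test_zscore():
    out = zscore([1.0, 2.0, 3.0])
    assert abs(sum(out)) < 1e-9            # standardized values have zero mean
    assert abs(out[2] - 1.2247448) < 1e-6  # largest value maps to sqrt(3/2)
```

A test like this is cheap to write, and a CI service can then run it automatically on every change to the code in your dataset.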
diff --git a/docs/basics/101-124-summary.rst b/docs/basics/101-124-summary.rst
new file mode 100644
index 000000000..fce42fac2
--- /dev/null
+++ b/docs/basics/101-124-summary.rst
@@ -0,0 +1,53 @@
+.. _summary_yoda:
+
+Summary: YODA principles
+------------------------
+
+The YODA principles are a small set of guidelines that can make a huge
+difference toward reproducibility, comprehensibility, and transparency
+in a data analysis project.
+
+These standards are not complex -- quite the opposite, they are very
+intuitive. They structure essential components of a data analysis project --
+data, code, computational environments, and lastly also the results --
+in a modular and practical way, and use basic principles and commands
+of DataLad you are already familiar with.
+
+There are many advantages to this organization of contents.
+
+- Having input data as independent dataset(s) that are not influenced (only
+  consumed) by an analysis allows for modular reuse of pure data datasets,
+  and does not conflate the data of an analysis with the results or the code.
+
+- Keeping code within an independent, version-controlled directory, but as a part
+  of the analysis dataset, makes sharing code easy and transparent, and helps
+  to keep directories neat and organized. Moreover,
+  with the data as subdatasets, data and code can be automatically shared together.
+
+- Including the computational environment in an analysis dataset encapsulates
+  software and software versions, and thus prevents re-computation failures
+  (or sudden differences in the results) once
+  software is updated, as well as software conflicts arising on machines other
+  than the one the analysis was originally conducted on.
+
+- Having all of these components as part of a DataLad dataset allows version
+  controlling all pieces within the analysis regardless of their size, and
+  generates provenance for everything, especially if you make use of the tools
+  that DataLad provides.
+
+- The ``yoda`` procedure is a good starting point to build your next data analysis
+  project on.
+
+Now what can I do with it?
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Using the tools that DataLad provides, you are able to make the most of
+your data analysis project. The YODA principles are a guide to accompany
+you on your path to reproducibility.
+
+What should have become clear in this section is that you are already
+equipped with enough DataLad tools and knowledge that complying with these
+standards will feel completely natural and effortless in your next analysis
+project.
+The next section will add to your existing skills by demonstrating how to
+use DataLad from within Python scripts.
\ No newline at end of file
diff --git a/docs/contents.rst.inc b/docs/contents.rst.inc
index b632ffc7d..2f86c0fb8 100644
--- a/docs/contents.rst.inc
+++ b/docs/contents.rst.inc
@@ -93,6 +93,18 @@ Help yourself
 
    basics/101-135-help
 
+#########################
+Data analyses in datasets
+#########################
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Organizational principles and best practices for data analyses
+
+   basics/101-122-intro
+   basics/101-123-yoda
+   basics/101-124-summary
+
 #########
 Use Cases
 #########
diff --git a/docs/img/data_origin.svg b/docs/img/data_origin.svg
new file mode 100644
index 000000000..4c01d0708
--- /dev/null
+++ b/docs/img/data_origin.svg
@@ -0,0 +1,303 @@
+[SVG markup omitted: schematic of two standalone data datasets (labeled 1 and 2) installed as subdatasets]
diff --git a/docs/img/dataset_modules.svg b/docs/img/dataset_modules.svg
new file mode 100644
index 000000000..38c4d525c
--- /dev/null
+++ b/docs/img/dataset_modules.svg
@@ -0,0 +1,5153 @@
+[SVG markup omitted: modular structure of a data analysis project]
diff --git a/docs/img/decentralized_publishing.svg b/docs/img/decentralized_publishing.svg
new file mode 100644
index 000000000..15d8bbf31
--- /dev/null
+++ b/docs/img/decentralized_publishing.svg
@@ -0,0 +1,7332 @@
+[SVG markup omitted: decentralized publishing workflow -- raw data, normalized and preprocessed
+data, analyses A and B, and papers A and B, published to public cloud storage, local shared
+access storage, institutional storage, an archive, and a virtual data portal, with metadata
+access and data access paths]
diff --git a/docs/img/yoda.png b/docs/img/yoda.png
new file mode 100644
index 000000000..f4e11937f
Binary files /dev/null and b/docs/img/yoda.png differ
diff --git a/docs/img/yoda.svg b/docs/img/yoda.svg
new file mode 100644
index 000000000..4a433c8b7
--- /dev/null
+++ b/docs/img/yoda.svg
@@ -0,0 +1,155 @@
+[SVG markup omitted: a very cute YODA]
diff --git a/docs/intro/narrative.rst b/docs/intro/narrative.rst
index 68a0c037a..6e749f824 100644
--- a/docs/intro/narrative.rst
+++ b/docs/intro/narrative.rst
@@ -109,6 +109,51 @@
 an additional chapter if you believe it's a worthwhile addition, or
 with a :command:`datalad` tag if you need help.
 
+What you will learn in this book
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This handbook will teach you simple yet advanced principles of data
+management for reproducible, comprehensible, transparent, and
+`FAIR `_ data
+projects. It does so through hands-on use of DataLad and its
+underlying software, blended with clear explanations of the relevant
+theoretical background whenever necessary, and by demonstrating
+organizational and procedural guidelines and standards for data-related
+projects on concrete examples.
+
+You will learn how to create, consume, structure, share, publish, and use
+*DataLad datasets*: modular, reusable components that can be version-controlled,
+linked, and that are able to capture and track full provenance of their
+contents, if used correctly.
+
+At the end of the ``Basics`` section, these are some of the main
+things you will know how to do, and understand why doing them is useful:
+
+- **Version-control** data objects, regardless of size, keep track of
+  and **update** (from) their sources and shared copies, and capture the
+  **provenance** of all data objects, whether you consume them from any source
+  or create them yourself.
+
+- **Build up complete projects** with data as independent, version-controlled,
+  provenance-tracked, and linked DataLad dataset(s) that allow **distribution** and
+  modular **reuse**, and that are **transparent** both in their structure and in
+  their development to their current and future states.
+
+- **Bind** modular components into complete data analysis projects, and comply
+  with procedural and organizational principles that will help to create transparent
+  and comprehensible projects to ease **collaboration** and **reproducibility**.
+
+- **Share** complete data projects, version-controlled as a whole but including
+  modular components such as data, in a way that preserves the history,
+  provenance, and linkage of their components.
+
+At the end of this handbook, you will find it easy to create, build up, and
+share intuitively structured and version-controlled data projects that
+fulfill high standards for reproducibility and FAIRness. You will be able to
+decide for yourself how deep into the DataLad world you want to dive,
+based on your individual use cases, and with every section you will learn
+more about state-of-the-art data management.
+
 The storyline
 ^^^^^^^^^^^^^
diff --git a/docs/usecases/reproducible-paper.rst b/docs/usecases/reproducible-paper.rst
index 2d4c2cb5c..2cd6979cb 100644
--- a/docs/usecases/reproducible-paper.rst
+++ b/docs/usecases/reproducible-paper.rst
@@ -444,7 +444,7 @@ Any questions can be asked by `opening an issue
 
+.. [#f1] You can read up on the YODA principles again in section :ref:`yoda`.
 .. [#f2] You can read up on installing datasets as subdatasets again in section :ref:`installds`.