Skip to content

apuapaquola/rf

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

55 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

rf - A minimalist framework for reproducible computation

pip install git+https://github.com/apuapaquola/rf.git

Preprint

http://biorxiv.org/content/early/2015/12/09/033654

Overview

We propose a simple and intuitive way to organize computational analyses using a directory structure constructed according to 3 simple principles.

Consider the following directory structure:

.
└── nodeA
    ├── _h
    ├── _m
    ├── nodeB
    │   ├── _h
    │   └── _m
    └── nodeC
        ├── _h
        └── _m

In this tree, nodeA has two children: nodeB and nodeC.

We can think of these nodes as steps in a computational pipeline, in which nodeB and nodeC depend on the results of computation performed in nodeA.

This is principle 1: use of a directory structure to represent dependencies between analysis steps.

Each node has two special subdirectories: _h and _m with distinct purposes. We put documentation, code and other human-generated data that describe this analysis step in directory _h. For this reason, we call _h the "human" directory. Similarly, we use directory _m to store the results of computation of this analysis step. For this reason, we call _m the "machine" directory.

This is principle 2: separation of user-generated data from program-generated data.

In the "human" directory we put a file named run. run is a script that is supposed to be run without arguments from the "machine" directory. This script is responsible to call the necessary programs that will do the computation in the analysis step and generate the contents of _m.

This is principle 3: use of driver scripts. [doi: 10.1371/journal.pcbi.1000424]

These 3 principles are desirable to help keep analysis organized, reproducible and easier to understand.

A directory structure is a intuitive way to represent data dependencies. Let's say we are at some _m directory looking at output files, and we wonder how these files were generated. A pwd command will display the full path to that directory, which has a sequence of names of analysis steps involved in the generation of these files.

Separation of computer-generated data from human-generated data is also nice. It is a way to make sure that users don't edit output files. It is also useful to know which files are program-generated, so we know which files are OK to delete because they can be computed again.

Running driver scripts without arguments is a way to make sure computation doesn't depend on manually specified parameters, which are easy to forget.

Version control

The separation of computer-generated data from human-generated data makes it easy to use version control systems for an analysis tree.

In the current implementation we use git for _h and git-annex for _m. For some operations that involve more than one call to git or git annex, we provide a wrapper command rf.

Using git, users can collaborate and share analyses trees in a similar they can do with code.

Status

This framework is in early stage of development, and contributors are very welcome.

Current work includes, and some of these will be available soon:

  • Apptainer and Docker support.
  • More and better documentation.
  • Concrete examples of analysis, mostly focusing on Bioinformatics.
  • Use cases.
  • Improvement of the manuscript.
  • Support for git lfs.

Contributing

Please do!

License

GNU GPLv3

Releases

No releases published

Packages

No packages published

Languages