The third and final step of this journey is presenting your research to the community. Your main goal should be to share and maintain an open and reproducible project that can sustain community engagement over time. In this section, we distinguish three sub-goals, namely making your research (1) accessible, (2) reproducible, and (3) sustainable. The last of these is especially relevant when your research involves developing code that will be used by others in the future (e.g., a tool or workflow), but we believe that our recommendations are relevant to any computational biology project.
Making your research accessible includes ensuring that anyone can access your research long after your paper is published. It is extremely frustrating for any researcher to look for software or a set of scripts from a paper published a few years ago, only to find a “404 error” when accessing the source weblink. Equally frustrating is when authors offer code as “available upon reasonable request,” as this often leads to dead ends and unavailable code.
There are three main ways to publish accompanying code: as supplementary material of the manuscript, on privately owned domains, or in public repositories. Publishing code as supplementary material has low accessibility for non-open access papers. Moreover, the code will remain completely static and cannot be updated with new features or to correct errors. Making code available via privately owned domains lacks sustainability, as it requires maintenance of the domain. Therefore, in addition to providing the code as supplementary material and/or via private domains, we recommend uploading it to public repositories, enabling open access and sustainability over time. There are several hosting services for this purpose [@https://github.com;@https://gitlab.com;@https://bitbucket.org] (Table @tbl:community-tools-1), all equally valid; the choice typically depends on established practices in your specific field.
Goal | Tool options | Additional remarks |
---|---|---|
Publish your code | • GitHub [@https://github.com] • GitLab [@https://gitlab.com] • Bitbucket [@https://bitbucket.org] | All three options allow you to host your repository online for free. Choose whichever is more common in your own field. |
Introduce your code | • README file [@https://www.makeareadme.com]: First file that shows up in a repository. | Provide a landing page to any repository with a short overview of the code (installation, usage, acknowledgments, etc). |
Share your code | • Several licensing options [@https://choosealicense.com/licenses]. | Indicate with a license file what restrictions apply when using your code. If you don't include this, you will lose many users. |
Archive your code | • GitHub Releases [@https://docs.github.com/github/administering-a-repository/managing-releases-in-a-repository] • Zenodo [@https://zenodo.org]: Provides DOI. • figshare [@https://figshare.com/about]: Provides DOI. | Share progressive stable versions of your code as you develop it. Use semantic versioning [@https://semver.org] for assigning standard identifiers to your releases. |
Publish a tool | • PyPI [@https://pypi.org]: Python. • CRAN [@https://cran.r-project.org]: R. • Bioconductor [@https://www.bioconductor.org]: R. • Bioconda [@https://bioconda.github.io]: Language-agnostic. | Produce a package easy to install and use. Especially useful if you think you could have a userbase that will run the same analysis as you on other datasets and/or conditions. |
Publish an interactive web app | • Dash [@https://plotly.com/dash]: Python. • R-Shiny [@https://shiny.rstudio.com]: R. | Provide easy and interactive data exploration to your users. Especially useful if you have large datasets that can be explored in different ways. |
Table: Tools for making your research accessible. {#tbl:community-tools-1}
When publishing your code in a public repository, two files are fundamental to include: a readme file and a license. A readme file [@https://www.makeareadme.com] introduces users to the code (Table @tbl:community-tools-1) and should include a description of its main intended use, an overview of the installation, the most commonly used commands, contact information of the developers, and acknowledgments, if appropriate. We recommend keeping the readme file short and written in a markup language such as Markdown [@https://daringfireball.net/projects/markdown/syntax] or reStructuredText [@https://docutils.sourceforge.io/rst.html] that will render automatically on the repository's landing page, below the repository file structure.
Adding a license to a repository is also a crucial step (Table @tbl:community-tools-1). Licenses indicate how the code can be used: Is it free to use for any application? Can users modify the code as they please? Does it come with a warranty that it will work? Can it be used for profit? If no license information is provided, researchers might assume that the code is free to use, but copyright law in fact prohibits use without explicit permission by the copyright holder [@https://opensource.guide/legal/]. Many options exist for licensing code [@https://choosealicense.com/licenses], from permissive licenses that allow any kind of use with few or no conditions, like the Unlicense and MIT licenses, to more restrictive licenses that require disclosing the source and releasing any adaptation of the code under the same license, like the GNU licenses. When deciding on a license, as a rule of thumb, consider that the more requirements you add, the fewer potential users you will have, but the more credit you will receive when users utilize your code for their own needs. Academic researchers must also consider what open-source licenses their university supports, as in many cases it will be the university that owns the copyright.
As a computational biologist, you will likely continue lines of work from scripts or software you have already published. For instance, you could improve the performance of a given function or add a new set of features entirely. Therefore, you should not only be interested in making your code accessible but also in having different versions available. Creating and archiving successive releases of your code (Table @tbl:community-tools-1) allows you to organize different versions of your code as you develop them. GitHub Releases [@https://docs.github.com/github/administering-a-repository/managing-releases-in-a-repository] is one way to maintain versions with minimal effort. Research repositories, such as Zenodo [@https://zenodo.org] or Figshare [@https://figshare.com/about], not only store your code, notebooks, and data, but also provide a DOI for each version, allowing it to be cited in a manuscript. This is especially useful when the publication is not available yet or the current version of the code differs widely from what was published. Research repositories can be combined with code repositories; for example, GitHub has a Zenodo integration that will trigger a new archived version every time a new version is released. Regardless of the solution, we recommend keeping a logical order of releases, using a standard such as semantic versioning [@https://semver.org].
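To make the semantic versioning rule concrete, the following minimal Python sketch shows how a MAJOR.MINOR.PATCH identifier could be incremented depending on the kind of change being released. The names `Version`, `parse`, and `bump` are hypothetical and introduced only for illustration; pre-release and build metadata tags are deliberately not handled.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A MAJOR.MINOR.PATCH identifier (pre-release/build tags not handled)."""

    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def parse(text: str) -> Version:
    """Parse a string such as '1.4.2' into a Version."""
    major, minor, patch = (int(piece) for piece in text.strip().split("."))
    return Version(major, minor, patch)


def bump(version: Version, part: str) -> Version:
    """Return the next version for a given kind of change.

    part: 'major' for incompatible changes, 'minor' for backward-compatible
    features, 'patch' for backward-compatible bug fixes.
    """
    if part == "major":
        return Version(version.major + 1, 0, 0)
    if part == "minor":
        return Version(version.major, version.minor + 1, 0)
    if part == "patch":
        return Version(version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown part: {part!r}")


if __name__ == "__main__":
    current = parse("1.4.2")
    print(bump(current, "patch"))  # 1.4.3
    print(bump(current, "minor"))  # 1.5.0
    print(bump(current, "major"))  # 2.0.0
```

Because versions are stored as tuples of integers, they compare numerically rather than alphabetically, so `parse("1.10.0") > parse("1.9.3")` holds, which a plain string comparison would get wrong.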
In most cases, providing your code as an organized set of scripts and/or notebooks is sufficient for anyone to consult if they wish to reproduce and/or re-utilize it. However, if your code might be used routinely by other researchers, for instance for studying other organisms or other experimental conditions, consider packaging your code as a tool (Table @tbl:community-tools-1) and publishing through a software repository such as Bioconda [@https://bioconda.github.io], PyPI [@https://pypi.org] if written for Python, or CRAN [@https://cran.r-project.org] and Bioconductor [@https://www.bioconductor.org] if written for R. These increase your possible userbase, as published packages are searchable and can be installed locally with minimal effort.
To increase the accessibility of results to users, an interactive web app or data dashboard can be developed (Table @tbl:community-tools-1). Such apps allow users to interact with data by displaying different sets of variables or changing parameter settings (e.g., the significance of a statistical test). Common options for this goal are Dash [@https://plotly.com/dash] for Python, R, and Julia, and Shiny [@https://shiny.rstudio.com] for R. Both platforms can include interactive graphics generated with plotly data visualization libraries [@https://plotly.com/graphing-libraries].
In addition to having accessible code/data, you also need to ensure anyone can execute your code and obtain the same results. This is especially relevant in computational biology, where users come from different backgrounds and levels of experience. A cornerstone for reproducibility is documentation that explains how the code functions and how to obtain the same results in practice. We distinguish four levels of documentation [@https://documentation.divio.com/]:
- Tutorials: A group of lessons that teach the reader how to become a user of your code;
- How-to guides: A set of documents that clarify how to solve common problems/tasks;
- Explanations: Discussions that clarify particular topics related to your code;
- References: Technical descriptions of your code's variables/classes/functions.
The extent of required documentation will depend on the number of expected users and, relatedly, can affect how many users you attract. If you foresee that your code has little usability outside of your own research, documenting each function using docstrings [@https://www.geeksforgeeks.org/python-docstrings] might be sufficient. However, if you aim for a broader userbase, you might want to add a tutorial for beginners, how-to guides for frequently used routines, and explanations for clarifying the science behind your code, which can be re-used in a manuscript. To publish comprehensive documentation online, consider using (1) a standard documentation language such as reStructuredText [@https://docutils.sourceforge.io/rst.html] or Markdown [@https://daringfireball.net/projects/markdown/syntax], and (2) a documentation platform such as Readthedocs [@https://readthedocs.org], Gitbook [@https://www.gitbook.com], or Bookdown [@https://bookdown.org/] (Table @tbl:community-tools-2). Alternatively, you can use a service like GitHub Pages [@https://pages.github.com/] to host the documentation on a dedicated website.
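As an illustration of function-level documentation, the sketch below documents a single function with a docstring that also contains runnable examples. The function `gc_content` is a hypothetical example and not part of any tool discussed here; the section layout (Parameters, Returns, Examples) follows one common convention among several.

```python
import doctest


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence.

    Parameters
    ----------
    sequence : str
        DNA sequence; case is ignored.

    Returns
    -------
    float
        Value between 0 and 1. An empty sequence returns 0.0.

    Examples
    --------
    >>> gc_content("ATGC")
    0.5
    >>> gc_content("")
    0.0
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


if __name__ == "__main__":
    # Run the examples embedded in the docstring as tests.
    print(doctest.testmod())
```

Keeping short examples inside the docstring means the reference documentation doubles as a lightweight test: if the function's behavior changes, the documented examples fail when `doctest` runs.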
Goal | Tool options | Additional remarks |
---|---|---|
Document your code | • Readthedocs [@https://readthedocs.org]: Uses reStructuredText [@https://docutils.sourceforge.io/rst.html]. • Gitbook [@https://www.gitbook.com]: Uses Markdown [@https://daringfireball.net/projects/markdown/syntax]. • Bookdown [@https://bookdown.org/]: Uses R Markdown [@https://rmarkdown.rstudio.com/]. • GitHub Pages [@https://pages.github.com]: Separate website. | Comprehensive documentation: from tutorials and how-to guides all the way down to function documentation based on all compiled docstrings [@https://www.geeksforgeeks.org/python-docstrings]. |
Reproducible environments | • Virtual environment managers: See Table 1. • pip-tools [@https://github.com/jazzband/pip-tools]: Administer several environments in a single project. | As a recommendation, try having the minimum number of dependencies needed to reproduce your results. |
Reproducible software | • Docker [@https://www.docker.com] • Singularity [@https://sylabs.io] | Package your research as a container ready to run in any computer. |
Reproducible commands | • Make [@https://www.gnu.org/software/make] | Build a program by following a series of steps in a single Makefile. |
Reproducible workflows | • Workflow management systems: See Table 1. | Run a pipeline of commands on NGS data in a reproducible way. |
Reproducible notebooks | • Interactive notebooks: See Table 2. | Make your notebooks interactive and reproducible. |
Table: Tools for making your research reproducible. {#tbl:community-tools-2}
Another key aspect of reproducibility is software and dependencies installation. To facilitate this process, you can (1) provide configuration instructions, (2) share dependencies with a virtual environment manager, or (3) share a runtime environment as a container. When setting up software from instructions, it is necessary to ensure the user follows a series of sequential commands in a specific order. To automate this process, Linux systems provide the tool GNU Make [@https://www.gnu.org/software/make/]. Virtual environment managers handle dependencies and facilitate software installation by building virtual environments from requirements files. To achieve repeatable environments, however, it is recommended to include the specific version of each software package and library, a practice known as dependency pinning. Tools such as pip-tools [@https://github.com/jazzband/pip-tools] allow you to define different Python environments for a single project depending on the type of user (e.g., end-user versus developer).
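To illustrate dependency pinning, the following sketch checks whether every entry in a pip-style requirements list fixes an exact version with `==`. It is a simplified check under stated assumptions rather than a full parser of the requirements format: the function name `unpinned_requirements` and the regular expression are hypothetical, and entries such as URLs or editable installs are not handled.

```python
import re

# A pinned requirement has the form "name==version", optionally with extras
# such as "name[extra]==version". This pattern is a simplification.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9._,-]+\])?==[A-Za-z0-9.*+!_-]+$")


def unpinned_requirements(lines):
    """Return requirement entries that do not pin an exact version."""
    problems = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        line = line.split(";", 1)[0]  # ignore environment markers
        line = "".join(line.split())  # tolerate spaces around "=="
        if not PINNED.match(line):
            problems.append(line)
    return problems


if __name__ == "__main__":
    example = [
        "numpy==1.21.2",
        "pandas>=1.3",  # a range, not a pin
        "scipy",  # no version at all
        "# a comment line",
        "biopython==1.79",
    ]
    print(unpinned_requirements(example))  # ['pandas>=1.3', 'scipy']
```

A check like this could be run before each release so that a requirements file never ships with an open-ended version range.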
Beyond dependency trackers, we recommend ensuring your tool functions as expected across computing infrastructures, even between two different operating systems (e.g., Mac and Windows). This can be achieved via containerization, also known as lightweight virtualization (Table @tbl:community-tools-2). A container is a standardized unit of software that packages an entire runtime environment, meaning everything needed to run your tool: code, dependencies, system libraries and binaries, environment variables, settings, etc. Instructions for deploying containers are stored in read-only templates called images. Two free tools available for creating containers from images are Docker [@https://www.docker.com] and Singularity [@https://sylabs.io]. While Docker is the most popular framework for containerization [@https://insights.stackoverflow.com/survey/2021], HPC clusters with shared filesystems favor Singularity, because running Docker typically requires root-level privileges, which raises security concerns on multi-user systems. In most cases, this is not a problem, since Singularity can build and run containers from Docker images.
Now that your research can be accessed and reproduced by anyone, the final step is to sustain this over time, a practice also known as code maintenance. This is especially relevant if you continue to develop tools by integrating new features requested by users, which can foster a strong community over time. However, even if your research is a self-contained project, it is important to ensure that the user community can contact you, in case bugs are discovered or parts of your code malfunction due to dependency updates (part of the "software rot" phenomenon [@https://www.techopedia.com/definition/22202/software-rot]). In the following section, we review useful techniques for making your code/software/research sustainable over time.
You can employ a variety of tools to communicate with users, depending on the size of your user base and the scope of questions/comments received (Table @tbl:community-tools-3). For smaller projects, a single-channel solution like Gitter [@https://gitter.im] offers a simple way for anyone in the community to ask questions and for the developers to answer in threads. For larger projects, however, it could become unmanageable to have all discussions in the same channel, so a multiple-channel solution (i.e., a forum), such as Google Groups [@https://groups.google.com/forum/m], is better suited. GitHub also allows issues to be opened, where collaborators or users can inform developers about bugs or ask questions. Additionally, GitHub recently introduced discussions [@https://docs.github.com/en/discussions] to keep questions organized in different threads.
Goal | Tool options | Additional remarks |
---|---|---|
Tell users how to contact you | • Specific/shorter questions: Gitter [@https://gitter.im]. • Larger issues / how-to's: Google Groups [@https://groups.google.com/forum/m], GitHub Discussions [@https://docs.github.com/en/discussions]. | Provide ways for users to contact you for questions, requests, etc. Remember to visit them periodically! |
Track to-do's in your research | • GitHub Issues [@https://guides.github.com/features/issues] | Detail specific pending to-do's in your research / allow others to request changes and/or highlight bugs. |
Encourage user contributions | • Contribution guidelines [@https://docs.github.com/github/building-a-strong-community/setting-guidelines-for-repository-contributors]: How to open issues / contribute code. • GitHub Wikis [@https://docs.github.com/github/building-a-strong-community/about-wikis]: More specific how-to guides. | Provide as much information as you can to guide your users. You can also include administrator guidelines. |
Foster a respectful community | • Smaller projects: Contributor Covenant [@https://www.contributor-covenant.org]. • Larger projects: Citizen Code of Conduct [@https://github.com/stumpsyn/policies/blob/master/citizen_code_of_conduct.md]. | Essential when you would like researchers to contribute code. |
Branch your repo sustainably | • Gitflow [@https://nvie.com/posts/a-successful-git-branching-model] | Useful when several developers contribute code to the project. Allows users to get access to stable versions of your research in an ongoing project. |
Keep track of your issues | • Kanban flowcharts [@https://www.atlassian.com/agile/kanban]: GitHub Projects [@https://github.com/features/project-management], GitKraken Boards [@https://www.gitkraken.com/boards]. • Scrum practices [@https://www.scrum.org/resources/what-is-scrum]: Zenhub [@https://www.zenhub.com], Jira [@https://www.atlassian.com/software/jira]. | Keep track of your pending tasks in different projects with Agile [@https://agilemanifesto.org] software development practices. Especially useful if your research is split across many different repositories, each with multiple features/fixes to do. |
Automate your repo | • bump2version [@https://github.com/c4urself/bump2version]: Easier releasing. • Danger-CI [@https://danger.systems/ruby/]: Easier reviewing. | Do less, script more! |
Table: Tools for making your research sustainable. {#tbl:community-tools-3}
Now that users know where to contact you, ensure you have developed contribution guidelines [@https://docs.github.com/github/building-a-strong-community/setting-guidelines-for-repository-contributors] (Table @tbl:community-tools-3), detailing how users should (1) open issues and (2) contribute their own code changes via pull requests (PRs). These guidelines are intended for new users/contributors, so they should be written in the style of a how-to guide; however, they may also include additional instructions for the main developers or the administrator of the repository. Alternatively, the detailed guidelines can be included in a supplemental wiki, which hosting services offer as part of the repository [@https://docs.github.com/github/building-a-strong-community/about-wikis]. Equally important is a code of conduct (Table @tbl:community-tools-3), which sets out expectations on how users should behave in the repository and the consequences when someone does not comply, promoting a respectful community. Several code-of-conduct templates exist, such as the Contributor Covenant [@https://www.contributor-covenant.org] for smaller projects and the Citizen Code of Conduct [@https://github.com/stumpsyn/policies/blob/master/citizen_code_of_conduct.md] for larger projects.
Finally, consistent development and maintenance of your software as it grows in scope and number of users will ensure the sustainability of your project. Tools that aid in this include:
- Branching System: When many developers are involved in a project, more advanced branching methods, such as GitFlow [@https://nvie.com/posts/a-successful-git-branching-model], ensure that users can access functional versions of your code while you work on it (Table @tbl:community-tools-3). Briefly, GitFlow includes two branches with an infinite lifetime: the main branch and the development branch (often named devel). New branches will be based on the development branch, leaving the main one for stable versions of the code. Every time the development branch is merged into the main branch, a version release is created.
- Project Management: Tools exist to track, organize, and prioritize user issues (Table @tbl:community-tools-3), all based on Agile [@https://agilemanifesto.org] principles. The simplest approach is implementing a Kanban board [@https://www.atlassian.com/agile/kanban] (as found in GitHub Projects [@https://github.com/features/project-management] or GitKraken Boards [@https://www.gitkraken.com/boards]), where issues are organized in columns that clearly lay out the current state of a given task. For larger projects comprising multiple collaborators and/or repositories, a more structured approach, such as a Scrum framework [@https://www.scrum.org/resources/what-is-scrum] (as implemented by Zenhub [@https://www.zenhub.com] and Jira [@https://www.atlassian.com/software/jira]), allows you to prioritize issues by setting milestones and estimating difficulties.
- Additional Automation: As your project develops, you will find that many aspects can be automated to improve efficiency. bump2version [@https://github.com/c4urself/bump2version] ensures all sections of your code get updated with the new release. Danger-CI [@https://danger.systems/ruby/] and git hooks [@https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks] ensure contributors comply with certain standards in their pull requests. If you are no longer actively maintaining a project, you can use CI (e.g., GitHub Actions [@https://github.com/features/actions]) to schedule regular tests to discover if your tool/code starts malfunctioning due to software rot and/or dependency issues. Finally, we advise against implementing too many automation tools at the start of a project; add them as needed instead. If you find yourself routinely performing a task, consider automating it.
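As a sketch of the kind of check a scheduled CI job could run to detect software rot, the script below tries to import each module a project depends on and reports any failure. The function names `failed_imports` and `run_smoke_test` and the module list are hypothetical placeholders; the scheduling itself would be configured in the CI service and is not shown here.

```python
import importlib


def failed_imports(module_names):
    """Try to import each module; return {name: error message} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except Exception as error:  # any failure signals a broken environment
            failures[name] = f"{type(error).__name__}: {error}"
    return failures


def run_smoke_test(module_names) -> int:
    """Print a report and return 0 if all imports succeed, 1 otherwise."""
    failures = failed_imports(module_names)
    for name, message in failures.items():
        print(f"FAILED {name}: {message}")
    return 1 if failures else 0


if __name__ == "__main__":
    # Hypothetical list: replace with the modules your project depends on.
    exit_code = run_smoke_test(["json", "csv"])
    # In a CI job, pass this value to sys.exit() so a failure is visible.
    print("exit code:", exit_code)
```

An import check is only the cheapest possible test. In practice it would sit alongside a small run of the tool on example data, so that changes in dependency behavior are caught as well as outright installation failures.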