Level 2: Collaboration

Collaboration is a key aspect of scientific research, and it is especially relevant in computational biology, where interdisciplinary knowledge is often needed. Although collaborators can have a wide range of involvement with your project, here we will consider individuals who share a direct relationship with you and your research. Each type of collaboration requires its own set of good practices, which are covered in the following sections.

2.1 Share code

Sharing code is one of the most common practices in software development, where large teams work together to develop complex functions and scripts. Although computational biology projects are usually not as big, sharing code properly is still essential. Hosting services, such as GitHub [@https://github.com], GitLab [@https://gitlab.com], and Bitbucket [@https://bitbucket.org] (Table @tbl:collaboration-tools), allow a Git repository to be stored online by creating a copy of the repository known as the remote, which becomes the official version of the repository. The key advantage of using a remote is that there is no direct interaction between different local copies of the repository, also known as clones; instead, each clone interacts exclusively with the remote, updating it only if no conflicts between the two exist. This way, if a collaborator updates the remote repository, other collaborators will not be able to send their changes until they update their local copy.

| Goal | Tools |
|:-----|:------|
| Share code | Hosting services: GitHub [@https://github.com], GitLab [@https://gitlab.com], Bitbucket [@https://bitbucket.org]. |
| | Git branching strategies: GitHub flow [@https://guides.github.com/introduction/flow/]. |
| | Tests: correctness (e.g. pytest [@https://docs.pytest.org/en/stable/], testthat [@https://testthat.r-lib.org/]), style (e.g. flake8 [@https://flake8.pycqa.org/en/latest/]), vulnerabilities (e.g. Safety [@https://pyup.io/safety/]), coverage (e.g. Codecov [@https://about.codecov.io/]). |
| | Continuous integration: tox [@https://tox.readthedocs.io/en/latest/], Travis CI [@https://travis-ci.com/], Circle CI [@https://circleci.com/], GitHub Actions [@https://github.com/features/actions]. |
| | Code reviews: GitHub [@https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/reviewing-changes-in-pull-requests], Crucible [@https://www.atlassian.com/software/crucible], Upsource [@https://www.jetbrains.com/upsource/]. |
| Share data | FAIR principles [@doi:10.1038/sdata.2016.18]: FAIRshake [@https://fairshake.cloud/]. |
| | Tidy data [@doi:10.18637/jss.v059.i10]. |
| | Data version control [@https://dvc.org/]. |
| Share data science notebooks | Static: GitHub [@https://github.com], GitLab [@https://gitlab.com], NBviewer [@https://nbviewer.jupyter.org/]. |
| | Interactive: Binder [@https://mybinder.org], Google CoLab [@https://colab.research.google.com/]. |
| | Comparative: nbdime [@https://nbdime.readthedocs.io/en/latest/], ReviewNB [@https://www.reviewnb.com/]. |
| Share workflows | General hosting services: GitHub [@https://github.com], GitLab [@https://gitlab.com], Bitbucket [@https://bitbucket.org]. |
| | Dedicated workflow repositories: Snakemake Workflow Catalog [@https://snakemake.github.io/snakemake-workflow-catalog/], WorkflowHub [@https://workflowhub.eu/]. |
| Share manuscripts | General-purpose word processors: Google Docs [@https://www.google.com/docs/about/], Office 365 [@https://www.microsoft.com/en-us/microsoft-365]. |
| | Scholarly word processors: Authorea [@https://www.authorea.com/]. |
| | Online applications supporting markup languages: Overleaf (LaTeX) [@https://www.overleaf.com/], Manubot (Markdown + GitHub) [@https://manubot.org/]. |

Table: Tools for collaborative research. {#tbl:collaboration-tools}

To guarantee that different collaborators can work simultaneously in the same repository, it is best to implement a branching strategy in the repository (Table @tbl:collaboration-tools). In a small team, the most common strategy is to have a single main branch from which each developer creates their own branch to work on. Then, whenever the developer is ready, they can request to combine (or merge) the changes from their branch into the main branch. This request is known as a pull request (PR). Once a PR has been opened, collaborators can review, approve, and subsequently merge it into the main branch, preserving the commit history. This branching strategy is sometimes referred to as GitHub Flow [@https://guides.github.com/introduction/flow/] and will suffice for most projects. For more complex branching systems, see Level 3.

Using Git hosting services for collaboration has many additional benefits. The commit history not only shows what was done at each point in time but also specifies the collaborator who made the changes; this allows users to take responsibility for their changes so that if, for example, a bug was introduced, commands such as git blame can pinpoint the cause. To ensure bugs can be easily tracked, descriptive commit messages that follow a standard are recommended [@https://sparkbox.com/foundry/semantic_commit_messages;@https://www.conventionalcommits.org/en/v1.0.0/]. Git hosting services can be accessed interactively online or from the terminal with tools such as GitHub CLI [@https://cli.github.com/]. Finally, Git hosting services also allow collaborators to open issues [@https://docs.github.com/en/github/managing-your-work-on-github/about-issues] for listing pending tasks and/or asking questions. Issues act as an open forum for development discussions and have the advantage of remaining accessible in the future (as opposed to closed email discussions).
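As one illustration of what following a commit message standard can look like in practice, the sketch below checks whether the first line of a commit message has the `type(scope): description` shape described by Conventional Commits. The function name `is_conventional`, the list of accepted types, and the regular expression are assumptions made for this example; they are not part of any official tool, and a team would agree on its own list of types.

```python
import re

# Commit types accepted in this sketch (an assumption; each team picks its own list).
ALLOWED_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")

# Shape checked: a type, an optional "(scope)", an optional "!",
# then a colon, a space, and a non-empty description.
PATTERN = re.compile(
    r"^(?P<type>[a-z]+)(\((?P<scope>[^()\s]+)\))?(?P<breaking>!)?: (?P<description>\S.*)$"
)


def is_conventional(subject: str) -> bool:
    """Return True if the first line of a commit message matches the pattern."""
    match = PATTERN.match(subject)
    return bool(match) and match.group("type") in ALLOWED_TYPES


if __name__ == "__main__":
    # Hypothetical commit subjects, one following the convention and one not.
    for subject in ("fix(parser): handle empty FASTA records", "fixed stuff"):
        print(subject, "->", is_conventional(subject))
```

A check of this kind could be run by hand before committing or wired into an automated step; either way, the value comes from the team agreeing on the convention rather than from the script itself.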

Another important practice when developing code, especially with other collaborators, is to develop tests, meaning scripts that are run to find errors in the code (Table @tbl:collaboration-tools). Tests can be executed at different levels, from the individual units/components to the system/software as a whole [@https://www.geeksforgeeks.org/types-software-testing/]. Unit tests, in particular, are used to determine if specific modules/functions work as intended within the codebase so that if the function later grows in scope, its proper basic functioning is ensured. For instance, if a function were defined for adding numbers, a simple test would be to check that the function outputs 13 when the inputs 6 and 7 are provided. Besides unit tests, computational biology projects can benefit from implementing integration tests to evaluate the correct interaction between different modules and smoke tests to indicate if any core functionality has been impacted. Test runners, such as pytest [@https://docs.pytest.org/en/stable/] for Python and testthat [@https://testthat.r-lib.org/] for R, exist to facilitate incorporating tests into the codebase. It is good practice to develop tests at the same time you develop code, as adding tests a posteriori is significantly harder. It is an even better practice to test every single step of the code (from data loading to figure plotting), a concept known in software development as end-to-end testing [@https://smartbear.com/solutions/end-to-end-testing/].
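The addition example described above can be written as a short pytest-style test file. This is a minimal sketch: the function `add` and the test names are hypothetical, and in a real project the function would live in its own module and be imported by the test file.

```python
def add(a, b):
    """Return the sum of two numbers."""
    return a + b


def test_add_six_and_seven():
    # The case described in the text: inputs 6 and 7 should give 13.
    assert add(6, 7) == 13


def test_add_is_commutative():
    # An extra check, added for illustration: argument order should not matter.
    assert add(2, 5) == add(5, 2)
```

Saved in a file such as `test_add.py`, these functions would be collected by pytest's default discovery rules (files named `test_*.py`, functions prefixed with `test_`) when `pytest` is run from the project folder.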

Going beyond testing correctness, flake8 [@https://flake8.pycqa.org/en/latest/] will test styling preferences (for complying with PEP8), Safety [@https://pyup.io/safety/] will test for vulnerabilities among the software's dependencies, and Codecov [@https://about.codecov.io/] will test coverage, or the percentage of the codebase tested. As a rule of thumb for testing coverage, the more lines of code tested, the more reliable the software will be. Different types of tests can be funneled into a single testing pipeline—in a process known as continuous integration (CI)—that can be tuned to run locally whenever commits are made, or online whenever a pull request is opened and/or merged. When running locally, an environment manager/command-line tool, such as tox [@https://tox.readthedocs.io/en/latest/], can help to ensure all tests are executed under different Python versions. Different tools, such as Travis CI [@https://travis-ci.com/] or Circle CI [@https://circleci.com/], can be used to set up the CI cycle online. More recently, GitHub Actions [@https://github.com/features/actions] was developed to run integrations directly from GitHub.

Having tests is a great way to ensure that code fulfills a certain level of correctness and styling. However, it is no replacement for human assessment to determine if the code is correct, necessary, and useful. Therefore, peer code review is essential whenever developing code in collaboration (Table @tbl:collaboration-tools). While tools, such as Crucible [@https://www.atlassian.com/software/crucible] and Upsource [@https://www.jetbrains.com/upsource/], exist for making in-line reviews of each file, the most common approach is for you and/or others to directly review the code using the online review tools provided by various hosting services. In the case of GitHub [@https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/reviewing-changes-in-pull-requests], this allows the reviewer not only to comment on any line of the code, which creates a thread where the original author can reply, but also to suggest changes that can be approved or dismissed. Reviewers can assess many features of the code, from functionality to documentation, while also following good reviewing practices, such as using constructive phrasing. These practices are outside the scope of this review but are presented in detail elsewhere [@https://google.github.io/eng-practices/review/reviewer;@https://phauer.com/2018/code-review-guidelines].

2.2 Share data

The practices of sharing data are similar to sharing code: we should store our datasets, and any changes to them, in a repository and ensure it complies with standards by testing its quality. However, since data has a more consistent structure than code, often existing in standard formats, we should consider additional criteria when sharing it with collaborators (and later with the community). The main set of guidelines that represent these criteria was outlined in what is known as the FAIR principles [@doi:10.1038/sdata.2016.18]: data should be Findable (easy to locate online); Accessible (easy to access once found); Interoperable (easy to integrate with other data/applications/workflows/etc); and Reusable (presented in a way that allows for others to use it for the same or different purposes). Tools like FAIRshake [@https://fairshake.cloud/] can be used to determine if data fits FAIR criteria.

For making data findable, research repositories such as Zenodo [@https://zenodo.org] and Figshare [@https://figshare.com/about] allow you to assign a digital object identifier (DOI) to any group of files you upload, including data and/or code. Alternatively, regular code repositories like GitHub can be used, as you can employ commits and/or releases to identify specific versions of the data. For data files larger than 100 MB [@https://docs.github.com/en/github/managing-large-files/what-is-my-disk-quota], this can be combined with extensions for Large File Storage (LFS), such as Git LFS [@https://git-lfs.github.com/]. GitHub can also integrate with Zenodo to automatically archive repositories and assign them a DOI. A final alternative is the Data Version Control (DVC) initiative [@https://dvc.org/], which is especially useful when performing machine learning, as it can keep track of data, machine learning models, and even scoring metrics.

For making data accessible, we encourage you to make your repositories open access whenever possible. In cases in which you or your collaborators prefer some restrictions, you can create guest accounts to provide access to private repositories. For making data interoperable, distinctions between raw and clean data have been made [@doi:10.1371/journal.pcbi.1000424], with raw data being the files that came out of the measuring device, and clean data representing the files that are ready to be used for any computational analysis. An important characteristic of clean data is that it should be tidy, a concept reviewed in detail elsewhere [@doi:10.18637/jss.v059.i10]. Finally, for making data reusable, thorough documentation of the data is required, including experimental design, measurement units, and possible sources of error.
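To make the notion of tidy data more concrete, the sketch below reshapes a small "wide" table (one column per sample) into a tidy layout, in which each variable is a column and each observation is a row. The gene names, sample names, values, and the helper `to_tidy` are all hypothetical and only serve to illustrate the idea using the Python standard library.

```python
# Hypothetical "wide" table: one row per gene, one column per sample.
wide_rows = [
    {"gene": "geneA", "sample_1": 10.0, "sample_2": 12.5},
    {"gene": "geneB", "sample_1": 3.2, "sample_2": 2.9},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows so that each output row holds a single observation."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row instead
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy


tidy_rows = to_tidy(wide_rows, "gene", "sample", "expression")
for record in tidy_rows:
    print(record)
```

In the tidy version, the sample identity becomes a value in its own column rather than being encoded in column headers, which tends to make filtering, grouping, and plotting more uniform across tools.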

2.3 Share data science notebooks

Jupyter Notebooks have become a fundamental tool for data analysis, which can be shared with collaborators using either static or interactive options. The former shares computational notebooks as rendered text, written internally in HTML. Static notebooks are a good option when you want to avoid any modifications and can work as an archive of past analyses, although interacting with their content is cumbersome: the file must be downloaded and run in a local Jupyter installation. Git-based code repositories, such as GitHub [@https://github.com] and GitLab [@https://gitlab.com], automatically render notebooks that can be later shared using the repository's URL. To facilitate this process, Project Jupyter provides a web application called NBviewer [@https://nbviewer.org], where you can paste the URL of a Jupyter Notebook, publicly hosted on GitHub or elsewhere, and have the file rendered into a static HTML web page with a stable link.

Interactive notebooks, on the other hand, not only render the file but also allow collaborators to fully interact with it, tinkering with parameters or trying new input data, with no installation required. The Binder Project [@https://mybinder.org] enables users to fully interact with any notebook within a publicly hosted Git-based repository via a Jupyter Notebook interface, although changes will not be saved to the original file. The platform supports Python and R, among other languages, and additional packages required to run the analysis need to be specified in a configuration file within the repository. Similarly, Jupyter Notebooks can be run interactively using Google CoLab [@https://colab.research.google.com/] by anyone with a Google account. Notebooks can be uploaded from a local machine, from any public GitHub repository, or from Google Drive. As an added bonus, Google CoLab notebooks can be edited by multiple developers in real time. In both cases, the machines provided by these services are comparable to a modern laptop; hence, these tools may not be suitable for computing-intensive tasks.

Notebooks should be treated like any other piece of code: updates from different collaborators should be managed with version control in a platform such as GitHub. The problem, however, is that Git and other version control systems use line-based differences that are not very well suited for the internal JSON representation of Jupyter notebooks. The extension nbdime [@https://nbdime.readthedocs.io/en/latest/] can be installed locally to enable content-aware diffing and merging. Additionally, ReviewNB [@https://www.reviewnb.com/] can be integrated with GitHub to enable content-aware diffing, displaying the old and new versions of a notebook in parallel to facilitate code review.
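The mismatch between line-based diffs and the JSON notebook format can be seen with a small experiment. The sketch below builds two minimal notebook-like dictionaries that contain the same code but differ in their execution counter, as happens when a notebook is simply re-run. A line-based diff of the serialized JSON reports changes, while a comparison restricted to the code cells reports none. The helpers `make_notebook` and `code_sources` are hypothetical, and this is not how nbdime works internally; it only illustrates why a content-aware comparison is useful.

```python
import difflib
import json


def make_notebook(source_lines, execution_count, output_text):
    """Build a minimal notebook-like dict with one code cell (illustrative only)."""
    return {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {"output_type": "stream", "name": "stdout", "text": [output_text]}
                ],
                "source": source_lines,
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }


def code_sources(notebook):
    """Return the source lines of every code cell, ignoring outputs and counters."""
    lines = []
    for cell in notebook["cells"]:
        if cell["cell_type"] == "code":
            lines.extend(cell["source"])
    return lines


code = ["x = 6 + 7\n", "print(x)"]
old = make_notebook(code, execution_count=1, output_text="13\n")
rerun = make_notebook(code, execution_count=7, output_text="13\n")

# Line-based diff of the raw JSON text, similar in spirit to a plain text diff.
raw_old = json.dumps(old, indent=1, sort_keys=True).splitlines()
raw_new = json.dumps(rerun, indent=1, sort_keys=True).splitlines()
raw_diff = list(difflib.unified_diff(raw_old, raw_new, lineterm=""))

# Diff restricted to what a reviewer usually cares about: the code itself.
source_diff = list(
    difflib.unified_diff(code_sources(old), code_sources(rerun), lineterm="")
)

print("raw JSON diff reports changes:", bool(raw_diff))
print("code-only diff reports changes:", bool(source_diff))
```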

2.4 Share computational workflows

Computational biology projects often demand multi-step analyses that rely on dozens of third-party software tools and dependencies. Although these steps can be described in the documentation, complex workflows are better shared as stand-alone code that collaborators can run easily with minimal file manipulation. Doing so can safeguard the reproducibility and replicability of the analysis, leading to better science and fewer challenges downstream.

The simplest way to share a pipeline is through a shell script that receives input files via the command line, allowing flexibility to run analyses with different input data; however, shell scripts offer little control over the overall workflow and cannot re-run specific parts of the pipeline. To address these issues, pipelines are better shared using a workflow automation system. Theoretically, all of the instructions regarding the workflow could be written in the main pipeline file: in Snakemake, this would be the .smk file (or Snakefile); in Nextflow, the .nf file; in CWL, the .cwl file; and in WDL, the .wdl file. However, to ensure reproducibility, it is a good practice to share complete pipelines, meaning folder structures, additional files, and software specifications, as well as custom scripts developed for the analysis. These files can be shared using the same tools as code, namely GitHub or any other Git hosting service. Alternatively, they can be uploaded to hosting services specialized in workflows, like Snakemake Workflow Catalog [@https://snakemake.github.io/snakemake-workflow-catalog/] or WorkflowHub [@https://workflowhub.eu/], currently in beta.

When sharing workflows, consider that sharing software versions is necessary for your collaborators to reproduce your pipeline using their own computing setup. Conda environments, for example, can be easily created from an environment file (in YAML format), which can be exported from an existing environment. Notably, Snakemake and Nextflow can be configured to automatically build isolated environments for each rule or step, enabling different versions of a program to run within the same pipeline, which is especially helpful when, for example, both Python 2 and 3 are needed in the same pipeline. In addition to sharing the specifications of an environment, it is possible to share the environment itself via containers, which we will discuss in Level 3.

2.5 Write manuscripts collaboratively

Writing articles is the primary way we share our research with the scientific community at large. However, writing manuscripts collaboratively comes with its challenges when using classical word processing tools, often resulting in files with different names, jumping from one email inbox to another, and contradictory final versions. The tools we suggest will help to avoid these issues. Companies have become aware of the need for collaborative writing, developing online applications that can be simultaneously edited by multiple people. Google Docs [@https://www.google.com/docs/about/] and Microsoft Office 365 [@https://www.microsoft.com/en-us/microsoft-365] are well-known word processors designed for this purpose, with text displayed as it would appear as a printout (known as What-You-See-Is-What-You-Get, or WYSIWYG) and formatting performed using internal features of the application. These platforms are extremely user-friendly and require no specialized knowledge, making them a good option when collaborators seek simplicity. Although these applications are not specifically tailored for scientific writing, third-party companies have developed plugins enabling useful features, such as adding scientific references to your document (e.g., Paperpile and Zotero). Companies like Authorea [@https://www.authorea.com/] have developed online applications specifically designed for writing manuscripts that offer templates for different types of research projects and allow easy reference additions using identifiers (DOI, PubMed, etc.).

In addition to word processors, text editors are a viable option to write manuscripts when combined with a markup language, that is, a human-readable computer language that uses tags to delineate formatting elements in a document that will be later rendered. Since the formatting process is internally handled by the application, styling elements (e.g., headers, text formatting, and equations) are easily written in text, achieving greater consistency than word processors. Disciplines closely related to computational biology, such as statistics and mathematics, have historically used the markup language LaTeX for writing articles. This language has a simple and specific syntax for mathematical constructs, making it a popular choice for papers with many equations. To aid collaborative writing, platforms like Overleaf [@https://www.overleaf.com/] provide online LaTeX editors, supporting features like real-time editing. In addition to LaTeX, an emerging trend in collaborative writing uses the lightweight markup language Markdown within the GitHub infrastructure. The software Manubot [@https://manubot.org/] provides a set of functionalities to write scholarly articles within a GitHub repository, leveraging all the advantages of Git version control and the GitHub hosting platform [@doi:10.1371/journal.pcbi.1007128]. For example, it provides cloud storage and version control. The GitHub user interface also allows offline manuscript discussions using issues and task assignments (see Level 3 for tips on project management). Manubot, in particular, accepts citations using manuscript identifiers and automatically renders the article in PDF, HTML, and Word .docx formats. As a drawback, it requires technical expertise in Git and familiarity with GitHub; as an upside, its reliable infrastructure scales well to large and open collaborative projects. The document you are reading now was fully written using Manubot!