Differential gene expression (DGE) analysis is commonly used in functional genomics. Its main goal is to quantify changes in gene expression levels between experimental conditions or populations. Given the availability of NGS technologies, most DGE analyses are now based on RNA-seq data, and DGE is the primary application of that technology [@doi:10.1038/s41576-019-0150-2]. Here, we consider a study designed to gain insight into a specific condition of interest and to identify genes of interest that can afterward be functionally characterized in animal models. The experimental setup included a control group and an experimental group with the condition of interest, both sequenced using RNA-seq. The experiment was conducted four times independently, yielding four replicates per group.
In this example, the most relevant part is personal research. The first step is to decide which programming languages to use. The bioinformatics analysis has two stages: a first stage performed on the command line, where raw reads undergo quality control and the number of reads per transcript is counted, followed by a second stage of data filtering and statistical analysis. The programming languages of choice are therefore Shell and R. The folder structure must have separate directories for raw data, results, documentation, and the scripts used in the analysis. We recommend cloning the RR-init template onto your HPC cluster. For this project, the best option is a Conda environment with R installed, into which you can install packages from Bioconductor and Bioconda. A shell script is a suitable option for the first part of the analysis on the HPC cluster, and RStudio for the second part, covering analysis and visualization. We advise following literate programming practices, especially when writing R code, and tracking changes using Git.
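The folder layout described above can be sketched in a few lines. The snippet below is a minimal illustration in Python, not the layout shipped by the RR-init template: the directory names in `PROJECT_DIRS` and the helper `init_project` are hypothetical and should be adapted to whichever template or convention you follow.

```python
from pathlib import Path

# Hypothetical directory names (assumption): separate places for raw data,
# results, documentation, and analysis scripts.
PROJECT_DIRS = ("data/raw", "results", "docs", "scripts")


def init_project(root):
    """Create the project skeleton under `root` and return the created paths."""
    root = Path(root)
    created = []
    for name in PROJECT_DIRS:
        path = root / name
        # exist_ok=True makes the function safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `init_project("dge_project")` would create the four directories under `dge_project/`. Keeping raw data in its own directory makes it easier to treat it as read-only while results are regenerated from the scripts.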
Genome-scale metabolic models are a sub-group of constraint-based models (Case Study 3) in which the model represents the complete metabolism of a cell, inferred from genome sequencing. These models are significantly larger, which makes the model generation and curation steps hard to trace back without adequate tools. As a reference paper, consider the generation, curation, and validation of a genome-scale model for Parageobacillus thermoglucosidasius [@doi:10.1101/2021.02.01.429138], a thermophilic, facultatively anaerobic bacterium with promising traits for industrial metabolic engineering.
The first step in a project of this nature is to use one of the many reconstruction algorithms available [@doi:10.1186/s13059-019-1769-1] to obtain what is referred to as a draft reconstruction; the choice of programming language for that step will therefore depend on the selected algorithm. After that, there is a lengthy step of model curation and gap filling, with the aim of ending up with a model that can produce all necessary building blocks of the cell. For this step, we recommend a basic setup of Python as the programming language and Conda as the environment manager, due to their ease of use and the growing number of Python packages being developed in the field. Additionally, we advise using Jupyter Notebook as the main working setup and Git for version control, as you can use different notebooks (or different versions of a notebook) as logs of the analyses performed on your working draft. When collaborating on a genome-scale model, two tools are especially useful: unit testing, to ensure the model maintains a certain quality as you and others develop it [@doi:10.1038/s41587-020-0446-y], and ReviewNB, to keep track of changes in notebooks across commits and/or branches. Finally, when sharing the model with the community, Zenodo is a great option for defining different versions of the model, and any issue tracker will make it possible for users to pinpoint mathematical or biological inconsistencies in the network.
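As an illustration of what a unit test for a metabolic model can look like, the sketch below checks that every reaction is mass-balanced. It is a toy example under stated assumptions, not the API of any particular model-testing suite and not the reference model itself: the dictionaries `FORMULAS` and `REACTIONS`, the identifiers in them, and the helper functions are hypothetical stand-ins for data that would normally be read from the model file.

```python
import re
import unittest
from collections import Counter

# Toy data (assumption): chemical formulas for the charged forms of each
# metabolite, keyed by illustrative identifiers.
FORMULAS = {
    "glc__D_c": "C6H12O6",
    "atp_c": "C10H12N5O13P3",
    "g6p_c": "C6H11O9P",
    "adp_c": "C10H12N5O10P2",
    "h_c": "H",
}

# Stoichiometry of one reaction: negative coefficients are substrates,
# positive coefficients are products.
REACTIONS = {
    "HEX1": {"glc__D_c": -1, "atp_c": -1, "g6p_c": 1, "adp_c": 1, "h_c": 1},
}


def parse_formula(formula):
    """Count the atoms of each element in a simple formula such as 'C6H12O6'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def element_imbalance(stoichiometry, formulas):
    """Return the net count per element; an empty dict means mass-balanced."""
    net = Counter()
    for metabolite, coefficient in stoichiometry.items():
        for element, n in parse_formula(formulas[metabolite]).items():
            net[element] += coefficient * n
    return {element: n for element, n in net.items() if n != 0}


class TestMassBalance(unittest.TestCase):
    def test_all_reactions_are_mass_balanced(self):
        for reaction_id, stoichiometry in REACTIONS.items():
            # subTest reports each failing reaction separately.
            with self.subTest(reaction=reaction_id):
                self.assertEqual(element_imbalance(stoichiometry, FORMULAS), {})
```

Saved in a test module, this can be run with `python -m unittest` locally or in continuous integration, so that a curation step that accidentally unbalances a reaction is flagged before it is merged.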