Assembling the SARS-CoV-2 Genome

The main purpose of this repository is to demonstrate the steps involved in de novo assembly of a viral genome, namely SARS-CoV-2.

Sequencing reads used for the pipeline were retrieved from accession SRR10971381 as provided by Wu et al. 2020. The steps involved in the assembly pipeline are listed as follows:

Data retrieval
Exploration of sequence properties
Trimming of low-scoring bases
Quality contol
Genome assembly
Contig visualization

Each task in the pipeline can be executed as a separate bash script. The order of execution is orchestrated using GNU Make, as provided by a Makefile in the root directory. A custom conda environment is used to manage dependencies.

Prerequisites

The first step in reproducing the analysis is to clone the repository in your local machine:

# Locally clone repository
git clone https://github.com/dagsdags212/learn-genome-assembly.git

# Change to the project directory
cd learn-genome-assembly

Next, install micromamba as instructed here and configure with:

micromamba config append channels bioconda     # source of many bioinformatics-specific tools
micromamba config append channels conda-forge  # for other python packages such as numpy and pandas
micromamba config set channel_priority strict

Then create a new environment from the env.yml file:

micromamba create -f env.yml

We will be using fastq-dump for retrieving sequencing reads from the Sequence Read Archive. Unfortunately, the bioconda channel does not always provide the latest version of the toolkit. Using an outdated version may result in errors. This issue is easily solved by downloading the toolkit from the official source.

Download the latest version of the SRA toolkit (here)[https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit] and copy the binaries to the environment:

# Define the default environment file.
ENV=${MAMBA_ROOT_PREFIX}/envs/assembly

# Download sratoolkit from source.
wget --output-document sratoolkit.tar.gz -t 3 https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz

# Uncompress tarfile.
tar xzvf sratoolkit.tar.gz

# Copy executables to the bin directory of the environment.
cp -r sratoolkit*/bin/* ${ENV}/bin

# Delete the original download.
rm -rf sratoolkit*

As of writing, the current version of fastq-dump is 3.1.1. Verify that you have the same version by running:

fastq-dump --version

Activate the environment with:

micromamba activate assembly

Running the Pipeline

Explore available commands by running:

make usage

# or simply
make

Run the assembly pipeline by invoking the following command:

make assemble

Fetching data from the SRA can take a long time depending on your network speed. Additionally, the assembly process make take up a significant portion of your machine's computing resources. To run the pipeline on a smaller subset of the SRA data, pass in the number of reads using the N parameter. You can also specify the amount of cores to be alloted for the assembly through the CPU parameter.

# Run assembly with 100000 reads and 8 cores
make assemble N=100000 CPU=8

After identifying the id of the longest generated contig, run a BLAST query to identify similar sequences and their respective hosts:

make blast

Run the entire pipeline using:

make all

# Do some cleanup afterwards to save disk space
make clean

Other commands include:

# Gather sequence statistics
make stats

# Consolidate fastQC reports into a interactive summary file.
make multiqc

# Visualize assembled contigs using Bandage
make visualize

For any questions or suggestions, feel free to contact me at jegsamson.dev@gmail.com.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
metadata		metadata
scripts		scripts
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
env.yml		env.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Assembling the SARS-CoV-2 Genome

Prerequisites

Running the Pipeline

About

Releases

Packages

Languages

dagsdags212/learn-genome-assembly

Folders and files

Latest commit

History

Repository files navigation

Assembling the SARS-CoV-2 Genome

Prerequisites

Running the Pipeline

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages