Nextstrain analyses with hCoV-19 sequences in the SRA.

What is this workflow?

Labs around the world are sequencing coronavirus (hCoV-19) to monitor how it spreads and evolves. However, most sequences are not immediately publicly available. Here we provide a workflow to identify publicly available hCoV-19 sequences that have been uploaded to the NCBI SRA database. We map SRA reads to the hCoV-19 Wuhan reference genome, identify variable sites, create a consensus FASTA sequence, and analyze these sequences with Nextstrain.

Workflow to analyze publicly available hCoV-19 SRA reads with Nextstrain

Identify hCoV-19 sequences

using the NCBI SRA Taxonomy Analysis Tool (STAT)
using whole genome assemblies here

Extract metadata to match Nextstrain input
Download SRA data
Run Nextstrain for analyses and visualization

1. Identify hCoV-19 sequences

The NCBI SRA Taxonomy Analysis Tool (STAT) can identify SRA runs with kmers matching hCoV-19. We identified runs containing >1000 matching kmers. Alternatively, there are full genome assemblies available for hCoV-19 that can be used. Assemblies can be run without generating alignments and consensus sequences (see below).
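The kmer threshold described above can be sketched as a small filter. This is only an illustration: the two-column table layout (`run`, `kmer_count`) and the accessions are hypothetical and do not reflect the real STAT output format; only the >1000 cutoff comes from the text.

```python
import csv
import io

# Threshold stated in the text: keep runs with more than 1000 matching kmers.
MIN_KMERS = 1000


def select_runs(stat_tsv: str, min_kmers: int = MIN_KMERS) -> list[str]:
    """Return run accessions whose matching-kmer count exceeds min_kmers.

    Assumes a tab-separated table with hypothetical columns 'run' and
    'kmer_count'; adapt the column names to the actual STAT export.
    """
    reader = csv.DictReader(io.StringIO(stat_tsv), delimiter="\t")
    return [row["run"] for row in reader if int(row["kmer_count"]) > min_kmers]


# Made-up example data, for illustration only.
example = (
    "run\tkmer_count\n"
    "SRR0000001\t25000\n"
    "SRR0000002\t12\n"
    "SRR0000003\t1000\n"
)
selected = select_runs(example)
print(selected)
```

Note that the comparison is strict, so a run with exactly 1000 matching kmers is excluded.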

2. Extract metadata to match Nextstrain input

We created a Snakemake workflow that creates metadata.tsv, which contains metadata for the SRA runs identified with STAT.
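As a rough sketch of what this step produces, the snippet below writes a tab-separated metadata table from per-run records. The column set (`strain`, `date`, `country`, `sra_run`), the input field names, and the `?` placeholder for missing values are assumptions made for illustration; the columns the workflow actually writes may differ, so check them against the Nextstrain documentation.

```python
import csv
import io

# Assumed subset of metadata columns; not taken from the workflow itself.
COLUMNS = ["strain", "date", "country", "sra_run"]


def to_metadata_row(run: dict) -> dict:
    """Map one hypothetical SRA run record onto the assumed columns."""
    location = run.get("geo_loc_name") or "?"
    return {
        "strain": run["sample_name"],
        "date": run.get("collection_date") or "?",
        # Keep only the country part of values such as "USA: Washington".
        "country": location.split(":")[0].strip() or "?",
        "sra_run": run["run"],
    }


def write_metadata(runs, handle) -> None:
    """Write the records as a TSV with a header line."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for run in runs:
        writer.writerow(to_metadata_row(run))


# Made-up record, for illustration only.
records = [
    {
        "run": "SRR0000001",
        "sample_name": "sampleA",
        "collection_date": "2020-03-01",
        "geo_loc_name": "USA: Washington",
    }
]
buffer = io.StringIO()
write_metadata(records, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("metadata.tsv", "w", newline="")` as the handle.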

3. Download SRA data and generate consensus sequences for each run

The Snakemake workflow also generates sequences.fasta by

downloading SRA reads
mapping them to the Wuhan reference genome (hisat2)
identifying variable sites
generating consensus sequences (bcftools)
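The steps above can be sketched as a sequence of shell commands assembled per run. This is not the workflow's actual rule set: the file names, the use of `fasterq-dump` and `samtools`, and the single-end read layout are assumptions. Only hisat2 and bcftools are named in the text. The function only builds the command strings and does not execute anything.

```python
def consensus_commands(
    run: str,
    reference: str = "reference.fasta",  # hypothetical path to the Wuhan reference
    index: str = "ref_index",            # hypothetical hisat2 index prefix
) -> list[str]:
    """Return shell commands that go from an SRA run to a consensus FASTA.

    Assumes single-end reads, so fasterq-dump writes <run>.fastq.
    """
    sam = f"{run}.sam"
    bam = f"{run}.sorted.bam"
    vcf = f"{run}.vcf.gz"
    return [
        # 1. Download the reads from the SRA.
        f"fasterq-dump {run}",
        # 2. Index the reference and map the reads to it.
        f"hisat2-build {reference} {index}",
        f"hisat2 -x {index} -U {run}.fastq -S {sam}",
        f"samtools sort -o {bam} {sam}",
        f"samtools index {bam}",
        # 3. Identify variable sites.
        f"bcftools mpileup -Ou -f {reference} {bam} | bcftools call -mv -Oz -o {vcf}",
        f"bcftools index {vcf}",
        # 4. Apply the called variants to the reference to get a consensus.
        f"bcftools consensus -f {reference} -o {run}.fasta {vcf}",
    ]


commands = consensus_commands("SRR0000001")  # made-up accession
for command in commands:
    print(command)
```

Building the commands as plain strings keeps the sketch easy to inspect; in a Snakemake workflow each of these would more naturally be a separate rule with declared inputs and outputs.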

4. Run Nextstrain for analyses and visualization

Finally, we add metadata.tsv and sequences.fasta and run snakemake -p to produce Nextstrain results, which can be viewed with

auspice view --datasetDir auspice

Team

Our fearless leader: Vadim Zalunin (vadimzalunin)
Alison Schaefer (amaiellu)
Joe McGirr (joemcgirr)

Forthcoming features

We are automating the process of wrangling metadata for full genome assemblies with ncov_assemblies.sh

Dependencies

See Dockerfile

NCBI-Codeathons/Automating-tools-to-search-and-analyze-large-genome-sequence-repositories
