NCBI-Codeathons/Automating-tools-to-search-and-analyze-large-genome-sequence-repositories

Nextstrain analyses with hCoV-19 sequences in the SRA.

What is this workflow?

Labs around the world are sequencing coronavirus (hCoV-19) to monitor how it spreads and evolves. However, most sequences are not immediately publicly available. Here we provide a workflow to identify publicly available hCoV-19 sequences that have been uploaded to the NCBI SRA database. We map SRA reads to the hCoV-19 Wuhan reference genome, identify variable sites, create a consensus FASTA sequence for each run, and analyze these sequences with Nextstrain.

Workflow to analyze publicly available hCoV-19 SRA reads with Nextstrain

  1. Identify hCoV-19 sequences
  • using the NCBI SRA Taxonomy Analysis Tool (STAT)
  • using whole genome assemblies
  2. Extract metadata to match Nextstrain input
  3. Download SRA data and generate consensus sequences for each run
  4. Run Nextstrain for analyses and visualization

1. Identify hCoV-19 sequences

The NCBI SRA Taxonomy Analysis Tool (STAT) can identify SRA runs with kmers matching hCoV-19. We selected runs containing more than 1000 matching kmers. Alternatively, full genome assemblies available for hCoV-19 can be used. Assemblies can be analyzed directly, without generating alignments and consensus sequences (see below).
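The threshold step can be sketched in a few lines of Python. This is only an illustration: it assumes the STAT results have already been reduced to a mapping from run accession to matching kmer count, and the function name, accessions, and counts below are hypothetical rather than taken from this repository.

```python
from typing import Dict, List

# From the text above: runs with more than 1000 matching kmers are kept.
KMER_THRESHOLD = 1000


def select_runs(kmer_hits: Dict[str, int], threshold: int = KMER_THRESHOLD) -> List[str]:
    """Return the run accessions whose kmer count exceeds the threshold, sorted."""
    return sorted(run for run, count in kmer_hits.items() if count > threshold)


# Hypothetical accessions and counts, for illustration only.
example_hits = {"SRR0000001": 25000, "SRR0000002": 1000, "SRR0000003": 1001}
print(select_runs(example_hits))
```

The comparison is strict, so a run with exactly 1000 matching kmers is excluded.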

2. Extract metadata to match Nextstrain input

We created a Snakemake workflow that writes metadata.tsv, which contains metadata for the SRA runs identified with STAT.
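As a rough picture of what such a file looks like, the sketch below writes a tab-separated table with one row per run. It is not the repository's Snakemake rule. Nextstrain's tooling looks up sequences by a strain column and uses a date column; the other column names, the "?" placeholder for missing values, and the example record are assumptions made for this illustration.

```python
import csv
import io
from typing import Dict, Iterable, TextIO

# "strain" and "date" are columns Nextstrain reads by name.
# "country" and "sra_run" are assumed columns for this sketch.
COLUMNS = ["strain", "date", "country", "sra_run"]


def write_metadata(records: Iterable[Dict[str, str]], handle: TextIO) -> None:
    """Write one tab-separated row per record, filling missing fields with '?'."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for record in records:
        # "?" is a placeholder chosen for this sketch, not a repository convention.
        writer.writerow({col: record.get(col) or "?" for col in COLUMNS})


# Hypothetical record, for illustration only.
buffer = io.StringIO()
write_metadata([{"strain": "sampleA", "date": "2020-03-01", "sra_run": "SRR0000001"}], buffer)
print(buffer.getvalue(), end="")
```

Writing to any text handle keeps the function easy to check in memory before pointing it at a real metadata.tsv file.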

3. Download SRA data and generate consensus sequences for each run

The Snakemake workflow also generates sequences.fasta by

  • downloading SRA reads
  • mapping them to the Wuhan reference genome (hisat2)
  • identifying variable sites
  • generating consensus sequences (bcftools)
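To make the last step concrete, here is a toy Python sketch of what building a consensus means: substituting the called variant bases into the reference and writing the result as a FASTA record. It handles single-base substitutions only and does not reproduce what bcftools does in the real workflow. The function names, the short reference string, and the variant positions are all hypothetical.

```python
from typing import Dict


def apply_snps(reference: str, snps: Dict[int, str]) -> str:
    """Return the reference with each SNP (1-based position -> alternate base) applied."""
    bases = list(reference)
    for position, alt in snps.items():
        if not 1 <= position <= len(bases):
            raise ValueError(f"position {position} is outside the reference")
        if len(alt) != 1:
            raise ValueError("this sketch only handles single-base substitutions")
        bases[position - 1] = alt
    return "".join(bases)


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one FASTA record, wrapping the sequence at `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Toy reference and variants, for illustration only.
consensus = apply_snps("ACGTACGT", {2: "T", 8: "A"})
print(to_fasta("SRR0000001", consensus), end="")
```

Insertions, deletions, and low-coverage masking are left out on purpose; those are the cases where a dedicated tool is needed.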

4. Run Nextstrain for analyses and visualization

Finally, we add metadata.tsv and sequences.fasta and run snakemake -p to produce Nextstrain results, which can be viewed with:

auspice view --datasetDir auspice

Team

  • Our fearless leader: Vadim Zalunin (vadimzalunin)
  • Alison Schaefer (amaiellu)
  • Joe McGirr (joemcgirr)

Forthcoming features

We are automating the process of wrangling metadata for full genome assemblies with ncov_assemblies.sh

Dependencies

See Dockerfile
