Labs around the world are sequencing coronavirus (hCoV-19) to monitor how it spreads and evolves. However, most sequences are not immediately publicly available. Here we provide a workflow to identify publicly available hCoV-19 sequences that have been uploaded to the NCBI SRA database. We map SRA reads to the hCoV-19 Wuhan reference genome, identify variable sites, create a consensus fasta sequence, and analyse these sequences with Nextstrain.
- Identify hCoV-19 sequences
- using the NCBI SRA Taxonomy Analysis Tool (STAT)
- using whole genome assemblies
- Extract metadata to match Nextstrain input
- Download SRA data
- Run Nextstrain for analyses and visualization
The NCBI SRA Taxonomy Analysis Tool (STAT) can identify SRA runs with kmers matching hCoV-19. We identified runs containing >1000 matching kmers. Alternatively, there are full genome assemblies available for hCoV-19 that can be used. Assemblies can be run without generating alignments and consensus sequences (see below).
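The k-mer threshold described above can be sketched as a simple filter. This is only an illustration: the table layout, the column names `run` and `matching_kmers`, and the accessions are made up, and the real STAT output will look different.

```python
import csv
import io

# Hypothetical STAT summary: one row per SRA run with the number of k-mers
# assigned to hCoV-19. Column names and accessions are invented for this sketch.
EXAMPLE_STAT_TSV = (
    "run\tmatching_kmers\n"
    "SRR0000001\t25430\n"
    "SRR0000002\t812\n"
    "SRR0000003\t1000\n"
    "SRR0000004\t1001\n"
)


def select_runs(stat_tsv, min_kmers=1000):
    """Return run accessions with strictly more than min_kmers matching k-mers."""
    reader = csv.DictReader(io.StringIO(stat_tsv), delimiter="\t")
    return [row["run"] for row in reader if int(row["matching_kmers"]) > min_kmers]


# A run with exactly 1000 matching k-mers is excluded, matching ">1000".
print(select_runs(EXAMPLE_STAT_TSV))  # prints ['SRR0000001', 'SRR0000004']
```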
We created a Snakemake workflow that creates metadata.tsv, containing metadata for SRA runs identified with STAT.
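As a rough picture of what such a metadata table might look like, the sketch below writes a few records to tab-separated text. The column set, the `?` placeholder for missing values, and the example records are assumptions for illustration, not the columns this workflow actually emits; check them against the Nextstrain build you are using.

```python
import csv
import io

# Assumed column set, loosely modeled on common Nextstrain metadata fields.
FIELDS = ["strain", "virus", "date", "country", "sra_run"]


def to_metadata_tsv(records):
    """Serialize a list of dicts to TSV text, filling missing fields with '?'."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=FIELDS, delimiter="\t", lineterminator="\n", restval="?"
    )
    writer.writeheader()
    for rec in records:
        # Keep only known, non-empty fields; restval supplies the rest.
        writer.writerow({k: rec[k] for k in FIELDS if rec.get(k)})
    return out.getvalue()


# Made-up records; the second one has no collection date.
EXAMPLE_RECORDS = [
    {"strain": "SRR0000001", "virus": "ncov", "date": "2020-03-01",
     "country": "USA", "sra_run": "SRR0000001"},
    {"strain": "SRR0000004", "virus": "ncov", "country": "China",
     "sra_run": "SRR0000004"},
]

print(to_metadata_tsv(EXAMPLE_RECORDS), end="")
```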
The Snakemake workflow also generates sequences.fasta by:
- downloading SRA reads
- mapping them to the Wuhan reference genome (hisat2)
- identifying variable sites
- generating consensus sequences (bcftools)
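The steps above can be sketched as a list of shell commands for one run. Only hisat2 and bcftools are named in this document; the use of `fasterq-dump` for downloading, `samtools sort` for producing a sorted BAM, single-end reads, and all file names are assumptions made for this sketch, and the commands in the actual Snakemake rules may differ. The function only builds the command strings and does not execute anything.

```python
def pipeline_commands(run, ref_fasta="reference.fasta", index="reference_index"):
    """Build (but do not run) one plausible command sequence for a single SRA run.

    Assumes single-end reads, so fasterq-dump writes <run>.fastq.
    """
    return [
        # Download reads from SRA (assumed tool: sra-tools fasterq-dump).
        f"fasterq-dump {run}",
        # Index the reference genome, then map reads with hisat2.
        f"hisat2-build {ref_fasta} {index}",
        f"hisat2 -x {index} -U {run}.fastq -S {run}.sam",
        # Sort alignments (assumed tool: samtools).
        f"samtools sort -o {run}.bam {run}.sam",
        # Identify variable sites with bcftools.
        f"bcftools mpileup -f {ref_fasta} {run}.bam"
        f" | bcftools call -mv -Oz -o {run}.vcf.gz",
        f"bcftools index {run}.vcf.gz",
        # Apply the called variants to the reference to get a consensus.
        f"bcftools consensus -f {ref_fasta} -o {run}.consensus.fasta {run}.vcf.gz",
    ]


for command in pipeline_commands("SRR0000001"):
    print(command)
```

Printing the commands first makes it easy to inspect them before wiring anything into a workflow rule.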
Finally, we add metadata.tsv and sequences.fasta, then run snakemake -p to produce Nextstrain results, which can be viewed with:
auspice view --datasetDir auspice
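Nextstrain pairs each sequence with its metadata row by name, so a quick consistency check between the two files can catch problems before a build. The sketch below assumes the metadata has a `strain` column and that the sequence name is the first whitespace-delimited token of each FASTA header; the example data are made up.

```python
import csv
import io


def fasta_names(fasta_text):
    """Return the first token of every non-empty FASTA header line."""
    return [
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    ]


def metadata_strains(metadata_tsv):
    """Return the values of the (assumed) 'strain' column."""
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    return [row["strain"] for row in reader]


def mismatches(fasta_text, metadata_tsv):
    """Return (sequences without metadata, metadata rows without sequences)."""
    seqs = set(fasta_names(fasta_text))
    meta = set(metadata_strains(metadata_tsv))
    return sorted(seqs - meta), sorted(meta - seqs)


# Made-up example: one name is missing on each side.
EXAMPLE_FASTA = ">SRR0000001\nACGT\n>SRR0000002 extra text\nACGT\n"
EXAMPLE_METADATA = "strain\tdate\nSRR0000001\t2020-03-01\nSRR0000003\t2020-03-02\n"

print(mismatches(EXAMPLE_FASTA, EXAMPLE_METADATA))
# prints (['SRR0000002'], ['SRR0000003'])
```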
- Our fearless leader: Vadim Zalunin (vadimzalunin)
- Alison Schaefer (amaiellu)
- Joe McGirr (joemcgirr)
We are automating the process of wrangling metadata for full genome assemblies with ncov_assemblies.sh.
See Dockerfile