Skip to content

Coronavirus annotation

Eric Nawrocki edited this page Aug 12, 2021 · 51 revisions

How to annotate SARS-CoV-2 sequences with VADR v1.3 or later version:

  1. Download and install the latest version of VADR, following the instructions on this page

  2. Download the latest SARS-CoV-2 vadr models (version 1.3-1, gzipped tarball) from here, unpack them (e.g. tar xfz <tarball.gz>). Note the path to the directory name created (<sarscov2-models-dir-path>) for step 3.

  3. Remove terminal ambiguous nucleotides from your input fasta sequence file using the fasta-trim-terminal-ambigs.pl script in $VADRSCRIPTSDIR/miniscripts/. The GenBank SARS-CoV-2 submission processing pipeline removes ambiguous nucleotides from the beginning and ends of sequences and also sequences that are less than 50nt or more than 30,000nt (after trimming) prior to running VADR, so to make sure your local VADR results are consistent with GenBank's VADR results, you should trim terminal ambiguous nucleotides first.

    WARNING: the fasta-trim-terminal-ambigs.pl script will not exactly reproduce the trimming that the GenBank pipeline does in some rare cases, but should fix the large majority of the discrepancies you might see between local VADR results and GenBank results.

    To remove terminal ambiguous nucleotides from your sequence file <input-fasta-file> and to remove short and long sequences to create a new trimmed file <trimmed-fasta-file>, execute:

$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 50 --maxlen 30000 <input-fasta-file> > <trimmed-fasta-file>
  1. Run the v-annotate.pl program on an input trimmed fasta file with SARS-CoV-2 sequences using the recommended command and options below (the command is long so you will likely have to scroll to the right to view the entire command).

    NOTE: The options below have changed for vadr 1.3. Also, the following command runs multithreaded on up to 8 CPUs, and so is only recommended if you have at least 8 CPUs and 16Gb RAM available. To run on <n> CPUs instead, replace --cpu 8 with --cpu <n>. To run single threaded on a single CPU remove the --cpu 8 option. The --split and --cpu options are incompatible with -p.

v-annotate.pl --split --cpu 8 --glsearch -s -r --nomisc --mkey sarscov2 --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn --mdir <sarscov2-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create>
  1. (MOSTLY RELEVANT FOR ADVANCED USERS) OPTIONALLY map non-NC_045512 model coordinates in the output .alt.list file to NC_045512 coordinates using the vadr-map-model-coords.pl script in $VADRSCRIPTSDIR/miniscripts/. The output alt.list file includes information on all fatal alerts that cause sequences to fail along with the relevant model coordinates for those alerts. Some users may be interested in converting all non-NC_045512 model coordinates to NC_045512 coordinates to help with further analysis. To add an additional tab-delimited field with NC_045512 coordinates for each alert, execute:
$VADRSCRIPTSDIR/miniscripts/vadr-map-model-coords.pl <output-directory>/<output-alt-list-file> <sarscov2-model-dir-path>/sarscov2.mmap NC_045512

SARS-CoV-2 VADR models

As of April 13, 2021, the VADR model library used by GenBank for SARS-CoV-2 annotation (vadr-models-sarscov2-1.3-1 model library) includes four SARS-CoV-2 models: NC_045512, NC_045512-del28254, NC045512-MW422255 (B.1.1.7), and NC_045512-MW809059 (B.1.525). You can determine which model was used to annotate any given input sequence in the .sqa vadr output files described more here.

  1. NC_045512 model: based on the RefSeq NC_045512.2 sequence, has a length of 29903 nt.

  2. NC_045512-del28254 model: identical to the NC_045512 model except that the single nucleotide at position 28254 is deleted. This single nucleotide deletion affects the stop codon for the ORF8 CDS relative to the NC_045512 RefSeq, extending the length of ORF8 by four amino acids in the NC_045512-del28254 model relative to the NC_045512 RefSeq model. Length is 29902 nt.

  3. NC_045512-MW422255 model: aimed at facilitating submission of sequences from the B.1.1.7 lineage. The length of this model is 29884 nt. It is based on the MW422255.1 sequence but modified as follows:

    • extending the MW422255 sequence by 54 nt on the 5' end and 67 nt on the 3' end so the 5' and 3' ends match up with the NC_045512 RefSeq sequence, using the nucleotides from NC_045512
    • replacing the 8 N nucleotides with the corresponding nucleotide from NC_045512
  4. NC_045512-MW809059 model: aimed at facilitating submission of sequences from the lineage currently referred to as B.1.525. The length of this model is 29830 nt. It is based on the MW809059.1 sequence but modified as follows:

    • extending the MW809059 sequence by 2 nt on the 5' end and 50 nt on the 3' end so the 5' and 3' ends match up with the NC_045512 RefSeq sequence, using the nucleotides from NC_045512

The following table includes information on the 12 CDS features in the NC_045512 RefSeq and the two variant models:

CDS product gene NC_045512 model positions NC_045512-MW422255 model positions NC_045512-MW809059 model positions
ORF1ab polyprotein ORF1ab 266-13468,13468-21555 266-13459,13459-21546 266-13459,13459-21546
ORF1a polyprotein ORF1ab 266-13483 266-13474 266-13474
surface glycoprotein S 21563-25384 21554-25366 21554-25366
ORF3a protein ORF3a 25393-26220 25375-26202 25375-26202
envelope protein E 26245-26472 26227-26454 26227-26454
membrane glycoprotein M 26523-27191 26505-27173 26505-27173
ORF6 protein ORF6 27202-27387 27184-27369 27184-27366
ORF7a protein ORF7a 27394-27759 27376-27741 27373-27738
ORF7b ORF7b 27756-27887 27738-27869 27735-27866
ORF8 protein ORF8 27894-28259 27876-27956 27873-28238
nucleocapsid phosphoprotein N 28274-29533 28255-29514 28253-29509
ORF10 protein ORF10 29558-29674 29539-29655 29534-29650

The ORF8 protein CDS is italicized above because it is 285nt (95aa) shorter in the NC_045512-MW422255 model due to an earlier stop codon and it is annotated as truncated ORF8 protein in the output VADR feature table file in sequences classified to the NC_045512-MW422255 model.

To create a text file that shows how each position of one model maps to another model, you can use the vadr-map-two-complete-models.pl script in the $VADRSCRIPTSDIR/miniscripts directory. For example:

$VADRSCRIPTSDIR/miniscripts/vadr-map-two-complete-models.pl <sarscov2-models-dir-path>/sarscov2.mmap NC_045512-MW422255 NC_045512`

Many alerts/errors in ORF3a, ORF6, ORF7a, ORF7b, ORF8, and ORF10 do not cause a sequence to FAIL

As of August 5, 2021, many common alerts (e.g. early stop codons, frameshift mutations, etc.) in ORF3a, ORF6, ORF7a, ORF7b, ORF8, and ORF10 will no longer cause a sequence to fail VADR as they did previously. Instead such problems will cause that feature to not be annotated as a CDS but rather as a misc_feature, and no protein translation product and corresponding entry in the GenBank Protein database will be created. (Since February 2021, ORF8 was misc_featur-izable in this way, but ORF3a, ORF6, ORF7a, ORF7b, and ORF10 were not.)

For a list of what specific alerts for a CDS will cause the feature to be annotated as a misc_feature instead of failing the sequence, see the third column of this table or use the --alt_list option with v-annotate.pl.

A sequence will fail if 3 or more of the listed ORF features are annotated as misc_features with the nmiscftr alert (TOO_MANY_MISC_FEATURES). In the GenBank submission portal, this alert will trigger manual review by GenBank staff.

Because the alerts related to these problems no longer cause failure they will not be reported in the .alt.list or GenBank submission portal detailed error report .tsv file but information on them will still be output to the VADR .alt output file.


This section shows output from an example v-annotate.pl annotation of three SARS-CoV-2 sequences from GenBank using the same command and options that GenBank currently uses to screen incoming SARS-CoV-2 sequences.

The fasta file of those three sequences can be downloaded here.

(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)

For this example, the SARS-CoV-2 model directory is in /usr/local/vadr-models-sarscov2-1.3-1 and the pretrim.sars-cov2.4.fa sequence file is in the current directory. We will create a new file sars-cov2.4.fa with trimmed sequences in the next step.

As explained above, remove terminal ambiguous nucleotides and sequences that are shorter than 50nt or longer than 30,000nt with the command:

$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 50 --maxlen 30000 pretrim.sars-cov2.4.fa > sars-cov2.4.fa

Next, to annotate the trimmed sequences using the recommended v-annotate.pl options for SARS-CoV-2, run the following command (scroll to the right to see full command), which will create a new directory called my4 into which VADR output files will be written.

v-annotate.pl --split --cpu 8 --glsearch -s -r --nomisc --mkey sarscov2 --lowsim5seq 6 --lowsim3seq 6 --alt_fail lowscore,insertnn,deletinn --mdir /usr/local/vadr-models-sarscov2-1.3-1 sars-cov2.4.fa my4

The options used are explained further below.

When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:

# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.3 (Aug 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Tue Aug  3 09:49:47 2021
# $VADRBIOEASELDIR:  /usr/local/vadr-install-1.3/Bio-Easel
# $VADRBLASTDIR:     /usr/local/vadr-install-1.3/ncbi-blast/bin
# $VADREASELDIR:     /usr/local/vadr-install-1.3/infernal/binaries
# $VADRINFERNALDIR:  /usr/local/vadr-install-1.3/infernal/binaries
# $VADRMODELDIR:     /usr/local/vadr-install-1.3/vadr-models-calici
# $VADRSCRIPTSDIR:   /usr/local/vadr-install-1.3/vadr
#
# sequence file:                                                                  sars-cov2.4.fa
# output directory:                                                               my4
# specify that alert codes in <s> cause FAILure:                                  lowscore,insertnn,deletinn [--alt_fail]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  sarscov2 [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         /usr/local/vadr-models-sarscov2-1.3-1 [--mdir]
# in feature table for failed seqs, never change feature type to misc_feature:    yes [--nomisc]
# lowsim5s/LOW_SIMILARITY_START minimum length is <n>:                            6 [--lowsim5seq]
# lowsim3s/LOW_SIMILARITY_END minimum length is <n>:                              6 [--lowsim3seq]
# use max length ungapped region from blastn to seed the alignment:               yes [-s]
# replace stretches of Ns with expected nts, where possible:                      yes [-r]
# split input file into chunks, run each chunk separately:                        yes [--split]
# parallelize across <n> CPU workers (requires --split or --glsearch):            8 [--cpu]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Next, v-annotate.pl will output information as it proceeds through different steps of the analysis:

# Validating input                                                                        ... done. [    0.3 seconds]
# Splitting sequence file into chunks to run independently in parallel on 8 processors    ... done. [    0.9 seconds]
# Executing 4 scripts in parallel on 8 processors to process 4 partition(s) of all 4 sequence(s) ... 
#	   4 of    4 jobs finished (0.1 minutes spent waiting)
# done. [   12.4 seconds]
# Merging and finalizing output                                                           ... done. [    0.8 seconds]

With the --split --cpu 8 options, the input fasta script is split up into chunks, in this case placing one sequence per chunk but typically about 10 sequences per chunk for larger files, and runs v-annotate.pl separately on those chunks on 8 different CPUs in parallel. When all sequences are finished processing the main script merges the output together.

The v-annotate.pl output then includes a summary of the classification of sequences, and the alerts reported:

# Summary of classified sequences:
#
#                                           num   num   num
#idx  model               group         subgroup    seqs  pass  fail
#---  ------------------  ------------  ----------  ----  ----  ----
1     NC_045512           Sarbecovirus  SARS-CoV-2     3     2     1
2     NC_045512-MW422255  Sarbecovirus  SARS-CoV-2     1     0     1
#---  ------------------  ------------  ----------  ----  ----  ----
-     *all*               -             -              4     2     2
-     *none*              -             -              0     0     0
#---  ------------------  ------------  ----------  ----  ----  ----
#
# Summary of reported alerts:
#
#     alert     causes   short                            per    num   num  long
#idx  code      failure  description                     type  cases  seqs  description
#---  --------  -------  ---------------------------  -------  -----  ----  -----------
1     cdsstopn  yes*     CDS_HAS_STOP_CODON           feature      3     2  in-frame stop codon exists 5' of stop position predicted by homology to reference
2     cdsstopp  yes*     CDS_HAS_STOP_CODON           feature      3     2  stop codon in protein-based alignment
3     peptrans  yes*     PEPTIDE_TRANSLATION_PROBLEM  feature     26     1  mat_peptide may not be translated because its parent CDS has a problem
#---  --------  -------  ---------------------------  -------  -----  ----  -----------

And finally a list of the output files created:

# Output printed to screen saved in:                              my4.vadr.log
# List of executed commands saved in:                             my4.vadr.cmd
# List and description of all output files saved in:              my4.vadr.filelist
# esl-seqstat -a output for input fasta file saved in:            my4.vadr.seqstat
# 5 column feature table output for passing sequences saved in:   my4.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in:   my4.vadr.fail.tbl
# list of passing sequences saved in:                             my4.vadr.pass.list
# list of failing sequences saved in:                             my4.vadr.fail.list
# list of alerts in the feature tables saved in:                  my4.vadr.alt.list
# fasta file with passing sequences saved in:                     my4.vadr.pass.fa
# fasta file with failing sequences saved in:                     my4.vadr.fail.fa
# per-sequence tabular annotation summary file saved in:          my4.vadr.sqa
# per-sequence tabular classification summary file saved in:      my4.vadr.sqc
# per-feature tabular summary file saved in:                      my4.vadr.ftr
# per-model-segment tabular summary file saved in:                my4.vadr.sgm
# per-alert tabular summary file saved in:                        my4.vadr.alt
# alert count tabular summary file saved in:                      my4.vadr.alc
# per-model tabular summary file saved in:                        my4.vadr.mdl
# alignment doctoring tabular summary file saved in:              my4.vadr.dcr
# ungapped seed alignment summary file (-s) saved in:             my4.vadr.sda
# replaced stretches of Ns summary file (-r) saved in:            my4.vadr.rpn
#
# All output files created in directory ./my4/
#
# Elapsed time:  00:00:13.71
#                hh:mm:ss
# 
[ok]

Note that all the output files will be in the newly created my4 directory. The Summary of classified sequences listed that two sequences passed and two failed. The file my4.vadr.pass.list, lists the two sequences that passed:

MT159720.1
MT308693.1

and my4.vadr.fail.list lists the two sequences that failed:

MT159720.1/1406-G-to-T
MW422255.1/21610-T-to-A

Also, FASTA-formatted sequence files for each the passing and failing sequences are my4.vadr.pass.fa and my4.vadr.fail.fa.

For the two sequences that passed, the annotation is available in the output my4.vadr.pass.tbl file and for the two sequences that failed the annotation is in the my4.vadr.fail.tbl file.

my.vadr.pass.tbl: (with the middle of the table for each sequence removed for brevity)

>Feature MT159720.1
266	21555	gene
			gene	ORF1ab
266	13468	CDS
13468	21555
			product	ORF1ab polyprotein
			exception	ribosomal slippage
			protein_id	MT159720.1_1
266	13483	CDS
			product	ORF1a polyprotein
			protein_id	MT159720.1_2
266	805	mat_peptide
			product	leader protein
			protein_id	MT159720.1_1

...snip...

28274	29533	gene
			gene	N
28274	29533	CDS
			product	nucleocapsid phosphoprotein
			protein_id	MT159720.1_11
29558	29674	gene
			gene	ORF10
29558	29674	CDS
			product	ORF10 protein
			protein_id	MT159720.1_12
29609	29644	stem_loop
			note	Coronavirus 3' UTR pseudoknot stem-loop 1
29629	29657	stem_loop
			note	Coronavirus 3' UTR pseudoknot stem-loop 2
29728	29768	stem_loop
			note	Coronavirus 3' stem-loop II-like motif (s2m)
>Feature MT308693.1
217	21506	gene
			gene	ORF1ab
217	13419	CDS
13419	21506
			product	ORF1ab polyprotein
			exception	ribosomal slippage
			protein_id	MT308693.1_1
217	13434	CDS
			product	ORF1a polyprotein
			protein_id	MT308693.1_2
217	756	mat_peptide
			product	leader protein
			protein_id	MT308693.1_1

...snip...

28225	29484	gene
			gene	N
28225	29484	CDS
			product	nucleocapsid phosphoprotein
			protein_id	MT308693.1_11
29509	29625	gene
			gene	ORF10
29509	29625	CDS
			product	ORF10 protein
			protein_id	MT308693.1_12
29560	29595	stem_loop
			note	Coronavirus 3' UTR pseudoknot stem-loop 1
29580	29608	stem_loop
			note	Coronavirus 3' UTR pseudoknot stem-loop 2
29679	29719	stem_loop
			note	Coronavirus 3' stem-loop II-like motif (s2m)

And the second sequence in the my.vadr.fail.tbl (with the middle removed for brevity):

>Feature MW422255.1/21610-T-to-A
212	21492	gene
			gene	ORF1ab
212	13405	CDS
13405	21492
			product	ORF1ab polyprotein
			exception	ribosomal slippage
			protein_id	MW422255.1/21610-T-to-A_1
212	13420	CDS
			product	ORF1a polyprotein
			protein_id	MW422255.1/21610-T-to-A_2
212	751	mat_peptide
			product	leader protein
			protein_id	MW422255.1/21610-T-to-A_1

...snip...

29556	29584	stem_loop
			note	Coronavirus 3' UTR pseudoknot stem-loop 2
29655	29695	stem_loop
			note	Coronavirus 3' stem-loop II-like motif (s2m)

Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (CDS:surface glycoprotein) in-frame stop codon exists 5' of stop position predicted by homology to reference [TAA, shifted S:3702,M:3702]; seq-coords:21608..21610:+; mdl-coords:21662..21664:+; mdl:NC_045512-MW422255;
ERROR: CDS_HAS_STOP_CODON: (CDS:surface glycoprotein) stop codon in protein-based alignment [-]; seq-coords:21608..21610:+; mdl-coords:21662..21664:+; mdl:NC_045512-MW422255;

Feature table format is described at https://www.ncbi.nlm.nih.gov/Sequin/table.html.

Note that the end of the fail.tbl file lists ERRORs for MW422255.1/21610-T-to-A, the fourth sequence in the input file, which is identical to GenBank sequence MW422255.1 with position 21610 changed from a T to a A to purposefully introduce an early stop codon in the surface glycoprotein (spike) for purposes of this example. Early stop codons are one reason a sequence can fail. Note the CDS_HAS_STOP_CODON errors at the end of the feature table.

The annotation information is also available in other files with different formats, such as the my4/my4.vadr.ftr file, which may be easier to parse for some applications.

How to interpret VADR alert/error messages and other output

For examples of alerts/errors and more information on how to interpret the VADR output related to those alerts in the output feature tables, .alt.list file and the GenBank submission portal detailed error report .tsv files, see this vadr documentation page.

For more information on how to interpret VADR results and additional files, including information on file formats, see the vadr documentation pages, linked to from here, or the VADR 1.0 paper (reference below).

Optionally converting non-NC_045512 model coordinates to NC_045512 coordinates in .alt.list and GenBank detailed error report .tsv output files

VADR uses four different SARS-CoV-2 models as explained above and reports which model is used to annotate each input sequence in various output files including the .alt.list file and the 'detailed error report .tsv' file created by the GenBank submission portal for any submission which includes at least one failure (for which the user selected not to auto-remove failing sequences). The alt.list and .tsv files include information on each fatal alert or error.

For example, the file my4/my4.vadr.alt.list created above is shown below (middle lines omitted for brevity), with model information in the model field. The .tsv file from the GenBank submission portial contains the identical fields and information as the .alt.list.

#sequence	model	feature-type	feature-name	error	seq-coords	mdl-coords	error-description
MT159720.1/1406-G-to-T	NC_045512	CDS	ORF1ab polyprotein	CDS_HAS_STOP_CODON	1406..1408:+	1406..1408:+	in-frame stop codon exists 5' of stop position predicted by homology to reference [TAG, shifted S:20147,M:20147]
MT159720.1/1406-G-to-T	NC_045512	CDS	ORF1ab polyprotein	CDS_HAS_STOP_CODON	1406..1408:+	1406..1408:+	stop codon in protein-based alignment [-]
MT159720.1/1406-G-to-T	NC_045512	CDS	ORF1a polyprotein	CDS_HAS_STOP_CODON	1406..1408:+	1406..1408:+	in-frame stop codon exists 5' of stop position predicted by homology to reference [TAG, shifted S:12075,M:12075]

...snip...


MT159720.1/1406-G-to-T	NC_045512	mat_peptide	2'-O-ribose methyltransferase	PEPTIDE_TRANSLATION_PROBLEM	-	-	mat_peptide may not be translated because its parent CDS has a problem [-]
MT159720.1/1406-G-to-T	NC_045512	mat_peptide	nsp11	PEPTIDE_TRANSLATION_PROBLEM	-	-	mat_peptide may not be translated because its parent CDS has a problem [-]
MW422255.1/21610-T-to-A	NC_045512-MW422255	CDS	surface glycoprotein	CDS_HAS_STOP_CODON	21608..21610:+	21662..21664:+	in-frame stop codon exists 5' of stop position predicted by homology to reference [TAA, shifted S:3702,M:3702]
MW422255.1/21610-T-to-A	NC_045512-MW422255	CDS	surface glycoprotein	CDS_HAS_STOP_CODON	21608..21610:+	21662..21664:+	stop codon in protein-based alignment [-]

The .alt.list and .tsv files also include the relevant model positions for each alert in the mdl-coords field. Note that the MW422255.1/21610-T-to-A sequence matched best to the NC_045512-MW422255 model. Some users who use only the NC_045512 RefSeq for any sequence analysis prior to VADR may want to convert non-NC_045512 model coordinates into the corresponding positions in the NC_045512 RefSeq. This can be done with the vadr-map-model-coords.pl that is included with VADR in the miniscripts/ directory.

You can run that script to add an additional tab-delimited field at the end of each line with NC_045512 model coordinates for each alert with the following command:

perl $GDIR/vadr/miniscripts/vadr-map-model-coords.pl my4/my4.vadr.alt.list <sarscov2-models-dir-path>/sarscov2.mmap NC_045512

For more explanation of the information in the .alt.list and .tsv file format including examples of many of the different types of alerts see this page.

Explanation of options used in example command

The options used in the above command are the recommended set of options for SARS-CoV-2 annotation because they are currently (as of August 5, 2021) being used by NCBI sequence indexers when they evaluate an incoming SARS-CoV-2 sequence submission. The options are each briefly explained in the table below. More information can be found here, and the -r and -s options are explained more below the table as well.

option explanation
--split split input file into chunks of about 300Kb and run each chunk separately (300Kb can be changed to <n> by adding option --nkb <n>
--cpu 8 for input sequence files > 300Kb, run multi-threaded by parallelizing across up to 8 CPU workers (8 can be changed to <n1> with --cpu <n1>, 300Kb can be changed to <n> by adding option --nkb <n>), requires --split
--glsearch use the glsearch program from the FASTA package for the alignment stage, not cmalign, reducing required memory
-s turn on the seed acceleration heuristic: use the max length ungapped region from blastn to seed the alignment
-r turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible
--nomisc specify that features for failing sequences not be changed to misc_features in the output .tbl file
--mkey sarcov2 use the model files with prefix sarscov2 in the directory from --mdir
--lowsim5seq 6 set the minimum length for an alert related to sequence of low similarity to the RefSeq at the 5' sequence end to 6 nt
--lowsim3seq 6 set the minimum length for an alert related to sequence of low similarity to the RefSeq at the 3' sequence end to 6 nt
--alt_fail lowscore,insertnn,deletinn specify that lowscore alert and large insertions and deletions (insertnn,deletinn) cause a sequence to fail
--mdir /usr/local/vadr-models-sarscov2-1.3-1 specify that the models to use are in directory /usr/local/vadr-models-sarscov2-1.3-1

The -r option

The -r option replaces stretches of Ns in the input sequences with the expected nucleotides from the RefSeq. Be cautious, as this strategy actually replaces Ns in your sequence prior to determination of annotation. If this is not what you want to do, then do not use -r. Using -r is the current (as of May 7, 2020) practice for NCBI indexers analyzing incoming SARS-CoV-2 sequences, which is why it is included as a recommended option here. By doing this the Ns are assumed to be the 'expected' sequence in the corresponding regions for purposes of annotation. More information on -r, including information on other related options is here.

The second sequence in the sars-cov2.4.fa file includes 344 Ns, most of which are contained within consecutive stretches of 36 and 290 Ns in two regions, from sequence positions 11487-11522 and 19237-19526. With the -r option, v-annotate.pl replaces the Ns in these two stretches with the corresponding 'expected' nucleotides from the NC_045512 RefSeq, and then determines annotation with that doctored sequence. Some information on the Ns in each sequence including on those that were replaced is output to a file with suffix .rpn (format described here). For the example above, the my3.vadr.rpn file looks like this (scroll to right to see full file):

#seq  seq                        seq                            num_Ns  num_Ns  fract_Ns  ngaps  ngaps  ngaps    ngaps    ngaps      nnt      nnt  replaced_coords      
#idx  name                       len  model               p/f      tot      rp        rp    tot    int     rp  rp-full  rp-part  rp-full  rp-part  seq(S),mdl(M),#rp(N);
#---  -----------------------  -----  ------------------  ----  ------  ------  --------  -----  -----  -----  -------  -------  -------  -------  ---------------------
1     MT159720.1               29882  NC_045512           PASS       0       0         -      0      0      0        0        0        0        0  -                    
2     MT308693.1               29788  NC_045512           PASS     344     326     0.948      2      2      2        2        0      326        0  S:11487..11522,M:11536..11571,N:36/36;S:19237..19526,M:19286..19575,N:290/290;
3     MT159720.1/1406-G-to-T   29882  NC_045512           FAIL       0       0         -      0      0      0        0        0        0        0  -                    
4     MW422255.1/21610-T-to-A  29763  NC_045512-MW422255  FAIL       8       0     0.000      0      0      0        0        0        0        0  -                    

If we run the above command without -r, the Ns are not replaced and the MT308693.1 sequence fails because the N regions trigger alerts/errors to be reported. Specifically, the following three errors are reported in the fail.tbl file, amongst other PEPTIDE_TRANSLATION_PROBLEM errors for mature peptides (not shown):

ERROR: LOW_FEATURE_SIMILARITY: (gene:ORF1ab) region within annotated feature that does not match a CDS lacks significant similarity [36 nt overlap b/t low similarity region of length 36 (11487..11522) and annotated feature (217..21506)]; seq-coords:11487..11522:+; mdl-coords:11536..11571:+; mdl:NC_045512;
ERROR: LOW_FEATURE_SIMILARITY: (gene:ORF1ab) region within annotated feature that does not match a CDS lacks significant similarity [290 nt overlap b/t low similarity region of length 290 (19237..19526) and annotated feature (217..21506)]; seq-coords:19237..19526:+; mdl-coords:19286..19575:+; mdl:NC_045512;
ERROR: INDEFINITE_ANNOTATION_END: (CDS:ORF1ab polyprotein) protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [2271>8, valid stop codon in nucleotide-based prediction]; seq-coords:19236..21506:+; mdl-coords:21555..21555:+; mdl:NC_045512;

Note that the regions that are causing the LOW_FEATURE_SIMILARITY errors are stretches of Ns. Using -r allows some sequences like this that only fail due to the Ns to pass, and sometimes changes their annotation to be more accurate (based on the assumption that the replaced Ns represent the expected nucleotides at the corresponding positions).

The -s option

The -s option speeds up v-annotate.pl by identifying the maximum length ungapped alignment region using blastn, and fixing that part of the alignment and only using a more expensive alignment program (glsearch or cmalign) for the remainder of the sequence. This heuristic works particularly well for many SARS-CoV-2 sequences that are highly similar (~99% identical) to the RefSeq NC_045512. If we run the above example command without -s, it would require about 30 seconds per sequence instead of less than ten seconds per sequence. When -s is used, an additional output file with suffix .sda is output (format described here). More information on -s, including information on other related options is here.


The command above will only compare your input sequences to the four models described above (NC_045512, NC_045512-del28254, NC_045512-MW422255, and NC_045512-MW809059). If you want to annotate v-annotate.pl to perform the additional step of checking if each input sequence is more similar to SARS-CoV-2 or to a different Coronaviridae RefSeq, or if your input file contains non-SARS-CoV-2 Coronaviridae sequences, you can download a different model file that includes the four SARS-CoV-2 models and 54 other Coronaviridae RefSeqs (v1.3-1, gzipped tarball) from here. After downloading and unpacking that model set, change the --mkey option in the example command above from

--mkey sarscov2

to

--mkey corona

But be aware that changing this option will result in slower running times.

You can use VADR's v-build.pl program to build additional models of Coronaviridae GenBank sequences. More information is here.



Reference

  • The recommended citation for using VADR is: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3

Questions, comments, feature or model requests? Send a mail to eric.nawrocki@nih.gov.