-
Notifications
You must be signed in to change notification settings - Fork 27
Coronavirus annotation
Identifying and annotating Coronaviridae sequences other than SARS-CoV-2 using a larger VADR model library
-
Download and install the latest version of VADR, following the instructions on this page
-
Download the latest coronavirus vadr models (gzipped tarball) from this FTP page, unpack them (e.g.
tar xfz <tarball.gz>
). Note the path to the directory name created (<sarscov2-models-dir-path>
) for step 3. -
Run the
v-annotate.pl
program on an input fasta file with SARS-CoV-2 sequences using the recommended command and options below (the command is long so you will likely have to scroll to the right to view the entire command).NOTE: With 1.2 SARS-CoV-2 annotation using VADR now requires 2Gb of RAM per thread, down from 64Gb RAM. This command runs multithreaded on 8 CPUs, and so is only recommended if you have 8 CPUs and 16Gb RAM available. To run on
<n>
CPUs instead, replace--cpu 8
with--cpu <n>
. To run single threaded on a single CPU remove the--cpu 8
option.
v-annotate.pl --cpu 8 --split -s -r --nomisc --mkey sarscov2 --lowsim5term 2 --lowsim3term 2 --fstlowthr 0.0 --alt_fail lowscore,fsthicnf,fstlocnf,insertnn,deletinn --mdir <sarscov2-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create>
As of April 13, 2021, the VADR model library used by GenBank for
SARS-CoV-2 annotation (vadr-models-corona-1.2-2
model
library)
includes four SARS-CoV-2 models: NC_045512
, NC_045512-del28254
,
NC045512-MW422255
(B.1.1.7), and NC_045512-MW809059
(B.1.525). You
can determine which model was used to annotate any given input
sequence in the .sqa
vadr output files described more
here.
The NC_045512
model has a length of 29903 nt and is based on the
RefSeq NC_045512.2
sequence.
The NC_045512-del28254
model is identical to the NC_045512
model
except that the single nucleotide at position 28254 is deleted. This
single nucleotide deletion affects the stop codon for the ORF8 CDS
relative to the NC_045512
RefSeq, extending the length of ORF8 by
four amino acids in the NC_045512-del28254
model relative to the
NC_045512
RefSeq model.
The NC_045512-MW422255
model has a length of 29884 nt and is based
on the MW422255.1
sequence but modified as follows:
-
extending the MW422255 sequence by 54 nt on the 5' end and 67 nt on the 3' end so the 5' and 3' ends match up with the NC_045512 RefSeq sequence, using the nucleotides from NC_045512
-
replacing the 8 N nucleotides with the corresponding nucleotide from NC_045512
The addition of NC_045512-MW422255 is aimed at facilitating submission of sequences from lineages currently referred to as B.1.1.7, 501Y.V1 or 20I.
The NC_045512-MW809059
model has a length of 29830 nt and is based
on the MW809059.1
sequence but modified as follows:
- extending the MW809059 sequence by 2 nt on the 5' end and 50 nt on the 3' end so the 5' and 3' ends match up with the NC_045512 RefSeq sequence, using the nucleotides from NC_045512
The addition of NC_045512-MW422255 is aimed at facilitating submission of sequences from the lineage currently referred to as B.1.525.
The following table includes information on the 12 CDS features in the NC_045512 RefSeq and the two variant models:
CDS product | gene |
NC_045512 model positions |
NC_045512-MW422255 model positions |
NC_045512-MW809059 model positions |
---|---|---|---|---|
ORF1ab polyprotein | ORF1ab | 266-13468,13468-21555 | 266-13459,13459-21546 | 266-13459,13459-21546 |
ORF1a polyprotein | ORF1ab | 266-13483 | 266-13474 | 266-13474 |
surface glycoprotein | S | 21563-25384 | 21554-25366 | 21554-25366 |
ORF3a protein | ORF3a | 25393-26220 | 25375-26202 | 25375-26202 |
envelope protein | E | 26245-26472 | 26227-26454 | 26227-26454 |
membrane glycoprotein | M | 26523-27191 | 26505-27173 | 26505-27173 |
ORF6 protein | ORF6 | 27202-27387 | 27184-27369 | 27184-27366 |
ORF7a protein | ORF7a | 27394-27759 | 27376-27741 | 27373-27738 |
ORF7b | ORF7b | 27756-27887 | 27738-27869 | 27735-27866 |
ORF8 protein | ORF8 | 27894-28259 | 27876-27956 | 27873-28238 |
nucleocapsid phosphoprotein | N | 28274-29533 | 28255-29514 | 28253-29509 |
ORF10 protein | ORF10 | 29558-29674 | 29539-29655 | 29534-29650 |
The ORF8 protein CDS is italicized above because it
is 285nt (95aa) shorter in the NC_045512-MW422255 model due to an
earlier stop codon and it is annotated as truncated ORF8 protein
in
the output VADR feature table file in sequences classified to the
NC_045512-MW422255 model.
This section shows output from an example v-annotate.pl
annotation of three
SARS-CoV-2 sequences from GenBank using the same command and options that
GenBank currently uses to screen incoming SARS-CoV-2 sequences.
The fasta file of those three sequences can be downloaded here.
(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)
For this example, the coronavirus model directory is in /usr/local/vadr-models-corona-1.2-2
and the sars-cov2.3.fa
sequence file is in the current directory, and we will create a new directory called my3
into which output files will be written.
To annotate these sequences using the recommended v-annotate.pl
options for SARS-CoV-2, run the command (scroll to right to see full command):
v-annotate.pl --cpu 8 --split -s -r --nomisc --mkey sarscov2 --lowsim5term 2 --lowsim3term 2 --fstlowthr 0.0 --alt_fail lowscore,fsthicnf,fstlocnf,insertnn,deletinn --mdir /usr/local/vadr-models-corona-1.1.3-1 sars-cov2.3.fa my3
The options used are explained further below.
When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:
# v-annotate.pl :: classify and annotate sequences using a CM library
# VADR 1.2 (April 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date: Fri Apr 9 16:19:15 2021
# $VADRBIOEASELDIR: /panfs/pan1/infernal/notebook/21_0223_vadr_1p2_release/test-installs/vadr-install-dir/Bio-Easel
# $VADRBLASTDIR: /panfs/pan1/infernal/notebook/21_0223_vadr_1p2_release/test-installs/vadr-install-dir/ncbi-blast/bin
# $VADREASELDIR: /panfs/pan1/infernal/notebook/21_0223_vadr_1p2_release/test-installs/vadr-install-dir/infernal/binaries
# $VADRINFERNALDIR: /panfs/pan1/infernal/notebook/21_0223_vadr_1p2_release/test-installs/vadr-install-dir/infernal/binaries
# $VADRMODELDIR: /panfs/pan1/infernal/notebook/21_0223_vadr_1p2_release/test-installs/vadr-install-dir/vadr-models-calici
# $VADRSCRIPTSDIR: /panfs/pan1/infernal/notebook/21_0223_vadr_1p2_release/test-installs/vadr-install-dir/vadr
#
# sequence file: sars-cov2.3.fa
# output directory: my3
# specify that alert codes in <s> cause FAILure: lowscore,fsthicnf,fstlocnf,insertnn,deletinn [--alt_fail]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr': sarscov2 [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR: /panfs/pan1/dnaorg/virseqannot/code/vadr-models-sarscov2-1.2-2dev2 [--mdir]
# in feature table for failed seqs, never change feature type to misc_feature: yes [--nomisc]
# lowsim5{s,f}/LOW_{FEATURE_}SIMILARITY_START minimum length is <n>: 2 [--lowsim5term]
# lowsim3{s,f}/LOW_{FEATURE_}SIMILARITY_END minimum length is <n>: 2 [--lowsim3term]
# fstlocnf/POSSIBLE_FRAMESHIFT_LOW_CONF minimum average probability for alert is <x>: 0.0 [--fstlowthr]
# use max length ungapped region from blastn to seed the alignment: yes [-s]
# replace stretches of Ns with expected nts, where possible: yes [-r]
# split input file into chunks, run each chunk separately: yes [--split]
# parallelize across <n> CPU workers (requires --split or --glsearch): 8 [--cpu]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Next, v-annotate.pl
will output information as it proceeds through different steps of the analysis:
# Validating input ... done. [ 0.3 seconds]
# Splitting sequence file into chunks to run independently in parallel on 8 processors ... done. [ 0.0 seconds]
# Executing 1 script to process 1 partition(s) of all 3 sequence(s) ... done. [ 13.1 seconds]
# Merging and finalizing output ... done. [ 1.1 seconds]
With the --split --cpu 8
options, the input fasta script is split up
into chunks of about 10 full length SARS-CoV-2 each and runs
v-annotate.pl
separately on those chunks on 8 different CPUs in
parallel. When all sequences are finished processing the main script
merges the output together. In this example, there are only 3
sequences so there is only 1 chunk run on 1 CPU, but with larger input
sequence files, there will be more chunks run in parallel.
The v-annotate.pl
output then includes a summary of the classification of sequences, and the alerts reported:
# Summary of classified sequences:
#
# num num num
#idx model group subgroup seqs pass fail
#--- --------- ------------ ---------- ---- ---- ----
1 NC_045512 Sarbecovirus SARS-CoV-2 3 2 1
#--- --------- ------------ ---------- ---- ---- ----
- *all* - - 3 2 1
- *none* - - 0 0 0
#--- --------- ------------ ---------- ---- ---- ----
#
# Summary of reported alerts:
#
# alert causes short per num num long
#idx code failure description type cases seqs description
#--- -------- ------- --------------------------- ------- ----- ---- -----------
1 cdsstopn yes* CDS_HAS_STOP_CODON feature 2 1 in-frame stop codon exists 5' of stop position predicted by homology to reference
2 cdsstopp yes* CDS_HAS_STOP_CODON feature 2 1 stop codon in protein-based alignment
3 peptrans yes* PEPTIDE_TRANSLATION_PROBLEM feature 26 1 mat_peptide may not be translated because its parent CDS has a problem
#--- -------- ------- --------------------------- ------- ----- ---- -----------
And finally a long list of the output files created (truncated for brevity):
# Output printed to screen saved in: my3.vadr.log
# List of executed commands saved in: my3.vadr.cmd
# List and description of all output files saved in: my3.vadr.filelist
# esl-seqstat -a output for input fasta file saved in: my3.vadr.seqstat
# 5 column feature table output for passing sequences saved in: my3.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in: my3.vadr.fail.tbl
# list of passing sequences saved in: my3.vadr.pass.list
# list of failing sequences saved in: my3.vadr.fail.list
# list of alerts in the feature tables saved in: my3.vadr.alt.list
# fasta file with passing sequences saved in: my3.vadr.pass.fa
# fasta file with failing sequences saved in: my3.vadr.fail.fa
# per-sequence tabular annotation summary file saved in: my3.vadr.sqa
# per-sequence tabular classification summary file saved in: my3.vadr.sqc
# per-feature tabular summary file saved in: my3.vadr.ftr
# per-model-segment tabular summary file saved in: my3.vadr.sgm
# per-model tabular summary file saved in: my3.vadr.mdl
# per-alert tabular summary file saved in: my3.vadr.alt
# alert count tabular summary file saved in: my3.vadr.alc
# alignment doctoring tabular summary file saved in: my3.vadr.dcr
# ungapped seed alignment summary file (-s) saved in: my3.vadr.sda
# replaced stretches of Ns summary file (-r) saved in: my3.vadr.rpn
#
# All output files created in directory ./my3/
#
# Elapsed time: 00:00:14.73
# hh:mm:ss
#
[ok]
Note that all the output files will be in the newly created my3
directory.
The Summary of classified sequences
listed that two sequences passed and one failed.
The file my3.vadr.pass.list
, lists the two sequences that passed:
MT159720.1
MT308693.1
and my3.vadr.fail.list
lists the one sequence that failed:
MT159720.1/1406-G-to-T
Also, FASTA-formatted sequence files for each the passing and failing
sequences are my3.vadr.pass.fa
and my3.vadr.fail.fa
.
For the two sequences that passed, the annotation is available in the
output my3.vadr.pass.tbl
file and for the sequence that failed the
annotation is in the my3.vadr.fail.tbl
file.
my.vadr.pass.tbl
: (with the middle of the table for each sequence removed for brevity)
>Feature MT159720.1
266 21555 gene
gene ORF1ab
266 13468 CDS
13468 21555
product ORF1ab polyprotein
exception ribosomal slippage
protein_id MT159720.1_1
266 13483 CDS
product ORF1a polyprotein
protein_id MT159720.1_2
266 805 mat_peptide
product leader protein
protein_id MT159720.1_1
...snip...
28274 29533 gene
gene N
28274 29533 CDS
product nucleocapsid phosphoprotein
protein_id MT159720.1_11
29558 29674 gene
gene ORF10
29558 29674 CDS
product ORF10 protein
protein_id MT159720.1_12
29609 29644 stem_loop
note Coronavirus 3' UTR pseudoknot stem-loop 1
29629 29657 stem_loop
note Coronavirus 3' UTR pseudoknot stem-loop 2
29728 29768 stem_loop
note Coronavirus 3' stem-loop II-like motif (s2m)
>Feature MT308693.1
217 21506 gene
gene ORF1ab
217 13419 CDS
13419 21506
product ORF1ab polyprotein
exception ribosomal slippage
protein_id MT308693.1_1
217 13434 CDS
product ORF1a polyprotein
protein_id MT308693.1_2
217 756 mat_peptide
product leader protein
protein_id MT308693.1_1
...snip...
28225 29484 gene
gene N
28225 29484 CDS
product nucleocapsid phosphoprotein
protein_id MT308693.1_11
29509 29625 gene
gene ORF10
29509 29625 CDS
product ORF10 protein
protein_id MT308693.1_12
29560 29595 stem_loop
note Coronavirus 3' UTR pseudoknot stem-loop 1
29580 29608 stem_loop
note Coronavirus 3' UTR pseudoknot stem-loop 2
29679 29719 stem_loop
note Coronavirus 3' stem-loop II-like motif (s2m)
my.vadr.fail.tbl
(with the middle removed for brevity):
>Feature MT159720.1/1406-G-to-T
266 21555 gene
gene ORF1ab
266 13468 CDS
13468 21555
product ORF1ab polyprotein
exception ribosomal slippage
protein_id MT159720.1/1406-G-to-T_1
266 13483 CDS
product ORF1a polyprotein
protein_id MT159720.1/1406-G-to-T_2
266 805 mat_peptide
product leader protein
protein_id MT159720.1/1406-G-to-T_1
...snip...
29629 29657 stem_loop
note Coronavirus 3' UTR pseudoknot stem-loop 2
29728 29768 stem_loop
note Coronavirus 3' stem-loop II-like motif (s2m)
Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (ORF1ab polyprotein) in-frame stop codon exists 5' of stop position predicted by homology to reference [revised to 266..1408 (stop shifted 20147 nt)]
ERROR: CDS_HAS_STOP_CODON: (ORF1ab polyprotein) stop codon in protein-based alignment [stop codon(s) end at position(s) 1143]
ERROR: CDS_HAS_STOP_CODON: (ORF1a polyprotein) in-frame stop codon exists 5' of stop position predicted by homology to reference [revised to 266..1408 (stop shifted 12075 nt)]
ERROR: CDS_HAS_STOP_CODON: (ORF1a polyprotein) stop codon in protein-based alignment [stop codon(s) end at position(s) 1408]
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (leader protein) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp2) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp3) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp4) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (3C-like proteinase) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp6) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp7) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp8) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp9) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp10) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (RNA-dependent RNA polymerase) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (helicase) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (3'-to-5' exonuclease) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (endoRNAse) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (2'-O-ribose methyltransferase) mat_peptide may not be translated because its parent CDS has a problem
ERROR: PEPTIDE_TRANSLATION_PROBLEM: (nsp11) mat_peptide may not be translated because its parent CDS has a problem
Feature table format is described at https://www.ncbi.nlm.nih.gov/Sequin/table.html.
Note that the end of the fail.tbl
file lists ERRORs for
MT159720.1/1406-G-to-T
, the third sequence in the input file, which
is identical to the first one MT159720.1
with position 1406
changed from a G
to a T
to purposefully introduce an early stop
codon for purposes of this example. Early stop codons are one reason
a sequence can fail. Note the CDS_HAS_STOP_CODON
errors at the end
of the feature table. The PEPTIDE_TRANSLATION_PROBLEM
occur for the
mature peptide features which are derived from the two CDS that have
errors.
The annotation information is also available in other files with
different formats, such as the my3/my3.vadr.ftr
file, which may be
easier to parse for some applications. For more information on how to
interpret these results and files, including file formats, see the
vadr documentation pages, linked to from
here,
or the VADR 1.0 paper (reference below).
HERE HERE HERE
The options used in the above command are the recommended set of
options for SARS-CoV-2 annotation because they are currently (as of
Jan 5, 2021) being used by NCBI sequence indexers when they evaluate
an incoming SARS-CoV-2 sequence submission. The options are each briefly
explained in the table below. More information can be found
here,
and the -s
and -r
options are explained more below the table as well.
option | explanation |
---|---|
--mxsize 64000 |
set the maximum allowed dynamic programming matrix size for the cmalign program to 64Gb |
-s |
turn on the seed acceleration heuristic: use the max length ungapped region from blastn to seed the alignment |
-r |
turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible |
--nomisc |
specify that features for failing sequences not be changed to misc_features in the output .tbl file |
--lowsim5term 2 |
set the minimum length for an alert related to sequence of low similarity to the RefSeq at the 5' sequence end to 2 nt |
--lowsim3term 2 |
set the minimum length for an alert related to sequence of low similarity to the RefSeq at the 3' sequence end to 2 nt |
--mkey NC_045512 |
use the model files with prefix NC_045512 in the directory from --mdir
|
--mdir /usr/local/vadr-models-corona-1.1-1 |
specify that the models to use are in directory /usr/local/vadr-models-corona-1.1-2 |
--fstlowthr 0.0 |
specify that all possible frameshifts are reported as alerts (minimum average posterior probability of 0.0 ) |
--alt_fail lowscore,fsthicnf,fstlocnf,insertnn,deletinn |
specify that lowscore, frameshirt alerts (fsthicnf,fstlocnf) and large insertions and deletions (insertnn,deletinn) cause a sequence to fail |
The -r
option replaces stretches of Ns in the input sequences with
the expected nucleotides from the RefSeq. Be cautious, as this
strategy actually replaces Ns in your sequence prior to determination
of annotation. If this is not what you want to do, then do not use
-r
. Using -r
is the current (as of May 7, 2020) practice for NCBI
indexers analyzing incoming SARS-CoV-2 sequences, which is why it is
included as a recommended option here. By doing this the Ns are
assumed to be the 'expected' sequence in the corresponding regions for
purposes of annotation. More information on -r
, including
information on other related options is
here.
The second sequence in the sars-cov2.3.fa
file includes 344 Ns,
most of which are contained within consecutive stretches of 36 and 290 Ns in two
regions, from sequence positions 11487-11522 and 19237-19526.
With the -r
option, v-annotate.pl
replaces the Ns in these two stretches with the
corresponding 'expected' nucleotides from the NC_045512 RefSeq, and then determines annotation
with that doctored sequence. Some information on the Ns in each sequence including on those that were replaced is output to a file with suffix .rpn
(format described here).
For the example above, the my3.vadr.rpn
file looks like this (scroll to right to see full file):
#seq seq seq num_Ns num_Ns fract_Ns ngaps ngaps ngaps ngaps ngaps nnt nnt replaced_coords
#idx name len model p/f tot rp rp tot int rp rp-full rp-part rp-full rp-part seq(S),mdl(M),#rp(N);
#--- ---------------------- ----- --------- ---- ------ ------ -------- ----- ----- ----- ------- ------- ------- ------- ---------------------
1 MT159720.1 29882 NC_045512 PASS 0 0 - 0 0 0 0 0 0 0 -
2 MT308693.1 29788 NC_045512 PASS 344 326 0.948 2 2 2 2 0 326 0 S:11487..11522,M:11536..11571,N:36/36;S:19237..19526,M:19286..19575,N:290/290;
3 MT159720.1/1406-G-to-T 29882 NC_045512 FAIL 0 0 - 0 0 0 0 0 0 0 -
If we run the above command without -r
, the Ns are not replaced and
the MT308693.1
sequence fails because the N regions trigger
alerts/errors to be reported. Specifically, the following three errors
are reported in the fail.tbl
file, amongst other
PEPTIDE_TRANSLATION_PROBLEM
errors for mature peptides (not shown):
ERROR: LOW_FEATURE_SIMILARITY: (ORF1ab) region within annotated feature lacks significant similarity [36 nt overlap b/t low similarity region (11487..11522) and annotated feature (217..21506), strand: +]
ERROR: LOW_FEATURE_SIMILARITY: (ORF1ab) region within annotated feature lacks significant similarity [290 nt overlap b/t low similarity region (19237..19526) and annotated feature (217..21506), strand: +]
ERROR: INDEFINITE_ANNOTATION_END: (ORF1ab polyprotein) protein-based alignment does not extend close enough to nucleotide-based alignment 3' endpoint [2271 > 8 (strand:+ CM:21506 blastx:21506 - 2271, valid stop codon in CM prediction)]
Note that the regions that are causing the LOW_FEATURE_SIMILARITY
errors are stretches of Ns. Using -r
allows some sequences like this that
only fail due to the Ns to pass, and sometimes changes their
annotation to be more accurate (based on the assumption that the replaced Ns
represent the expected nucleotides at the corresponding positions).
The -s
option speeds up v-annotate.pl
by identifying the maximum
length ungapped alignment region using blastn, and fixing that part of
the alignment to accelerate the usually much slower cmalign alignment step.
This heuristic works particularly well for many SARS-CoV-2 sequences
that are highly similar (~99% identical) to the RefSeq NC_045512. If
we run the above example command without -s
, it would require several
minutes per sequence instead of several seconds per sequence. When
-s
is used, an additional output file with suffix .sda
is output
(format described here).
More information on -s
, including information on other related
options is
here.
Identifying and annotating Coronaviridae sequences other than SARS-CoV-2 using a larger VADR model library
The command above will only compare your input sequences to the three
models described above (NC_045512,
NC_045512-del28254 and NC_045512-MW422255). If you want to annotate
v-annotate.pl
to perform the additional step of checking if each
input sequence is more similar to SARS-CoV-2 or a different
Coronaviridae RefSeq, or if your input file contains non-SARS-CoV-2
Coronaviridae sequences, you can download a different model file that includes
the NC_045512 RefSeq and 54 other Coronaviridae RefSeqs from
this FTP page.
After downloading and unpacking that model set,
change the --mkey
option in the example
command above from
--mkey NC_045512
to
--mkey corona
But be aware that changing this option will result in slower running times.
Note also that this is an older model file and it only includes the
NC_045512
model, and not the other 2 models in the 1.1.3-1 set. (I
decided to stop including the 54 other Coronaviridae models in the
1.1.3-1 set to make the file size of the model set smaller and more
portable, but if you're interested in combining these two model sets,
contact me.)
You can use VADR's v-build.pl
program to build additional models of
Coronaviridae GenBank sequences. More information is here.
- The recommended citation for using VADR is: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3