Skip to content

Latest commit

 

History

History
468 lines (402 loc) · 22 KB

ribosensor.md

File metadata and controls

468 lines (402 loc) · 22 KB

ribosensor example usage, command-line options and pass/fail criteria


ribosensor combines ribotyper and another program, rRNA_sensor together to classify and validate ribosomal RNA (rRNA) sequences. It was designed for screening incoming rRNA sequence submissions to GenBank. ribotyper uses profile HMMs to analyze sequences and rRNA_sensor uses the single-sequence based blastn program to analyze sequences. The results of both programs are then combined to determine if a sequence passes or fails.


ribosensor example usage

This example runs ribosensor on a sample file of 16 sequences.

Move into a directory in which you have write permission and execute the following command:

> ribosensor $RIBOSCRIPTSDIR/testfiles/example-16.fa test-rs

Like other Ribovore scripts, ribosensor takes two required command line arguments. Optional arguments are explained below.

The first required argument is the sequence file you want to annotate. The $RIBOSCRIPTSDIR environment variable should be defined in your .bashrc or .cshrc as explained in the installation documentation.

The second required argument is the name of the output subdirectory that you would like ribosensor to create. Output files will be placed in this output directory. If this directory already exists, the program will exit with an error message indicating that you need to either (a) remove the directory before rerunning, or (b) use the -f option, in which case the directory will be overwritten. The command adding -f is:

> ribosensor -f $RIBOSCRIPTSDIR/testfiles/example-16.fa test-rs

You should see something like the following output:

# ribosensor :: analyze ribosomal RNA sequences with profile HMMs and BLASTN
# Ribovore 1.0 (Feb 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Tue Dec 29 17:00:04 2020
# $RIBOBLASTDIR:     /usr/local/src/ribovore-install/ncbi-blast/bin
# $RIBOEASELDIR:     /usr/local/src/ribovore-install/infernal/binaries
# $RIBOINFERNALDIR:  /usr/local/src/ribovore-install/infernal/binaries
# $RIBOSCRIPTSDIR:   /usr/local/src/ribovore-install/ribovore
# $RIBOTIMEDIR:      /usr/bin
# $RRNASENSORDIR:    /usr/local/src/ribovore-install/rRNA_sensor
#
# target sequence input file:   /usr/local/src/ribovore-install/ribovore/testfiles/example-16.fa
# output directory name:        test-rs
# forcing directory overwrite:  yes [-f]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Partitioning sequence file based on sequence lengths         ... done. [    0.0 seconds]
# Running ribotyper on full sequence file                      ... done. [    3.9 seconds]
# Running rRNA_sensor on seqs of length 351..600               ... done. [    0.3 seconds]
# Running rRNA_sensor on seqs of length 601..inf               ... done. [    1.9 seconds]
# Parsing and combining rRNA_sensor and ribotyper output       ... done. [    0.0 seconds]
#
# Outcome counts:
#
# type   total  pass  indexer  submitter  unmapped
# -----  -----  ----  -------  ---------  --------
  RPSP       8     8        0          0         0
  RPSF       1     1        0          0         0
  RFSP       0     0        0          0         0
  RFSF       7     0        0          7         0
#
  *all*     16     9        0          7         0
#
# Per-program error counts:
#
#                      number   fraction
# error                of seqs   of seqs
# -------------------  -------  --------
  CLEAN                      8   0.50000
  S_NoHits                   1   0.06250
  S_TooLong                  1   0.06250
  S_LowScore                 5   0.31250
  S_LowSimilarity            5   0.31250
  R_NoHits                   1   0.06250
  R_UnacceptableModel        5   0.31250
  R_LowCoverage              1   0.06250
#
#
# GPIPE error counts:
#
#                                number   fraction
# error                          of seqs   of seqs
# -----------------------------  -------  --------
  CLEAN                                9   0.56250
  SEQ_HOM_NotSSUOrLSUrRNA              1   0.06250
  SEQ_HOM_LowSimilarity                1   0.06250
  SEQ_HOM_LengthLong                   1   0.06250
  SEQ_HOM_TaxNotExpectedSSUrRNA        5   0.31250
  SEQ_HOM_LowCoverage                  1   0.06250
#
#
# Timing statistics:
#
# stage      num seqs  seq/sec      nt/sec  nt/sec/cpu  total time             
# ---------  --------  -------  ----------  ----------  -----------------------
  ribotyper        16      4.1      5402.0      5402.0              00:00:03.94
  sensor           16      7.4      9872.4      9872.4              00:00:02.15
  total            16      2.5      3329.6      3329.6              00:00:06.38
#
#
# Human readable error-based output saved to file test-rs/test-rs.ribosensor.out
# GPIPE error-based output saved to file test-rs/test-rs.ribosensor.gpipe
#
# List and description of all output files saved in:                                  test-rs.ribosensor.list
# Output printed to screen saved in:                                                  test-rs.ribosensor.log
# List of executed commands saved in:                                                 test-rs.ribosensor.cmd
# summary of rRNA_sensor results saved in:                                            test-rs.ribosensor.sensor.out
# summary of ribotyper results saved in:                                              test-rs.ribosensor.ribo.out
# summary of combined rRNA_sensor and ribotyper results (original errors) saved in:   test-rs.ribosensor.out
# summary of combined rRNA_sensor and ribotyper results (GPIPE errors) saved in:      test-rs.ribosensor.gpipe
#
# All output files created in directory ./test-rs/
#
# Elapsed time:  00:00:06.38
#                hh:mm:ss
# 
[ok]

ribosensor outputs information on each step and how long it takes, followed by a list of output files.

ribosensor first runs ribotyper as reported in the output:

# Running ribotyper on full sequence file                      ... done. [    3.6 seconds]

Next, rRNA_sensor is executed, up to three times, once for sequences 350 nucleotides or shorter, once for sequences of length between 351 and 600 nucleotides and once for sequences more than 600 nucleotides. For each length range, different rRNA_sensor parameters are used determined to work well for that length range. For this example, there are zero sequences less than 351 nucleotides so rRNA_sensor is run only twice, as reported in the output:

# Running rRNA_sensor on seqs of length 351..600               ... done. [    0.2 seconds]
# Running rRNA_sensor on seqs of length 601..inf               ... done. [    1.7 seconds]

Finally, the ribotyper and rRNA_sensor output is parsed to determine if each sequence passes or fails each program. Each sequence fails ribotyper if it receives one or more ribotyper errors (explained more here) and otherwise passes ribotyper. Each sequence fails rRNA_sensor if it receives one or more rRNA_sensor errors (explained more here) and otherwise passes rRNA_sensor.

Based on whether it passes or fails each of the two programs, each sequence is assigned one of four possible 'outcomes':

ribotyper pass or fail? rRNA_sensor pass or fail? outcome
PASS PASS RPSP
PASS FAIL RPSF
FAIL PASS RFSP
FAIL FAIL RFSF

The counts of sequences for each outcome are output. For the above example the outcome counts are:

# Outcome counts:
#
# type   total  pass  indexer  submitter  unmapped
# -----  -----  ----  -------  ---------  --------
  RPSP       8     8        0          0         0
  RPSF       1     1        0          0         0
  RFSP       0     0        0          0         0
  RFSF       7     0        0          7         0
#
  *all*     16     9        0          7         0

The 'pass' column indicates how many sequences in each outcome class 'pass' ribosensor. All RPSP sequences pass ribosensor. Some RPSF and RFSP sequences also pass ribosensor as explained more below. All RFSF sequences fail ribosensor.

The ribosensor output also includes Per-program error counts::

# Per-program error counts:
#
#                      number   fraction
# error                of seqs   of seqs
# -------------------  -------  --------
  CLEAN                      8   0.50000
  S_NoHits                   1   0.06250
  S_TooLong                  1   0.06250
  S_LowScore                 5   0.31250
  S_LowSimilarity            5   0.31250
  R_NoHits                   1   0.06250
  R_UnacceptableModel        5   0.31250
  R_LowCoverage              1   0.06250

The CLEAN row pertains to the RPSP sequences which had zero errors in both ribotyper and rRNA_sensor. The remaining rows are for rRNA_sensor errors (those beginning with S_) and ribotyper errors (those beginning with R_), and are explained in the sections below on rRNA_sensor errors and ribotyper errors.

Finally, information on the number of GenBank errors is output:

# GENBANK error counts:
#
#                                number   fraction
# error                          of seqs   of seqs
# -----------------------------  -------  --------
  CLEAN                                9   0.56250
  SEQ_HOM_NotSSUOrLSUrRNA              1   0.06250
  SEQ_HOM_LowSimilarity                1   0.06250
  SEQ_HOM_LengthLong                   1   0.06250
  SEQ_HOM_TaxNotExpectedSSUrRNA        5   0.31250
  SEQ_HOM_LowCoverage                  1   0.06250

GenBank errors are determined by combining the ribotyper and rRNA_sensor errors as explained here.

rRNA_sensor errors in ribosensor

Any sequence that receives one or more rRNA_sensor errors will be considered to have failed rRNA_sensor.

The possible rRNA_sensor errors are listed in the table below, along with the GenBank errors they associate with.

rRNA_sensor error associated GenBank error cause/explanation
S_NoHits* SEQ_HOM_NotSSUOrLSUrRNA no hits reported ('no' in column 2)
S_NoSimilarity* SEQ_HOM_LowSimilarity coverage (column 5) of best blast hit is < 10%
S_LowSimilarity* SEQ_HOM_LowSimilarity coverage (column 5) of best blast hit is < 80% (is seq len <= 350 nt) or < 86% (if seq len > 350 nt)
S_LowScore* SEQ_HOM_LowSimilarity either id percentage below length-dependent threshold (75%,80%,86%) or E-value above 1e-40 ('imperfect_match` in column 2)
S_BothStrands SEQ_HOM_MisAsBothStrands hits on both strands ('mixed' in column 2)
S_MultipleHits SEQ_HOM_MultipleHits more than 1 hit reported (column 4 value > 1)

As an exception, the first four rRNA_sensor errors (labelled with '*') do not trigger a GenBank error and so are ignored by ribosensor (and so do not cause a sequence to fail ribosensor) if either (a) the sequence is 'RPSF' (passes ribotyper and fails rRNA_sensor) and the -c option is not used with ribosensor or (b) the sequence is 'RFSF' (fails both ribotyper and rRNA_sensor) and R_UnacceptableModel or R_QuestionableModel ribotyper errors are also reported.

ribotyper errors in ribosensor

ribotyper detects and reports up to 15 different types (depending on command-line arguments) of 'unexpected_features' for each sequence, as explained more here. In the context of ribosensor, 10 of these 15 types are detected by ribosensor and cause a sequence to fail ribotyper. They are listed below along with the GenBank errors they associate with.

ribotyper error associated GenBank error cause/explanation
R_NoHits SEQ_HOM_NotSSUOrLSUrRNA no hits reported
R_MultipleFamilies SEQ_HOM_SSUAndLSUrRNA SSU and LSU hits
R_LowScore SEQ_HOM_LowSimilarity bits/position score is < 0.5
R_BothStrands SEQ_HOM_MisAsBothStrands hits on both strands
R_InconsistentHits SEQ_HOM_MisAsHitOrder hits are in different order in sequence and model
R_DuplicateRegion SEQ_HOM_MisAsDupRegion hits overlap by 10 or more model positions
R_UnacceptableModel SEQ_HOM_TaxNotExpectedSSUrRNA best hit is to model other than expected set; 16S expected set: SSU.Archaea, SSU.Bacteria, SSU.Cyanobacteria, SSU.Chloroplast; 18S expected set: SSU.Eukarya;
R_LowCoverage SEQ_HOM_LowCoverage coverage of all hits is < 0.80 (if <= 350nt) or 0.86 (if > 350nt)
R_QuestionableModel+ SEQ_HOM_TaxQuestionableSSUrRNA best hit is to a 'questionable' model (if mode is 16S: SSU.Chloroplast, if mode is 18S, does not apply)
R_MultipleHits+ SEQ_HOM_MultipleHits more than 1 hit reported

As an exception, the final two errors (labelled with '+') do not trigger a GenBank error and so are ignored by ribosensor (and so do not cause a sequence to fail ribosensor) if the sequence is 'RFSP' (fails ribotyper but passes rRNA_sensor).

GenBank errors in ribosensor

A sequence fails ribosensor if it has one or more GenBank errors. Each GenBank error is triggered by one or more rRNA_sensor and/or ribotyper errors as shown in the table below:

GenBank error triggering rRNA_sensor/ribotyper errors
SEQ_HOM_NotSSUOrLSUrRNA S_NoHits*, R_NoHits
SEQ_HOM_LowSimilarity S_NoSimilarity*, S_LowSimilarity*, S_LowScore*, R_LowScore
SEQ_HOM_SSUAndLSUrRNA R_MultipleFamilies
SEQ_HOM_MisAsBothStrands S_BothStrands, R_BothStrands
SEQ_HOM_MisAsHitOrder R_InconsistentHits
SEQ_HOM_MisAsDupRegion R_DuplicateRegion
SEQ_HOM_TaxNotExpectedSSUrRNA R_UnacceptableModel
SEQ_HOM_TaxQuestionableSSUrRNA R_QuestionableModel+
SEQ_HOM_LowCoverage R_LowCoverage
SEQ_HOM_MultipleHits S_MultipleHits, R_MultipleHits+

There are two classes of exceptions in the above table marked by two different superscripts in the table: '*': these rRNA_sensor errors do not trigger a GenBank error if: (a) the sequence is 'RPSF' (passes ribotyper and fails rRNA_sensor) and the -c option is not used with ribosensor or (b) the sequence is 'RFSF' (fails both ribotyper and rRNA_sensor) and R_UnacceptableModel or R_QuestionableModel are also reported. '+': these ribotyper errors do not trigger a GenBank error if sequence is 'RFSP' (fails ribotyper but passes rRNA_sensor).


Using ribosensor for 18S eukaryotic SSU rRNA sequences

The above example run on the example-16.fa file runs ribosensor in its default mode for validation of 16S SSU rRNA sequences from bacteria or archaea. Alterntatively, ribosensor can be run in 18S mode to validate 18S SSU rRNA eukaryotic sequences using the command line option -m 18S. For example, we can run ribosensor on the same input file in 18S mode with the following command:

> ribosensor -m 18S $RIBOSCRIPTSDIR/testfiles/example-16.fa test-rs

In this case, only 4 sequences pass. These are 4 of the 16 sequences in the input file which are eukaryotic SSU sequences:

# GPIPE error counts:
#
#                                number   fraction
# error                          of seqs   of seqs
# -----------------------------  -------  --------
  CLEAN                                4   0.25000
  SEQ_HOM_NotSSUOrLSUrRNA              1   0.06250
  SEQ_HOM_LengthLong                   1   0.06250
  SEQ_HOM_TaxNotExpectedSSUrRNA       10   0.62500
  SEQ_HOM_LowCoverage                  1   0.06250

Currently, 16S (default) and 18S are the only two available modes, but we hope to add additional modes in the future. Contact me at eric.nawrocki@nih.gov if there's a specific mode you'd like to request.


rRNA_sensor blastn databases

rRNA_sensor includes two blastn databases: one with 1267 sequences for 16S SSU rRNA (archaeal and bacterial) and one with 1091 sequences for eukaryotic 18S SSU rRNA. These were created by clustering larger datasets and only keeping one sequence from each cluster as described more in the Ribovore 1.0 paper. Following Ribovore installation, the FASTA files for these databases will be available here:

$RIBOINSTALLDIR/rRNA_sensor/16S_centroids.fa
$RIBOINSTALLDIR/rRNA_sensor/18S_centroids.1091.fa 

ribosensor modelinfo files

The default ribosensor modelinfo file is in $RIBOSCRIPTSDIR/models/ribosensor.modelinfo:

> cat $RIBOSCRIPTSDIR/models/ribosensor.modelinfo
# Model info file for ribosensor.
# Contains information on name of modelinfo file and accept
# file to use with ribotyper and BLAST database to use 
# with rRNA_sensor.
# 
# Each non-# prefixed line has 4 tokens, separated by a space.
# First token is a shorthand name for a mode, these are acceptable
#   arguments to the ribosensor -m option.
# Second token is the rRNA_sensor BLAST DB name for that mode.
# Third token is the ribotyper modelinfo file name for that mode.
# Fourth token is the ribotyper accept file name for that mode.
16S 16S_centroids      ribotyper.modelinfo ribosensor.ssu-arc-bac.accept
18S 18S_centroids.1091 ribotyper.modelinfo ribosensor.ssu-euk.accept

If the -i option is not used, then the ribotyper modelinfo file and ribotyper accept file must exist in the default model directory ($RIBOSCRIPTSDIR/models/). If -i <s> is used then ribosensor will first look for the ribotyper modelinfo and accept models in the default model directory, if it finds files with the names specified in the ribosensor modelinfo file then it will use those, if not it will then look for the ribotyper modelinfo and accept models in the same directory as the file supplied with the -i option. If the files are still not found, ribosensor will exit in error.

Currently, the only two blast databases included with rRNA_sensor are 16S_centroids and 18S_centroids.1091, so these are the only two blast databases ribosensor can be used with. We hope to add more in the future. You can construct your own. Email eric.nawrocki@nih.gov to request a specific gene/tax group for a future release or for help creating your own.

The ribosensor.modelinfo file is the only ribosensor modelinfo file provided with Ribovore, but you can create additional files by copying it and modifying it as necessary.

For more information on ribotyper modelinfo files, see here. For more information on ribotyper accept files, see here.


List of all command-line options

You can see all the available command line options to ribosensor by calling it at the command line with the -h option:

> ribosensor -h
# ribosensor :: analyze ribosomal RNA sequences with profile HMMs and BLASTN
# Ribovore 1.0 (Feb 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:    Tue Dec 29 14:56:08 2020
#
Usage: ribosensor [-options] <fasta file to annotate> <output directory>

basic options:
  -f           : force; if <output directory> exists, overwrite it
  -m <s>       : set mode to <s>, possible <s> values are "16S" and "18S" [16S]
  -c           : assert that sequences are from cultured organisms
  -n <n>       : use <n> CPUs [0]
  -v           : be verbose; output commands to stdout as they're run
  -i <s>       : use model info file <s> instead of default
  --keep       : keep all intermediate files that are removed by default
  --skipsearch : skip search stages, use results from earlier run

rRNA_sensor related options:
  --Sminlen <n>    : set rRNA_sensor minimum sequence length to <n> [100]
  --Smaxlen <n>    : set rRNA_sensor minimum sequence length to <n> [2000]
  --Smaxevalue <x> : set rRNA_sensor maximum E-value to <x> [1e-40]
  --Sminid1 <n>    : set rRNA_sensor minimum percent id for seqs <= 350 nt to <n> [75]
  --Sminid2 <n>    : set rRNA_sensor minimum percent id for seqs [351..600] nt to <n> [80]
  --Sminid3 <n>    : set rRNA_sensor minimum percent id for seqs > 600 nt to <n> [86]
  --Smincovall <n> : set rRNA_sensor minimum coverage for all sequences to <n> [10]
  --Smincov1 <n>   : set rRNA_sensor minimum coverage for seqs <= 350 nt to <n> [80]
  --Smincov2 <n>   : set rRNA_sensor minimum coverage for seqs  > 350 nt to <n> [86]

options for saving sequence subsets to files:
  --psave : save passing sequences to a file

options for parallelizing cmsearch on a compute farm:
  -p         : parallelize ribotyper and rRNA_sensor on a compute farm
  -q <s>     : use qsub info file <s> instead of default
  -s <n>     : seed for random number generator is <n> [181]
  --nkb <n>  : number of KB of sequence for each farm job is <n> [100]
  --wait <n> : allow <n> wall-clock minutes for jobs on farm to finish, including queueing time [500]
  --errcheck : consider any farm stderr output as indicating a job failure

Questions, comments or feature requests? Send a mail to eric.nawrocki@nih.gov.