REVICA-STRM

STRM = Streamlined

This is a fork of the original REVICA, built for faster assembly of viral genomes from metagenomic sequencing data. Other changes include streamlined (combined) output files and quality-of-life features for sequencing data that spans multiple organisms or references (e.g. influenza, with 8 segments).

REVICA is a reference-based viral consensus genome assembly pipeline for some of the most common respiratory viruses. REVICA currently supports genome assembly of:

- Enterovirus (EV)
- Seasonal human coronavirus (HCOV)
- Human metapneumovirus (HMPV)
- Human respiratory syncytial virus (HRSV)
- Human parainfluenza virus (HPIV)
- Measles morbillivirus (MeV)
- Influenza A virus (IAV)
- Influenza B virus (IBV)
- Human adenovirus (HAdV)
- SARS-CoV-2

How it works

REVICA-STRM assembles genomes from raw FASTQ files in two fundamental steps:

1. Create an initial, rough consensus sequence. The query reads are aligned to every entry in the chosen database, and the best-mapping reference is used to build an early consensus sequence that captures the query's general predicted features.
2. Generate the final consensus for the query. The query is realigned to the consensus from the previous step, and this alignment is used to generate the final assembly. This second pass serves to confirm unique features of the query that may have been suppressed during the first assembly.

The important outputs are found in the output directory (specified by --output), in the final_files subdirectory:

- The final consensus sequence (filename contains assembly.fa).
- The BAM file(s) generated from the first alignment, located in the align_to_selected_ref directory. These are useful for gauging how much of your raw FASTQ data could be mapped to a known reference sequence.
- A MultiQC report file named <run name>_multiqc.html.

Databases

This repository includes two example reference databases usable for assembly:

general respiratory viruses

assets/ref.fa: a general purpose database containing sequences for the aforementioned supported viruses

influenza only

assets/flu.fasta: the database used in the Andersen Lab's avian influenza project. This database has been supplemented with recent influenza A H5, H1N1 and H3N1 strains, as well as Influenza B, across all available species collected from 2023-2024.

Workflow

Usage

REVICA is built to run on the cloud via Nextflow and Docker. Cloning this repo is only necessary if you want the databases, scripts, or test data.

Install Nextflow
Install Docker
- We recommend running Docker while signed out, because access to certain containers is sporadically blocked when signed in. This issue is under active investigation.
- Ensure the Docker client is running before starting the pipeline.

Using the test data included in this repo:

Clone the repository to get the example data and database
git clone git@github.com:epiliper/nf-rev.git

cd nf-rev

Run the pipeline with the example data

nextflow run greninger-lab/REVICA-STRM -r main -latest --input example_samplesheet.csv --output example_output -profile docker --db assets/flu.fasta

After the run has finished, the final output files can be found in <work_folder, default=run>/final_files.

If not using example data, replace the FASTQ files, sample sheet, and database with whatever files you want to use (see below).

Using other FASTQ files and databases:

Cloning this repo is not necessary unless you need the example data.

Download the FASTQ files for the samples or SRA projects you need. The SRA toolkit's fasterq-dump utility can be used to download FASTQ files from SRA projects.
In the directory with the downloaded FASTQs, use this repo's included script bin/fastq_dir_to_samplesheet.py to create a REVICA sample sheet.

Example command:
```
python3 fastq_dir_to_samplesheet.py <dir with fastqs> -r1 _1.fastq.gz -r2 _2.fastq.gz sras_to_run.csv     
```
where -r1 and -r2 specify the suffixes of input read1 and read2 files to look for, respectively. For single-end data, just use -r1.

Run REVICA and point it to your sample sheet:

nextflow run greninger-lab/REVICA-STRM -r main -latest --input sras_to_run.csv --output example_output -profile docker --db assets/flu.fasta
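
The suffix matching used when building a sample sheet can be pictured with a short Python sketch. This is not the code of bin/fastq_dir_to_samplesheet.py: the function names are hypothetical, and the column names `sample`, `fastq_1`, and `fastq_2` are an assumption, so check assets/samplesheet.csv for the layout the pipeline actually expects.

```python
import csv
import io
from pathlib import Path


def pair_fastqs(filenames, r1_suffix="_1.fastq.gz", r2_suffix=None):
    """Group FASTQ file names into samples keyed by the name before the suffix."""
    samples = {}
    for name in sorted(filenames):
        base = Path(name).name
        if base.endswith(r1_suffix):
            sample = base[: -len(r1_suffix)]
            samples.setdefault(sample, {"fastq_1": "", "fastq_2": ""})["fastq_1"] = name
        elif r2_suffix and base.endswith(r2_suffix):
            sample = base[: -len(r2_suffix)]
            samples.setdefault(sample, {"fastq_1": "", "fastq_2": ""})["fastq_2"] = name
    # A sample without a read-1 file cannot be run, so drop it.
    return {s: f for s, f in samples.items() if f["fastq_1"]}


def write_samplesheet(samples, handle):
    """Write one row per sample; the header names are assumed, not confirmed."""
    writer = csv.writer(handle)
    writer.writerow(["sample", "fastq_1", "fastq_2"])
    for sample, files in sorted(samples.items()):
        writer.writerow([sample, files["fastq_1"], files["fastq_2"]])


files = ["SRR1_1.fastq.gz", "SRR1_2.fastq.gz", "SRR2_1.fastq.gz", "notes.txt"]
samples = pair_fastqs(files, "_1.fastq.gz", "_2.fastq.gz")
buffer = io.StringIO()
write_samplesheet(samples, buffer)
print(buffer.getvalue())
```

Here SRR1 is treated as paired-end and SRR2 as single-end (its read 2 column stays empty), while files that match neither suffix are ignored.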

Note

This repo includes a Python script, bin/pull_sra.py, that downloads FASTQ files for the SRA project numbers listed in a CSV spreadsheet and creates the associated REVICA sample sheet.

To use it, download the SRA toolkit, add it to $PATH, and populate a CSV file with SRAs following the format of assets/example_sras.csv.

Then run python3 pull_sra.py <sra csv> <name for REVICA samplesheet>

Removing host (human) reads

Inputs to REVICA-STRM can optionally be filtered with Kraken2 and a user-supplied Kraken2 database. This database should consist of the host/contaminant genomes you want removed from downstream analysis.

To use this, run REVICA-STRM with the --run_kraken2 and --kraken2_variants_host_filter flags, and point the --kraken2_db argument to your Kraken2 database.

Note

To create the database we recommend for removing human reads, see the instructions in making_kraken2_human_db.md.
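
As a rough illustration, the host-filtering flags can be combined with the run command from the Usage section. The Python sketch below only assembles the argument list; `build_revica_command` is a hypothetical helper, and `/path/to/kraken2_human_db` is a placeholder for wherever your Kraken2 database lives.

```python
import shlex


def build_revica_command(samplesheet, output_dir, db=None, kraken2_db=None):
    """Return the argument list for a REVICA-STRM run (hypothetical helper)."""
    cmd = [
        "nextflow", "run", "greninger-lab/REVICA-STRM",
        "-r", "main", "-latest",
        "--input", samplesheet,
        "--output", output_dir,
        "-profile", "docker",
    ]
    if db:
        # Use a custom reference database instead of the bundled one.
        cmd += ["--db", db]
    if kraken2_db:
        # Host filtering needs all three Kraken2 options together.
        cmd += [
            "--run_kraken2",
            "--kraken2_variants_host_filter",
            "--kraken2_db", kraken2_db,
        ]
    return cmd


cmd = build_revica_command(
    "sras_to_run.csv",
    "example_output",
    db="assets/flu.fasta",
    kraken2_db="/path/to/kraken2_human_db",  # placeholder path
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)` to launch the pipeline from a script.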

Options

| Option | Explanation |
|---|---|
| `--input` | samplesheet in csv format with fastq information |
| `--output` | output directory (default: revica_output) |
| `--db` | (multi)fasta file to overwrite the bundled viral database |
| `--run_name` | name for the summary tsv file (default: 'run') |
| `--skip_fastp` | skip adapters and reads trimming using fastp (default: false) |
| `--run_kraken2` | run Kraken2 for classifying reads (default: false) |
| `--kraken2_db` | Kraken2 database for reads classification, needs to be specified when using `--run_kraken2` |
| `--kraken2_variants_host_filter` | use reads that didn't map to the kraken2 database for downstream consensus calling |
| `--save_kraken2_unclassified_reads` | save reads that didn't map to the specified kraken2 database |
| `--save_kraken2_classified_reads` | save reads that map to the specified kraken2 database |
| `--trim_len` | minimum read length to keep (default: 50) |
| `--save_trimmed_reads` | save trimmed fastq |
| `--save_temp_files` | save temporary files |
| `--sample` | downsample fastq to a certain fraction or number of reads |
| `--ref_min_median_cov` | minimum median coverage on a reference for consensus assembly (default: 3) |
| `--ref_min_genome_cov` | minimum reference coverage percentage for consensus assembly (default: 60%) |
| `--ivar_consensus_t` | minimum frequency threshold to call consensus (default: 0.6) |
| `--ivar_consensus_q` | minimum quality score threshold to call consensus (default: 20) |
| `--ivar_consensus_m` | minimum depth to call consensus (default: 5) |
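
To give a feel for how the three `--ivar_consensus_*` thresholds interact, here is a deliberately simplified Python sketch for a single genome position. It is not iVar's implementation: the `call_base` function is hypothetical, and iVar's own handling of mixed positions and indels is more involved than returning `N`.

```python
from collections import Counter


def call_base(bases, quals, min_freq=0.6, min_qual=20, min_depth=5):
    """Call one consensus base from the reads covering a single position.

    bases: the base observed in each read; quals: the Phred quality of each base.
    Returns "N" when the position fails the depth or frequency threshold.
    """
    # Ignore base calls below the quality threshold (role of --ivar_consensus_q).
    kept = [b for b, q in zip(bases, quals) if q >= min_qual]
    # Too few usable reads: make no call (role of --ivar_consensus_m).
    if len(kept) < min_depth:
        return "N"
    base, count = Counter(kept).most_common(1)[0]
    # The most common base must reach the frequency threshold (--ivar_consensus_t).
    return base if count / len(kept) >= min_freq else "N"


print(call_base("AAAAAG", [30] * 6))                 # clear majority at good depth
print(call_base("AAAGGG", [30] * 6))                 # 50/50 split, below 0.6
print(call_base("AAAA", [30] * 4))                   # depth 4, below 5
print(call_base("AAAAAGGG", [30] * 5 + [10] * 3))    # low-quality G calls dropped
```

Raising the depth or frequency threshold makes the consensus more conservative, at the cost of more uncalled positions in low-coverage regions.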

Usage notes

Samplesheet example: assets/samplesheet.csv
You can create a samplesheet using the bundled Python script: python bin/fastq_dir_to_samplesheet.py fastq_dir samplesheet_name.csv
Memory and CPU usage for pipeline processes can be adjusted in conf/base.config
Process arguments can be adjusted in conf/modules.config
You can use your own reference(s) for consensus genome assembly by specifying the --db parameter followed by your fasta file.
- reference header format: >reference_accession reference_tag reference_header_info
- it's important to tag the fasta sequences for the same species or gene segment with the same name or abbreviation in the header; otherwise the pipeline will generate a consensus genome for every reference whose median coverage in the first alignment exceeds the specified threshold (default 3).
- REVICA works with segmented viral genomes: keep the different gene segments as separate sequences and tag each one in the reference fasta file.
If you are using Docker on Linux, see Docker's Linux post-installation steps (especially cgroup swap limit capabilities support) for configuring Linux to work better with Docker.
By default, Docker has access to all of the host's RAM and CPU resources. If you are using macOS, however, go to Settings -> Resources in Docker Desktop and make sure enough resources are allocated to Docker containers.
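
The reference header format and tagging rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not REVICA's code: `parse_header` and `select_references` are hypothetical names, "genome coverage" is assumed to mean the percentage of positions with at least one read, the best reference within a tag is assumed to be the one with the highest median coverage, and the accessions are made up.

```python
from statistics import median


def parse_header(header):
    """Split '>accession tag free text' into its three parts."""
    fields = header.lstrip(">").split(maxsplit=2)
    if len(fields) < 2:
        raise ValueError(f"header needs an accession and a tag: {header!r}")
    info = fields[2] if len(fields) > 2 else ""
    return fields[0], fields[1], info


def select_references(depths_by_header, min_median_cov=3, min_genome_cov=60.0):
    """Keep, per tag, one reference that passes both coverage thresholds.

    depths_by_header maps a FASTA header to its per-position read depths.
    """
    best = {}
    for header, depths in depths_by_header.items():
        accession, tag, _ = parse_header(header)
        med = median(depths)
        # Assumption: coverage percentage = share of positions with depth > 0.
        pct = 100.0 * sum(d > 0 for d in depths) / len(depths)
        if med < min_median_cov or pct < min_genome_cov:
            continue
        # Assumption: within a tag, prefer the higher median coverage.
        if tag not in best or med > best[tag][1]:
            best[tag] = (accession, med)
    return {tag: accession for tag, (accession, _) in best.items()}


depths = {
    ">ACC1 IAV_HA Influenza A segment 4": [10, 12, 9, 11],
    ">ACC2 IAV_HA Influenza A segment 4, other strain": [4, 5, 3, 4],
    ">ACC3 IAV_NA Influenza A segment 6": [0, 0, 1, 0],
}
print(select_references(depths))
```

Because ACC1 and ACC2 share the tag IAV_HA, only one of them is kept. If they carried different tags, both would pass the thresholds and each would yield its own consensus, which is the behavior the tagging note warns about.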

Contact

For bug reports, please raise an issue.
