Merge pull request #67 from gbouras13/dev
Dev
gbouras13 authored Apr 2, 2024
2 parents aec7665 + 02b8011 commit 3706a74
Showing 11 changed files with 296 additions and 43 deletions.
7 changes: 6 additions & 1 deletion HISTORY.md
@@ -1,5 +1,10 @@
# History

## v0.7.2 (2 April 2024)

* Adds 'circular=True' to chromosome contig headers where Flye has marked these as such. This bug was introduced in v0.7.0.
* Thanks Nicole Lerminiaux for spotting this.
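
The restored header flag can be checked from the shell. The following is a minimal sketch, assuming only that affected headers carry the literal token `circular=True`; the FASTA content and contig names are invented for illustration and do not reflect hybracter's exact header layout.

```bash
# Hypothetical two-contig FASTA; names and sequences are placeholders.
fasta="$(mktemp)"
cat > "$fasta" <<'EOF'
>chromosome00001 circular=True
ACGTACGT
>plasmid00001
ACGT
EOF

# Count header lines that carry the circular flag.
count_circular() {
    grep -c '^>.*circular=True' "$1"
}

count_circular "$fasta"
```

Running the same function on a `_chromosome.fasta` from a v0.7.0 or v0.7.1 run and again after upgrading is one way to confirm the fix took effect.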

## v0.7.1 (13 March 2024)

* Fixes bug where `hybracter install -d db_dir` would not work as the `-f` parameter was not being passed to Plassembler. Thanks @npbhavya
@@ -122,4 +127,4 @@ Commands:
config Copy the system default config file
citation Print the citation(s) for hybracter
version Print the version for hybracter
```
```
25 changes: 25 additions & 0 deletions README.md
@@ -17,6 +17,8 @@
- [Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies](#hybracter-enabling-scalable-automated-complete-and-accurate-bacterial-genome-assemblies)
- [Table of Contents](#table-of-contents)
- [Quick Start](#quick-start)
- [Mamba/Conda](#mambaconda)
- [Container](#container)
- [Description](#description)
- [Pipeline](#pipeline)
- [Benchmarking](#benchmarking)
@@ -56,6 +58,8 @@

## Quick Start

### Mamba/Conda

`hybracter` is available to install with `pip` or `conda`.

You will need conda or mamba available so `hybracter` can install all the required dependencies.
@@ -80,6 +84,27 @@ hybracter test-hybrid --threads 8
hybracter test-long --threads 8
```

### Container

Alternatively, a Docker/Singularity Linux container image is available for Hybracter (starting from v0.7.1) [here](https://quay.io/repository/gbouras13/hybracter). This will likely be useful for running Hybracter in HPC environments.

* **Note** the container image comes with the database and all environments installed - there is no need to run `hybracter install` or `hybracter test-hybrid`/`hybracter test-long` or to specify a database directory with `-d`.

To install and run v0.7.1 with Singularity:

```bash

IMAGE_DIR="<the directory you want the .sif file to be in>"
singularity pull --dir "$IMAGE_DIR" docker://quay.io/gbouras13/hybracter:0.7.1

containerImage="$IMAGE_DIR/hybracter_0.7.1.sif"

# example command with test fastqs
singularity exec "$containerImage" hybracter hybrid-single -l test_data/Fastqs/test_long_reads.fastq.gz \
-1 test_data/Fastqs/test_short_reads_R1.fastq.gz -2 test_data/Fastqs/test_short_reads_R2.fastq.gz \
-o output_test_singularity -t 4 -c 50000
```
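
When scripting around the container, it can help to derive the `.sif` path from the version and to fail early if the image has not been pulled. This is a rough sketch, assuming the `<name>_<tag>.sif` naming shown in the example above; the helper names `sif_path` and `require_image` are hypothetical and not part of hybracter or Singularity.

```bash
# Build the path that `singularity pull --dir` writes to for a given tag.
# Assumes the default <name>_<tag>.sif naming seen in the example above.
sif_path() {
    local image_dir="$1" version="$2"
    printf '%s/hybracter_%s.sif\n' "$image_dir" "$version"
}

# Fail early with a readable message if the image has not been pulled yet.
require_image() {
    local image="$1"
    if [ ! -f "$image" ]; then
        echo "container image not found: $image" >&2
        return 1
    fi
}

containerImage="$(sif_path "${IMAGE_DIR:-.}" 0.7.1)"
echo "$containerImage"
# require_image "$containerImage" && singularity exec "$containerImage" hybracter --help
```

The final line is left commented out so the sketch can be read and run without Singularity installed.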

## Description

`hybracter` is designed for assembling bacterial isolate genomes using a long read first assembly approach.
59 changes: 59 additions & 0 deletions container/Dockerfile
@@ -0,0 +1,59 @@

#
# hybracter
#

FROM --platform=linux/amd64 ubuntu:20.04
FROM staphb/unicycler:0.5.0

ENV DEBIAN_FRONTEND="noninteractive"

ARG LIBFABRIC_VERSION=1.18.1

# Install required packages and dependencies
RUN apt -y update \
&& apt -y install build-essential wget doxygen gnupg gnupg2 curl apt-transport-https software-properties-common \
git vim gfortran libtool python3-venv ninja-build python3-pip \
libnuma-dev python3-dev \
&& apt -y remove --purge --auto-remove cmake \
&& wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null\
| gpg --dearmor - | tee /etc/apt/trusted.gpg.d/kitware.gpg >/dev/null \
&& apt-add-repository -y "deb https://apt.kitware.com/ubuntu/ jammy-rc main" \
&& apt -y update

# Build and install libfabric
RUN (if [ -e /tmp/build ]; then rm -rf /tmp/build; fi;) \
&& mkdir -p /tmp/build \
&& cd /tmp/build \
&& wget https://github.com/ofiwg/libfabric/archive/refs/tags/v${LIBFABRIC_VERSION}.tar.gz \
&& tar xf v${LIBFABRIC_VERSION}.tar.gz \
&& cd libfabric-${LIBFABRIC_VERSION} \
&& ./autogen.sh \
&& ./configure \
&& make -j 16 \
&& make install

#
# Install miniforge
#
RUN set -eux ; \
curl -LO https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh ; \
bash ./Miniforge3-* -b -p /opt/miniforge3 -s ; \
rm -rf ./Miniforge3-*
ENV PATH="/opt/miniforge3/bin:$PATH"
#
# Install conda environment
#
ARG HYBRACTER_VERSION=0.7.1
RUN set -eux ; \
mamba install -y -c conda-forge -c bioconda -c defaults \
hybracter=${HYBRACTER_VERSION}=pyhdfd78af_0
ENV PATH="/opt/miniforge3/bin:$PATH"
RUN conda clean -af -y

RUN hybracter install --medaka
RUN hybracter test-hybrid --threads 8
RUN hybracter test-long --threads 16 --conda-create-envs-only
RUN rm -rf hybracter_out
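
# The Dockerfile above exposes HYBRACTER_VERSION as a build argument, so a build
# for a given release could be scripted roughly as follows. This is a sketch only:
# the local image tag "hybracter:<version>" and the helper name docker_build_cmd
# are hypothetical, and because the conda build string (pyhdfd78af_0) is pinned
# in the Dockerfile, other versions may not resolve without editing it.

```bash
# Print (rather than run) a docker build command for a chosen hybracter version.
# HYBRACTER_VERSION is the ARG declared in container/Dockerfile; the local
# image tag "hybracter:<version>" is a hypothetical name chosen for this sketch.
docker_build_cmd() {
    local version="$1"
    printf 'docker build --build-arg HYBRACTER_VERSION=%s -t hybracter:%s -f container/Dockerfile .\n' \
        "$version" "$version"
}

docker_build_cmd 0.7.1
# To actually build, run the printed command from the repository root.
```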


67 changes: 34 additions & 33 deletions docs/benchmarking.md

Large diffs are not rendered by default.

Binary file modified docs/hybracter.png
8 changes: 4 additions & 4 deletions docs/index.md
@@ -13,12 +13,12 @@ It scales massively using the embarrassingly parallel power of HPC and Snakemake

![Image](hybracter.png)

- A. Reads are quality controlled with [Filtlong](https://github.com/rrwick/Filtlong), [Porechop](https://github.com/rrwick/Porechop), [fastp](https://github.com/OpenGene/fastp) and optionally contaminant removal using modules from [trimnami](https://github.com/beardymcjohnface/Trimnami).
- A. Reads are quality controlled and subsampled with [Filtlong](https://github.com/rrwick/Filtlong), [Porechop](https://github.com/rrwick/Porechop), [fastp](https://github.com/OpenGene/fastp), [Seqkit](https://github.com/shenwei356/seqkit) and optionally contaminant removal using modules from [trimnami](https://github.com/beardymcjohnface/Trimnami).
- B. Long-read assembly is conducted with [Flye](https://github.com/fenderglass/Flye). Each sample is classified by whether the chromosome(s) were assembled (marked as 'complete') or not (marked as 'incomplete'), based on the given minimum chromosome length.
- C. For complete isolates, plasmid recovery with [Plassembler](https://github.com/gbouras13/plassembler).
- D. For all isolates, long read polishing with [Medaka](https://github.com/nanoporetech/medaka).
- E. For complete isolates, the chromosome is reorientated to begin with the dnaA gene with [dnaapler](https://github.com/gbouras13/dnaapler).
- F. For all isolates, if short reads are provided, short read polishing with [Polypolish](https://github.com/rrwick/Polypolish) and [pypolca](https://github.com/gbouras13/pypolca).
- E. For complete isolates, circularised chromosome(s) are reorientated to begin with the dnaA gene with [dnaapler](https://github.com/gbouras13/dnaapler).
- F. For all isolates, if short reads are provided, short-read polishing with [Polypolish](https://github.com/rrwick/Polypolish) and [Pypolca](https://github.com/gbouras13/pypolca) depending on [short-read depth](https://www.biorxiv.org/content/10.1101/2024.03.07.584013v1).
- G. For all isolates, assessment of all assemblies with [ALE](https://github.com/sc932/ALE) for `hybracter hybrid` or [Pyrodigal](https://github.com/althonos/pyrodigal) for `hybracter long`.
- H. The best assembly is selected and output along with final assembly statistics.
- H. The best assembly is selected (the last is taken for `hybracter hybrid`) and output along with final assembly statistics.
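
Step B's complete/incomplete decision can be sketched in a few lines. This is an illustration only, under the assumption that a sample counts as 'complete' when at least one contig reaches the minimum chromosome length; hybracter's actual rule may involve further checks, and the function name and lengths are hypothetical.

```bash
# Classify a sample from its contig lengths (bp) and a minimum chromosome length.
# Assumption: a sample counts as 'complete' if at least one contig reaches the
# threshold; hybracter's real rule may involve more checks.
classify_sample() {
    local min_chrom_length="$1"
    shift
    local len
    for len in "$@"; do
        if [ "$len" -ge "$min_chrom_length" ]; then
            echo "complete"
            return 0
        fi
    done
    echo "incomplete"
}

classify_sample 50000 120000 4500   # one contig above the 50 kb threshold
classify_sample 50000 30000 4500    # no contig reaches the threshold
```

Everything downstream (plasmid recovery, reorientation) branches on this label, which is why the minimum chromosome length is worth setting deliberately.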

21 changes: 21 additions & 0 deletions docs/install.md
@@ -41,6 +41,27 @@ pip install -e .
hybracter --help
```

## Docker/Singularity

A Docker/Singularity Linux container image is available for Hybracter (starting from v0.7.1) [here](https://quay.io/repository/gbouras13/hybracter). This will likely be useful for running Hybracter in HPC environments.

* **Note** the container image comes with the database and all environments installed - there is no need to run `hybracter install` or `hybracter test-hybrid`/`hybracter test-long` or to specify a database directory with `-d`.

To install and run v0.7.1 with Singularity:

```bash

IMAGE_DIR="<the directory you want the .sif file to be in>"
singularity pull --dir "$IMAGE_DIR" docker://quay.io/gbouras13/hybracter:0.7.1

containerImage="$IMAGE_DIR/hybracter_0.7.1.sif"

# example command with test fastqs
singularity exec "$containerImage" hybracter hybrid-single -l test_data/Fastqs/test_long_reads.fastq.gz \
-1 test_data/Fastqs/test_short_reads_R1.fastq.gz -2 test_data/Fastqs/test_short_reads_R2.fastq.gz \
-o output_test_singularity -t 4 -c 50000
```
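
For more than one sample, the same command can be generated per sample and reviewed before anything is run. The sketch below only prints commands. It reuses the flags from the example above (`-l`, `-1`, `-2`, `-o`, `-t`, `-c`) with the same values, while the helper name and the `<reads_dir>/<sample>_{long,R1,R2}.fastq.gz` layout are hypothetical.

```bash
# Print the singularity command for one sample instead of running it.
# Flags (-l, -1, -2, -o, -t, -c) are those used in the example above; the
# directory layout <reads_dir>/<sample>_{long,R1,R2}.fastq.gz is hypothetical.
hybrid_single_cmd() {
    local image="$1" reads_dir="$2" sample="$3"
    printf 'singularity exec %s hybracter hybrid-single -l %s -1 %s -2 %s -o %s -t 4 -c 50000\n' \
        "$image" \
        "$reads_dir/${sample}_long.fastq.gz" \
        "$reads_dir/${sample}_R1.fastq.gz" \
        "$reads_dir/${sample}_R2.fastq.gz" \
        "output_${sample}"
}

for sample in sampleA sampleB; do
    hybrid_single_cmd hybracter_0.7.1.sif reads "$sample"
done
```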

# Database Installation

**Note: users (and CI) have reported errors where this does not work due to an MD5 file error. This is caused by issues with Zenodo (where the database lives). Waiting a few minutes and trying again usually works.**
137 changes: 137 additions & 0 deletions docs/output.md
@@ -0,0 +1,137 @@
`hybracter` creates a number of output files in different formats.

# Main Output

The main outputs are in the `FINAL_OUTPUT` directory.

This directory will include:

## Summary File

1. `hybracter_summary.tsv` file. This gives the summary statistics for your assemblies with the following columns:

|Sample |Complete (True or False) | Total_assembly_length | Number_of_contigs | Most_accurate_polishing_round | Longest_contig_length | Longest_contig_coverage| Number_circular_plasmids |
|--------|-----------------------|-------------------------|-------------------|--------|--|--|--|
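
As an illustration of working with this file, the sketch below lists samples that were not marked complete. It assumes a tab-separated file with the sample name in column 1 and `True`/`False` in column 2, as in the table above; the exact header text of column 2 and all values in the example rows are invented.

```bash
# Hypothetical summary file with the eight columns listed above; values are invented.
summary="$(mktemp)"
printf 'Sample\tComplete\tTotal_assembly_length\tNumber_of_contigs\tMost_accurate_polishing_round\tLongest_contig_length\tLongest_contig_coverage\tNumber_circular_plasmids\n' > "$summary"
printf 'sampleA\tTrue\t4800000\t3\texample_round\t4700000\t85\t2\n' >> "$summary"
printf 'sampleB\tFalse\t4600000\t27\texample_round\t900000\t40\t0\n' >> "$summary"

# Print the names of samples whose second column is not "True".
incomplete_samples() {
    awk -F '\t' 'NR > 1 && $2 != "True" { print $1 }' "$1"
}

incomplete_samples "$summary"
```

Pointing `incomplete_samples` at a real `FINAL_OUTPUT/hybracter_summary.tsv` gives a quick list of isolates that may need more data or a lower minimum chromosome length.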


## Summary Assemblies

2. The `complete` and `incomplete` directories will contain the summary assemblies for all samples.

All samples that are denoted by hybracter to be complete will have 5 outputs in the `complete` directory:

* `sample`_summary.tsv containing the summary statistics for that sample.
* `sample`_per_contig_stats.tsv containing the contig names, lengths, GC% and whether the contig is circular.
* `sample`_final.fasta containing the final assembly for that sample.
* `sample`_chromosome.fasta containing only the final chromosome(s) assembly for that sample.
* `sample`_plasmid.fasta containing only the final plasmid(s) assembly for that sample. Note this may be empty. If this is empty, then that sample had no plasmids.
* **Note** - there may be a number of non-circular "plasmid" contigs. Be careful assuming these are truly plasmids and check the Plassembler output in `supplementary_results`. These may be assembly artefacts that should be excluded, or indicate that your long- and short-read sets aren't well matched!

All samples that are denoted by hybracter to be incomplete will have 3 outputs in the `incomplete` directory:

* `sample`_summary.tsv containing the summary statistics for that sample.
* `sample`_per_contig_stats.tsv containing the contig names, lengths, GC% and whether the contig is circular.
* `sample`_final.fasta containing the final assembly for that sample.
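
The naming scheme above lends itself to a quick completeness check after a run. This is a minimal sketch that follows the file names listed in this section; the helper names are hypothetical, and the first argument is whatever directory was passed to `-o`.

```bash
# List the files expected for one sample under FINAL_OUTPUT, following the
# naming described above. The first argument is the directory passed to -o.
expected_outputs() {
    local outdir="$1" sample="$2" status="$3"   # status: complete | incomplete
    local base="$outdir/FINAL_OUTPUT/$status/$sample"
    echo "${base}_summary.tsv"
    echo "${base}_per_contig_stats.tsv"
    echo "${base}_final.fasta"
    if [ "$status" = "complete" ]; then
        echo "${base}_chromosome.fasta"
        echo "${base}_plasmid.fasta"
    fi
}

# Report any expected file that is missing from a finished run.
missing_outputs() {
    local f
    expected_outputs "$@" | while IFS= read -r f; do
        [ -e "$f" ] || echo "missing: $f"
    done
}

expected_outputs hybracter_out sampleA incomplete
```

`missing_outputs hybracter_out sampleA complete` would then print one line per absent file, and nothing if all five are present.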

# Other Outputs

## `supplementary_results` directory

The `supplementary_results` directory contains a number of supplementary results that you might find useful:

##### 1. `comparisons` directory

* This directory contains visual representations comparing the effect of each polishing round for each sample using a modified version of Ryan Wick's [compare_assemblies.py script](https://github.com/rrwick/Perfect-bacterial-genome-tutorial/blob/main/scripts/compare_assemblies.py). An example is below:

```
contig_1 37368-37398: ACCATTTTTGTTTTATTTTTTGTAAAGACAC
contig_1 37368-37397: ACCATTTTTGTTTTA-TTTTTGTAAAGACAC
                                     *
contig_1 43247-43277: CAACGTTGTTTTCCCTGAGCCTAAATAACCA
contig_1 43246-43276: CAACGTTGTTTTCCCCGAGCCTAAATAACCA
                                     *
contig_1 44658-44688: CTTGATCTTTATCTATGATTTCATTAATACT
contig_1 44657-44687: CTTGATCTTTATCTACGATTTCATTAATACT
                                     *
```

* If this file is empty, there are no differences between assemblies.
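
To summarise a comparison file without reading it line by line, the marker lines can be counted. A minimal sketch, assuming each differing region is followed by a line containing only spaces and `*` characters as in the example above; the helper name is hypothetical and the sample data is taken from that example.

```bash
# Sample comparison output taken from the example above (star lines mark a difference).
comparison="$(mktemp)"
cat > "$comparison" <<'EOF'
contig_1 37368-37398: ACCATTTTTGTTTTATTTTTTGTAAAGACAC
contig_1 37368-37397: ACCATTTTTGTTTTA-TTTTTGTAAAGACAC
                                     *
contig_1 43247-43277: CAACGTTGTTTTCCCTGAGCCTAAATAACCA
contig_1 43246-43276: CAACGTTGTTTTCCCCGAGCCTAAATAACCA
                                     *
contig_1 44658-44688: CTTGATCTTTATCTATGATTTCATTAATACT
contig_1 44657-44687: CTTGATCTTTATCTACGATTTCATTAATACT
                                     *
EOF

# Count marker lines, i.e. lines made up only of spaces and at least one star.
count_differences() {
    grep -c '^[ *]*\*[ *]*$' "$1"
}

count_differences "$comparison"
```

This counts differing regions rather than individual bases, since one marker line may hold several stars.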

##### 2. `intermediate_chromosome_assemblies` directory

* This directory contains intermediate chromosome assemblies for all polishing rounds for each sample.

##### 3. `flye_individual_summaries` directory

* This directory contains individual sample summaries from Flye for all samples.

##### 4. `plassembler_individual_summaries` directory

* This directory contains individual sample summaries from Plassembler for each sample.

##### 5. `plassembler_all_assembly_summary` directory

* This directory contains individual sample summaries from Plassembler for all samples.

##### 6. `pyrodigal_mean_length_summaries` directory

* For `long`, this directory contains pyrodigal mean CDS length summary files for each polishing round for each sample.

##### 7. `pyrodigal_mean_length_summaries_plassembler` directory

* For `long`, this directory contains pyrodigal mean CDS length summary files for each polishing round for each sample for the plassembler assembled plasmids.

## `processing` directory

The `processing` directory will contain a number of intermediate directories whose information you might find useful:

##### 1. `flye` directory

* This directory will contain the Flye assembly output and associated intermediate files for each sample

##### 2. `qc` directory

* This directory will contain the filtered, trimmed and contaminant removed FASTQ reads (where applicable) for each sample

##### 3. `plassembler` directory

* This directory will contain the Plassembler assembly output and associated intermediate files for each sample

##### 4. `chrom_pre_polish` directory

* This directory will contain the pre-polished chromosome assemblies for complete isolates

##### 5. `complete` and `incomplete` directories

* These directories will contain the medaka, polypolish and pypolca polishing and dnaapler reorientation intermediate files for each sample

##### 6. `ale_out_files` directory

* For `hybrid`, this directory will contain intermediate ALE files for each assembly polishing round. These are internal to `hybracter` (so can be ignored).

##### 7. `ale_scores_complete` and `ale_scores_incomplete` directories

* These directories will contain ALE scores for each assembly polishing round.

## `stderr` directory

* This will contain log files for each program in `hybracter`.

## `versions` directory

* This will contain the specific versions used for each program in `hybracter`.

## `flags` directory

* This will contain flag files internal to `hybracter` (so can be ignored).

## `completeness` directory

* This will contain flag files used to determine completeness. These are internal to `hybracter` (so can be ignored).

## `benchmarks` directory

* This will contain benchmarking time and memory usage statistics for each program in hybracter.

3 changes: 2 additions & 1 deletion docs/reasons.md
@@ -1,7 +1,8 @@
Why Use `hybracter`
---------
1. If you want the best possible _automated_ long read only or hybrid bacterial isolate genome assembly.
* Benchmarking TBA, but I hope it is more accurate than Unicycler, much faster and just as easy to run.
* Hybracter hybrid is more accurate than Unicycler, much faster and just as easy to run.
* Hybracter long _solves_ the long-read only plasmid assembly problem and produces the most accurate automated chromosome assembly.
2. If you need to assemble many (e.g. 10+) bacterial isolates as efficiently as possible.
* I wrote it originally because I needed to assemble more than 200 long-read sequenced isolates.
3. If you want all the information from the assembly pipeline, such as whether your polishing probably improved the genome, whether your assembly was likely complete, and how many plasmids you probably assembled.
12 changes: 8 additions & 4 deletions docs/run.md
@@ -34,9 +34,10 @@ hybracter hybrid -i <input.csv> -o <output_dir> -t <threads> [other arguments]
* `--skip_qc` will skip all read QC steps.
* You can change the `--medakaModel` (all available options are listed in `hybracter hybrid -h`)
* You can change the `--flyeModel` (all available options are listed in `hybracter hybrid -h`)
* You can turn off Medaka polishing using `--no_medaka`
* You can turn off pypolca polishing using `--no_pypolca`
* By default, `hybracter hybrid` takes the last polishing round as the final assembly (`--logic last`). We would recommend never changing this, as picking the best polishing round according to ALE with `--logic best` is not guaranteed to give the most accurate assembly.
* You can turn off Medaka polishing using `--no_medaka` - recommended for Q20+ modern Nanopore reads
* You can turn off pypolca polishing using `--no_pypolca` - I wouldn't though!
* You can change the `--depth_filter` from 0.25x chromosome coverage. This will filter out all Plassembler contigs below this depth.
* By default, `hybracter hybrid` takes the last polishing round as the final assembly (`--logic last`). We would not recommend changing this to `--logic best`, as picking the best polishing round according to ALE with `--logic best` is not guaranteed to give the most accurate assembly (See our [preprint](https://www.biorxiv.org/content/10.1101/2024.03.07.584013v1)).
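
The `--depth_filter` option is easier to reason about with a worked case. The sketch below assumes the cutoff is the filter value multiplied by the chromosome depth and that a contig exactly at the cutoff is kept; hybracter's own comparison may differ, and the function name and depths are hypothetical.

```bash
# Decide whether a plasmid contig survives the depth filter.
# Assumption: the cutoff is depth_filter x chromosome depth and a contig
# exactly at the cutoff is kept; hybracter's own comparison may differ.
passes_depth_filter() {
    local contig_depth="$1" chrom_depth="$2" depth_filter="${3:-0.25}"
    awk -v c="$contig_depth" -v chr="$chrom_depth" -v f="$depth_filter" \
        'BEGIN { exit !(c >= f * chr) }'
}

# With 80x chromosome depth and the default 0.25, the cutoff is 20x.
if passes_depth_filter 12 80; then echo "keep"; else echo "drop"; fi    # 12 < 20
if passes_depth_filter 35 80; then echo "keep"; else echo "drop"; fi    # 35 >= 20
```

Lowering the filter keeps more low-depth contigs, which may be worthwhile for low-copy plasmids but also lets through more assembly artefacts.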

```bash

@@ -201,9 +202,12 @@ hybracter long -i <input.csv> -o <output_dir> -t <threads> [other arguments]
* `--skip_qc` will skip all read QC steps.
* You can change the `--medakaModel` (all available options are listed in `hybracter long -h`)
* You can change the `--flyeModel` (all available options are listed in `hybracter long -h`)
* You can turn off Medaka polishing using `--no_medaka`
* You can turn off Medaka polishing using `--no_medaka` - recommended for Q20+ modern Nanopore and PacBio reads
* You can change the `--depth_filter` from 0.25x chromosome coverage. This will filter out all Plassembler contigs below this depth.
* You can force `hybracter long` to pick the last polishing round (not the best according to pyrodigal mean CDS length) with `--logic last`. `hybracter long` defaults to picking the best i.e. `--logic best`.
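
The difference between `--logic best` and `--logic last` can be illustrated on a small table of polishing rounds. This sketch assumes that a longer pyrodigal mean CDS length is treated as better; the function name, round names and numbers are invented for illustration.

```bash
# Pick the final polishing round from "round<TAB>mean_CDS_length" lines on stdin.
# Assumption: a longer mean CDS length is treated as better (fewer broken genes);
# round names and numbers below are invented.
pick_round() {
    local logic="$1"
    if [ "$logic" = "last" ]; then
        tail -n 1 | cut -f 1
    else
        # Sort numerically by mean CDS length, highest first, and take the top row.
        sort -t "$(printf '\t')" -k 2,2nr | head -n 1 | cut -f 1
    fi
}

rounds="$(printf 'flye\t905.2\nmedaka_round_1\t951.7\nmedaka_round_2\t949.3\n')"
printf '%s\n' "$rounds" | pick_round best
printf '%s\n' "$rounds" | pick_round last
```

In this made-up case the two settings disagree: `best` selects the first Medaka round, while `last` selects the second.
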
```bash
Usage: hybracter long [OPTIONS] [SNAKE_ARGS]...

Binary file modified img/hybracter.png