Skip to content

Commit

Permalink
Merge pull request #109 from gbouras13/dev
Browse files Browse the repository at this point in the history
v0.11.0
  • Loading branch information
gbouras13 authored Dec 4, 2024
2 parents 4881a9b + 97eab51 commit 214833d
Show file tree
Hide file tree
Showing 34 changed files with 467 additions and 170 deletions.
9 changes: 9 additions & 0 deletions HISTORY.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,14 @@
# History

## v0.11.0 Updates (4 December 2024)

* Replaces [kmc](https://github.com/refresh-bio/KMC) with [lrge](https://github.com/mbhall88/lrge) when using `--auto`, a much faster tool designed for the purpose of estimating genome size from long reads. It is very very fast and robust.
* If you input has more than 5000 long reads (it should!), [lrge](https://github.com/mbhall88/lrge) will run in default settings. If it has under this, then it will run a (slightly) more computationally expensive all-vs-all mode with all input reads. In practice, if you have such low read counts, you should take all downstream analysis (inclduing lrge and hybracter) with a lot of caution anyway.
* According to the [preprint](https://www.biorxiv.org/content/10.1101/2024.11.27.625777v1) (and my less exhaustive testing), lrge is more accurate and much faster than kmc, but I would still be careful using it on data that has lower quality than < Q15.
* Nothing else changes - the estimated chromosome size used by Hybracter will still be 80% of the estimate, as it needs to account for plasmids
* Adds `r1041_e82_400bps_bacterial_methylation` as an option for `--medakaModel` thanks to [this issue](https://github.com/gbouras13/hybracter/issues/108).
* Note this won't work if you run `hybracter` on a Mac (as medaka v2 is not available)

## v0.10.1 Updates (14 November 2024)

* Adds retry functionality for dnaapler with 1 thread if there is an error - for some genomes on some systems, [it was observed](https://github.com/gbouras13/hybracter/issues/54) that using the default resources (8 threads, 16GB RAM) will lead to an error in dnaapler.
Expand Down
10 changes: 10 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@
- [Pipeline](#pipeline)
- [Benchmarking](#benchmarking)
- [Recent Updates](#recent-updates)
- [v0.11.0 Updates (4 December 2024)](#v0110-updates-4-december-2024)
- [v0.10.0 Updates (17 October 2024)](#v0100-updates-17-october-2024)
- [v0.9.0 Updates (18 September 2024)](#v090-updates-18-september-2024)
- [Why Would You Run Hybracter?](#why-would-you-run-hybracter)
Expand Down Expand Up @@ -173,6 +174,15 @@ To summarise the conclusions:

## Recent Updates

### v0.11.0 Updates (4 December 2024)

* Replaces [kmc](https://github.com/refresh-bio/KMC) with [lrge](https://github.com/mbhall88/lrge) when using `--auto`, a much faster tool designed for the purpose of estimating genome size from long reads. It is very very fast and robust.
* If your input has more than 5000 long reads (it should!), [lrge](https://github.com/mbhall88/lrge) will run in default settings. If it has under this, then it will run a (slightly) more computationally expensive all-vs-all mode with all input reads. In practice, if you have such low read counts, you should take all downstream analysis (inclduing lrge and hybracter) with a lot of caution anyway.
* According to the [preprint](https://www.biorxiv.org/content/10.1101/2024.11.27.625777v1) (and my less exhaustive testing), lrge is more accurate and much faster than kmc, but I would still be careful using it on data that has lower quality than < Q15.
* Nothing else changes - the estimated chromosome size used by Hybracter will still be 80% of the estimate, as it needs to account for plasmids
* Adds `r1041_e82_400bps_bacterial_methylation` as an option for `--medakaModel` thanks to [this issue](https://github.com/gbouras13/hybracter/issues/108).
* Note this won't work if you run `hybracter` on a Mac (as medaka v2 is not available)

### v0.10.0 Updates (17 October 2024)

* Updates Medaka to v2.0.1, implementing the `--bacteria` option by default.
Expand Down
4 changes: 2 additions & 2 deletions docs/citation.md
Original file line number Diff line number Diff line change
Expand Up @@ -54,5 +54,5 @@ Pypolca:
Snakemake:
* Mölder F, Jablonski KP, Letcher B et al. Sustainable data analysis with Snakemake [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2021, 10:33 (https://doi.org/10.12688/f1000research.29032.1).

KMC:
* Marek Kokot, Maciej Długosz, Sebastian Deorowicz, KMC 3: counting and manipulating k-mer statistics, Bioinformatics, Volume 33, Issue 17, 01 September 2017, Pages 2759–2761, (https://doi.org/10.1093/bioinformatics/btx304).
lrge:
* Hall, M. B.; Coin, L. J. M. Genome Size Estimation from Long Read Overlaps. bioRxiv 2024, 2024.11.27.625777. doi:10.1101/2024.11.27.625777.
2 changes: 1 addition & 1 deletion docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ It scales massively using the embarassingly parallel power of HPC and Snakemake

![Image](hybracter.png)

- A. Reads are quality controlled and subsampled with [Filtlong](https://github.com/rrwick/Filtlong), [Porechop](https://github.com/rrwick/Porechop), [fastp](https://github.com/OpenGene/fastp), [Seqkit](https://github.com/shenwei356/seqkit) and optionally contaminant removal using modules from [trimnami](https://github.com/beardymcjohnface/Trimnami). Chromosome length is optionally estimated with [kmc](https://github.com/refresh-bio/KMC).
- A. Reads are quality controlled and subsampled with [Filtlong](https://github.com/rrwick/Filtlong), [Porechop](https://github.com/rrwick/Porechop), [fastp](https://github.com/OpenGene/fastp), [Seqkit](https://github.com/shenwei356/seqkit) and optionally contaminant removal using modules from [trimnami](https://github.com/beardymcjohnface/Trimnami). Chromosome length is optionally estimated with [lrge](https://github.com/mbhall88/lrge).
- B. Long-read assembly is conducted with [Flye](https://github.com/fenderglass/Flye). Each sample is clssified if the chromosome(s) were assembled (marked as 'complete') or not (marked as 'incomplete') based on the given minimum chromosome length.
- C. For complete isolates, plasmid recovery with [Plassembler](https://github.com/gbouras13/plassembler).
- D. For all isolates, long read polishing with [Medaka](https://github.com/nanoporetech/medaka).
Expand Down
2 changes: 1 addition & 1 deletion docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -93,7 +93,7 @@ The `processing` directory will contain a number of intermediate directories who

##### 2. `qc` directory

This directory will contain the filtered, trimmed and contaminant removed FASTQ reads (where applicable) for each sample, along with the kmc estimated chromosome size (which will be used if you specify `--auto`)
This directory will contain the filtered, trimmed and contaminant removed FASTQ reads (where applicable) for each sample, along with the estimated chromosome size (which will be used if you specify `--auto`)

##### 3. `plassembler` directory

Expand Down
8 changes: 4 additions & 4 deletions docs/run.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,8 +38,8 @@ hybracter hybrid -i <input.csv> -o <output_dir> -t <threads> [other arguments]
* You can turn off pypolca polishing using `--no_pypolca` - I wouldn't though!
* You can change the `--depth_filter` from 0.25x chromosome coverage. This will filter out all Plassembler contigs below this depth.
* By default, `hybracter hybrid` takes the last polishing round as the final assembly (`--logic last`). We would not recommend changing this to `--logic best`, as picking the best polishing round according to ALE with `--logic best` is not guaranteed to give the most accurate assembly (See our [paper](https://doi.org/10.1099/mgen.0.001244)).
* You can estimate the chromosome size with kmc by using `--auto`
* **Note: if you have low quality long read sets (e.g. R9 FAST/HAC or sub Q15 reads), `--auto` is _not_ recommended. Users have reported that it can tend to overestimate the chromosome size as more erroneous 21-mers will be counted by kmc than expected. Please specify a chromosome size for this type of data.**
* You can estimate the chromosome size with lrge by using `--auto`
* **Note: if you have low quality long read sets (e.g. R9 FAST/HAC or sub Q15 reads), use `--auto` with care. Users have reported that it can tend to overestimate the chromosome size with kmc - for lrge, some tests have shown similar (but not so bad) behaviour. This is not really solvable - it is simply a limitation of low quality data!**
* You can set a minimum long-read depth with `--min_depth`. Hybracter will error out if your estimated long-reads coverage is lower than this.
* If you are running hybracter on a Mac, please use `--mac` (or find a Linux machine). This will make sure Medaka v1.8.0 is installed, as newer versions don't work on Macs.
* From v0.10.0, Hybracter will implement the `--bacteria` flag designed specifically for bacterial genomes. See Ryan Wick's [blogpost](https://rrwick.github.io/2024/10/17/medaka-v2.html) for some more explanation and benchmarking. If you do not want to use `--bactera`, please use `--medaka_override` to make sure hybracter uses your `--medakaModel`. This is likely most useful for R9 data.
Expand Down Expand Up @@ -246,8 +246,8 @@ hybracter long -i <input.csv> -o <output_dir> -t <threads> [other arguments]
* You can turn off Medaka polishing using `--no_medaka` - recommended for Q20+ modern Nanopore and PacBio reads
* You can change the `--depth_filter` from 0.25x chromosome coverage. This will filter out all Plassembler contigs below this depth.
* You can force `hybracter long` to pick the last polishing round (not the best according to pyrodigal mean CDS length) with `--logic last`. `hybracter long` defaults to picking the best i.e. `--logic best`.
* You can estimate the chromosome size with kmc by using `--auto`
* **Note: if you have low quality long read sets (e.g. R9 FAST/HAC or sub Q15 reads), `--auto` is _not_ recommended. Users have reported that it can tend to overestimate the chromosome size as more erroneous 21-mers will be counted by kmc than expected. Please specify a chromosome size for this type of data.**
* You can estimate the chromosome size with lrge by using `--auto`
* **Note: if you have low quality long read sets (e.g. R9 FAST/HAC or sub Q15 reads), use `--auto` with care. Users have reported that it can tend to overestimate the chromosome size with kmc - for lrge, some tests have shown similar (but not so bad) behaviour. This is not really solvable - it is simply a limitation of low quality data!**
* You can set a minimum long-read depth with `--min_depth`. Hybracter will error out if your estimated long-reads coverage is lower than this.
* If you are running hybracter on a Mac, please use `--mac` (or find a Linux machine). This will make sure Medaka v1.8.0 is installed, as newer versions don't work on Macs.
* From v0.10.0, Hybracter will implement the `--bacteria` flag designed specifically for bacterial genomes. See Ryan Wick's [blogpost](https://rrwick.github.io/2024/10/17/medaka-v2.html) for some more explanation and benchmarking. If you do not want to use `--bactera`, please use `--medaka_override` to make sure hybracter uses your `--medakaModel`. This is likely most useful for R9 data.
Expand Down
2 changes: 1 addition & 1 deletion hybracter/hybracter.VERSION
Original file line number Diff line number Diff line change
@@ -1 +1 @@
0.10.1
0.11.0
1 change: 1 addition & 0 deletions hybracter/util.py
Original file line number Diff line number Diff line change
Expand Up @@ -216,6 +216,7 @@ def run_snakemake(

all_medaka_models = [

'r1041_e82_400bps_bacterial_methylation', #https://github.com/gbouras13/hybracter/issues/108
# default 1.12.1
'r1041_e82_400bps_sup_v5.0.0',
# r1041 e82 (kit14) consensus
Expand Down
8 changes: 8 additions & 0 deletions hybracter/workflow/envs/lrge.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
name: kmc
channels:
- conda-forge
- bioconda
dependencies:
- lrge <1
- gzip # for zcat

26 changes: 16 additions & 10 deletions hybracter/workflow/hybrid.smk
Original file line number Diff line number Diff line change
Expand Up @@ -72,23 +72,27 @@ MAC = config.args.mac
MEDAKA_OVERRIDE = config.args.medaka_override
EXTRA_PARAMS_FLYE = config.args.extra_params_flye
# flag to add extra arguments to flye
ADD_TO_FLYE = False
ADD_TO_FLYE = False

# Only if user specifies - otherwise None
if EXTRA_PARAMS_FLYE:
ADD_TO_FLYE = True
print(f"As you have used --extra_params_flye the extra parameters {EXTRA_PARAMS_FLYE} will be used with Flye.")
ADD_TO_FLYE = True
print(
f"As you have used --extra_params_flye the extra parameters {EXTRA_PARAMS_FLYE} will be used with Flye."
)

# By default, hybracter (linux) will use --bacteria from v0.10.0 onwards
# By default, hybracter (linux) will use --bacteria from v0.10.0 onwards
# To take advantage of improvements with medaka v2 https://rrwick.github.io/2024/10/17/medaka-v2.html
# not true if 1) --mac or 2) --medaka_override is specified
BACTERIA = True
# not true if 1) --mac or 2) --medaka_override is specified
BACTERIA = True
if MAC:
BACTERIA = False
else:
if MEDAKA_OVERRIDE:
BACTERIA = False
print(f"As you have selected --medaka_override, {MEDAKA_MODEL} will be used with Medaka, not --bacteria ")
print(
f"As you have selected --medaka_override, {MEDAKA_MODEL} will be used with Medaka, not --bacteria "
)

# MAC medaka

Expand All @@ -112,7 +116,9 @@ if MAC:

# for hybracter hybrid
if config.args.single is False:
dictReads = parseSamples(INPUT, False, SUBSAMPLE_DEPTH, DATADIR, MIN_DEPTH, AUTO) # long flag false
dictReads = parseSamples(
INPUT, False, SUBSAMPLE_DEPTH, DATADIR, MIN_DEPTH, AUTO
) # long flag false
SAMPLES = list(dictReads.keys())
# for hybracter hybrid-single
else:
Expand Down Expand Up @@ -163,10 +169,10 @@ include: os.path.join("rules", "completeness", "aggregate.smk")

# for short read polishing --careful
include: os.path.join("rules", "processing", "coverage.smk")

# kmc - needs to be included due to the checkpointing
# lrge - needs to be included due to the checkpointing
include: os.path.join("rules", "processing", "estimate_chromosome.smk")


### medaka vs no medaka
# default - medaka will be run
if config.args.no_medaka is False:
Expand Down
26 changes: 14 additions & 12 deletions hybracter/workflow/long.smk
Original file line number Diff line number Diff line change
Expand Up @@ -70,28 +70,26 @@ MAC = config.args.mac
MEDAKA_OVERRIDE = config.args.medaka_override
EXTRA_PARAMS_FLYE = config.args.extra_params_flye
# flag to add extra arguments to flye
ADD_TO_FLYE = False
ADD_TO_FLYE = False

# Only if user specifies - otherwise None
if EXTRA_PARAMS_FLYE:
ADD_TO_FLYE = True
ADD_TO_FLYE = True
print(f"The extra parameters {EXTRA_PARAMS_FLYE} will be used with Flye ")


# By default, hybracter (linux) will use --bacteria from v0.10.0 onwards
# By default, hybracter (linux) will use --bacteria from v0.10.0 onwards
# To take advantage of improvements with medaka v2 https://rrwick.github.io/2024/10/17/medaka-v2.html
# not true if 1) --mac or 2) --medaka_override is specified
BACTERIA = True
# not true if 1) --mac or 2) --medaka_override is specified
BACTERIA = True
if MAC:
BACTERIA = False
else:
if MEDAKA_OVERRIDE:
BACTERIA = False
print(f"As you have selected --medaka_override, {MEDAKA_MODEL} will be used with Medaka, not --bacteria ")




print(
f"As you have selected --medaka_override, {MEDAKA_MODEL} will be used with Medaka, not --bacteria "
)


# MAC medaka
Expand All @@ -115,7 +113,9 @@ if MAC:

# for hybracter hybrid
if config.args.single is False:
dictReads = parseSamples(INPUT, True, SUBSAMPLE_DEPTH, DATADIR, MIN_DEPTH, AUTO) # long flag true
dictReads = parseSamples(
INPUT, True, SUBSAMPLE_DEPTH, DATADIR, MIN_DEPTH, AUTO
) # long flag true
SAMPLES = list(dictReads.keys())
# for hybracter hybrid-single
else:
Expand All @@ -132,9 +132,11 @@ else:
# Import rules and functions
##############################

# kmc - needs to be included due to the

# lrge - needs to be included due to the
include: os.path.join("rules", "processing", "estimate_chromosome.smk")


# qc and host
# depends on whehter --contaminants has been specified and --skip_qc flag activiated
if config.args.contaminants != "none": # where --contaminants specified
Expand Down
6 changes: 3 additions & 3 deletions hybracter/workflow/rules/assembly/assemble.smk
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ rule assemble:
"""
input:
fastq=os.path.join(dir.out.qc, "{sample}_filt_trim.fastq.gz"),
lr_coverage=os.path.join(dir.out.coverage, "{sample}_lr.txt"), # make sure coverage is run before assembly in case below min_depth
lr_coverage=os.path.join(dir.out.coverage, "{sample}_lr.txt"), # make sure coverage is run before assembly in case below min_depth
output:
fasta=os.path.join(dir.out.assemblies, "{sample}", "assembly.fasta"),
info=os.path.join(dir.out.assemblies, "{sample}", "assembly_info.txt"),
Expand All @@ -20,8 +20,8 @@ rule assemble:
params:
model=FLYE_MODEL,
dir=os.path.join(dir.out.assemblies, "{sample}"),
add_to_flye = ADD_TO_FLYE,
extra_params_flye = EXTRA_PARAMS_FLYE
add_to_flye=ADD_TO_FLYE,
extra_params_flye=EXTRA_PARAMS_FLYE,
threads: config.resources.big.cpu
benchmark:
os.path.join(dir.out.bench, "assemble", "{sample}.txt")
Expand Down
15 changes: 13 additions & 2 deletions hybracter/workflow/rules/assembly/plassembler.smk
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,9 @@ rule plassembler_hybrid:
l=os.path.join(dir.out.qc, "{sample}_filt_trim.fastq.gz"),
r1=os.path.join(dir.out.fastp, "{sample}_1.fastq.gz"),
r2=os.path.join(dir.out.fastp, "{sample}_2.fastq.gz"),
kmc=os.path.join(dir.out.kmc,"{sample}", "{sample}_kmcLOG.txt")
lrge=os.path.join(
dir.out.chrom_size, "{sample}_lrge_estimated_chromosome_size.txt"
),
output:
fasta=os.path.join(
dir.out.plassembler, "{sample}", "plassembler_plasmids.fasta"
Expand All @@ -18,7 +20,16 @@ rule plassembler_hybrid:
version=os.path.join(dir.out.versions, "{sample}", "plassembler.version"),
params:
db=dir.plassemblerdb,
chromlen = lambda wildcards: str(getMinChromLength(kmc_log_path=os.path.join(dir.out.kmc, f"{wildcards.sample}", f"{wildcards.sample}_kmcLOG.txt"), sample=wildcards.sample,auto=AUTO)),
chromlen=lambda wildcards: str(
getMinChromLength(
lrge_path=os.path.join(
dir.out.chrom_size,
f"{wildcards.sample}_lrge_estimated_chromosome_size.txt",
),
sample=wildcards.sample,
auto=AUTO,
)
),
outdir=os.path.join(dir.out.plassembler, "{sample}"),
flye_dir=os.path.join(dir.out.assemblies, "{sample}"),
depth_filter=DEPTH_FILTER,
Expand Down
17 changes: 14 additions & 3 deletions hybracter/workflow/rules/assembly/plassembler_long.smk
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,10 @@ rule plassembler_long:
runs plassembler for long
"""
input:
kmc=os.path.join(dir.out.kmc,"{sample}", "{sample}_kmcLOG.txt"),
l=os.path.join(dir.out.qc, "{sample}_filt_trim.fastq.gz")
lrge=os.path.join(
dir.out.chrom_size, "{sample}_lrge_estimated_chromosome_size.txt"
),
l=os.path.join(dir.out.qc, "{sample}_filt_trim.fastq.gz"),
output:
fasta=os.path.join(
dir.out.plassembler, "{sample}", "plassembler_plasmids.fasta"
Expand All @@ -13,7 +15,16 @@ rule plassembler_long:
version=os.path.join(dir.out.versions, "{sample}", "plassembler.version"),
params:
db=dir.plassemblerdb,
chromlen = lambda wildcards: str(getMinChromLength(kmc_log_path=os.path.join(dir.out.kmc, f"{wildcards.sample}", f"{wildcards.sample}_kmcLOG.txt"), sample=wildcards.sample,auto=AUTO)),
chromlen=lambda wildcards: str(
getMinChromLength(
lrge_path=os.path.join(
dir.out.chrom_size,
f"{wildcards.sample}_lrge_estimated_chromosome_size.txt",
),
sample=wildcards.sample,
auto=AUTO,
)
),
outdir=os.path.join(dir.out.plassembler, "{sample}"),
flye_dir=os.path.join(dir.out.assemblies, "{sample}"),
depth_filter=DEPTH_FILTER,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,10 @@ rule aggregate_finalise_complete:
chrom_pre_polish_fasta=os.path.join(
dir.out.dnaapler, "{sample}_pre_chrom", "{sample}_reoriented.fasta"
),
medaka_rd_1_fasta=os.path.join(dir.out.dnaapler, "{sample}", "{sample}_reoriented.fasta"), # medaka round 1
medaka_rd_1_fasta=os.path.join(
dir.out.dnaapler, "{sample}", "{sample}_reoriented.fasta"
),
# medaka round 1
medaka_rd_2_fasta=os.path.join(
dir.out.intermediate_assemblies, "{sample}", "{sample}_medaka_rd_2.fasta"
),
Expand Down
Loading

0 comments on commit 214833d

Please sign in to comment.