Merge pull request #109 from gbouras13/dev

v0.11.0
gbouras13 · Dec 4, 2024 · 214833d · 214833d
2 parents 4881a9b + 97eab51
commit 214833d
Show file tree

Hide file tree

Showing 34 changed files with 467 additions and 170 deletions.
diff --git a/HISTORY.md b/HISTORY.md
@@ -1,5 +1,14 @@
 # History
 
+## v0.11.0 Updates (4 December 2024)
+
+* Replaces [kmc](https://github.com/refresh-bio/KMC) with [lrge](https://github.com/mbhall88/lrge) when using `--auto`, a much faster tool designed for the purpose of estimating genome size from long reads. It is very very fast and robust. 
+    * If you input has more than 5000 long reads (it should!), [lrge](https://github.com/mbhall88/lrge) will run in default settings. If it has under this, then it will run a (slightly) more computationally expensive all-vs-all mode with all input reads. In practice, if you have such low read counts, you should take all downstream analysis (inclduing lrge and hybracter) with a lot of caution anyway. 
+    * According to the [preprint](https://www.biorxiv.org/content/10.1101/2024.11.27.625777v1) (and my less exhaustive testing), lrge is more accurate and much faster than kmc, but I would still be careful using it on data that has lower quality than < Q15. 
+* Nothing else changes - the estimated chromosome size used by Hybracter will still be 80% of the estimate, as it needs to account for plasmids
+* Adds `r1041_e82_400bps_bacterial_methylation` as an option for `--medakaModel` thanks to [this issue](https://github.com/gbouras13/hybracter/issues/108).
+* Note this won't work if you run `hybracter` on a Mac (as medaka v2 is not available)
+
 ## v0.10.1 Updates (14 November 2024)
 
 * Adds retry functionality for dnaapler with 1 thread if there is an error - for some genomes on some systems, [it was observed](https://github.com/gbouras13/hybracter/issues/54) that using the default resources (8 threads, 16GB RAM) will lead to an error in dnaapler. 

diff --git a/README.md b/README.md
@@ -30,6 +30,7 @@
   - [Pipeline](#pipeline)
   - [Benchmarking](#benchmarking)
   - [Recent Updates](#recent-updates)
+    - [v0.11.0 Updates (4 December 2024)](#v0110-updates-4-december-2024)
     - [v0.10.0 Updates (17 October 2024)](#v0100-updates-17-october-2024)
     - [v0.9.0 Updates (18 September 2024)](#v090-updates-18-september-2024)
   - [Why Would You Run Hybracter?](#why-would-you-run-hybracter)
@@ -173,6 +174,15 @@ To summarise the conclusions:
 
 ## Recent Updates
 
+### v0.11.0 Updates (4 December 2024)
+
+* Replaces [kmc](https://github.com/refresh-bio/KMC) with [lrge](https://github.com/mbhall88/lrge) when using `--auto`, a much faster tool designed for the purpose of estimating genome size from long reads. It is very very fast and robust. 
+    * If your input has more than 5000 long reads (it should!), [lrge](https://github.com/mbhall88/lrge) will run in default settings. If it has under this, then it will run a (slightly) more computationally expensive all-vs-all mode with all input reads. In practice, if you have such low read counts, you should take all downstream analysis (inclduing lrge and hybracter) with a lot of caution anyway. 
+    * According to the [preprint](https://www.biorxiv.org/content/10.1101/2024.11.27.625777v1) (and my less exhaustive testing), lrge is more accurate and much faster than kmc, but I would still be careful using it on data that has lower quality than < Q15. 
+* Nothing else changes - the estimated chromosome size used by Hybracter will still be 80% of the estimate, as it needs to account for plasmids
+* Adds `r1041_e82_400bps_bacterial_methylation` as an option for `--medakaModel` thanks to [this issue](https://github.com/gbouras13/hybracter/issues/108).
+* Note this won't work if you run `hybracter` on a Mac (as medaka v2 is not available)
+
 ### v0.10.0 Updates (17 October 2024)
 
 * Updates Medaka to v2.0.1, implementing the `--bacteria` option by default.

diff --git a/docs/citation.md b/docs/citation.md
@@ -54,5 +54,5 @@ Pypolca:
 Snakemake:
 * Mölder F, Jablonski KP, Letcher B et al. Sustainable data analysis with Snakemake [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2021, 10:33 (https://doi.org/10.12688/f1000research.29032.1).
 
-KMC:
-* Marek Kokot, Maciej Długosz, Sebastian Deorowicz, KMC 3: counting and manipulating k-mer statistics, Bioinformatics, Volume 33, Issue 17, 01 September 2017, Pages 2759–2761, (https://doi.org/10.1093/bioinformatics/btx304).
+lrge:
+* Hall, M. B.; Coin, L. J. M. Genome Size Estimation from Long Read Overlaps. bioRxiv 2024, 2024.11.27.625777. doi:10.1101/2024.11.27.625777.
diff --git a/docs/index.md b/docs/index.md
@@ -13,7 +13,7 @@ It scales massively using the embarassingly parallel power of HPC and Snakemake
 
 ![Image](hybracter.png)
 
-- A. Reads are quality controlled and subsampled with [Filtlong](https://github.com/rrwick/Filtlong), [Porechop](https://github.com/rrwick/Porechop), [fastp](https://github.com/OpenGene/fastp), [Seqkit](https://github.com/shenwei356/seqkit) and optionally contaminant removal using modules from [trimnami](https://github.com/beardymcjohnface/Trimnami). Chromosome length is optionally estimated with [kmc](https://github.com/refresh-bio/KMC).
+- A. Reads are quality controlled and subsampled with [Filtlong](https://github.com/rrwick/Filtlong), [Porechop](https://github.com/rrwick/Porechop), [fastp](https://github.com/OpenGene/fastp), [Seqkit](https://github.com/shenwei356/seqkit) and optionally contaminant removal using modules from [trimnami](https://github.com/beardymcjohnface/Trimnami). Chromosome length is optionally estimated with [lrge](https://github.com/mbhall88/lrge).
 - B. Long-read assembly is conducted with [Flye](https://github.com/fenderglass/Flye). Each sample is clssified if the chromosome(s) were assembled (marked as 'complete') or not (marked as 'incomplete') based on the given minimum chromosome length.
 - C. For complete isolates, plasmid recovery with [Plassembler](https://github.com/gbouras13/plassembler).
 - D. For all isolates, long read polishing with [Medaka](https://github.com/nanoporetech/medaka).

diff --git a/docs/output.md b/docs/output.md
@@ -93,7 +93,7 @@ The `processing` directory will contain a number of intermediate directories who
 
 ##### 2. `qc` directory
 
- This directory will contain the filtered, trimmed and contaminant removed FASTQ reads (where applicable) for each sample, along with the kmc estimated chromosome size (which will be used if you specify `--auto`)
+ This directory will contain the filtered, trimmed and contaminant removed FASTQ reads (where applicable) for each sample, along with the estimated chromosome size (which will be used if you specify `--auto`)
 
 ##### 3. `plassembler` directory
 

diff --git a/docs/run.md b/docs/run.md
@@ -38,8 +38,8 @@ hybracter hybrid -i <input.csv> -o <output_dir> -t <threads>  [other arguments]
 * You can turn off pypolca polishing using `--no_pypolca` - I wouldn't though!
 * You can change the `--depth_filter` from 0.25x chromosome coverage. This will filter out all Plassembler contigs below this depth.
 * By default, `hybracter hybrid` takes the last polishing round as the final assembly  (`--logic last`). We would not recommend changing this to `--logic best`, as picking the best polishing round according to ALE with  `--logic best` is not guaranteed to give the most accurate assembly (See our [paper](https://doi.org/10.1099/mgen.0.001244)).
-* You can estimate the chromosome size with kmc by using `--auto`
-    * **Note: if you have low quality long read sets (e.g. R9 FAST/HAC or sub Q15 reads), `--auto` is _not_ recommended. Users have reported that it can tend to overestimate the chromosome size as more erroneous 21-mers will be counted by kmc than expected. Please specify a chromosome size for this type of data.**
+* You can estimate the chromosome size with lrge by using `--auto`
+    * **Note: if you have low quality long read sets (e.g. R9 FAST/HAC or sub Q15 reads), use `--auto` with care. Users have reported that it can tend to overestimate the chromosome size with kmc - for lrge, some tests have shown similar (but not so bad) behaviour. This is not really solvable - it is simply a limitation of low quality data!**
 * You can set a minimum long-read depth with `--min_depth`. Hybracter will error out if your estimated long-reads coverage is lower than this.
 * If you are running hybracter on a Mac, please use `--mac` (or find a Linux machine). This will make sure Medaka v1.8.0 is installed, as newer versions don't work on Macs.
 * From v0.10.0, Hybracter will implement the `--bacteria` flag designed specifically for bacterial genomes. See Ryan Wick's [blogpost](https://rrwick.github.io/2024/10/17/medaka-v2.html) for some more explanation and benchmarking. If you do not want to use `--bactera`, please use `--medaka_override` to make sure hybracter uses your `--medakaModel`. This is likely most useful for R9 data.
@@ -246,8 +246,8 @@ hybracter long -i <input.csv> -o <output_dir> -t <threads> [other arguments]
 * You can turn off Medaka polishing using `--no_medaka` - recommended for Q20+ modern Nanopore and PacBio reads
 * You can change the `--depth_filter` from 0.25x chromosome coverage. This will filter out all Plassembler contigs below this depth.
 * You can force `hybracter long` to pick the last polishing round (not the best according to pyrodigal mean CDS length) with `--logic last`. `hybracter long` defaults to picking the best i.e. `--logic best`.
-* You can estimate the chromosome size with kmc by using `--auto`
-    * **Note: if you have low quality long read sets (e.g. R9 FAST/HAC or sub Q15 reads), `--auto` is _not_ recommended. Users have reported that it can tend to overestimate the chromosome size as more erroneous 21-mers will be counted by kmc than expected. Please specify a chromosome size for this type of data.**
+* You can estimate the chromosome size with lrge by using `--auto`
+    * **Note: if you have low quality long read sets (e.g. R9 FAST/HAC or sub Q15 reads), use `--auto` with care. Users have reported that it can tend to overestimate the chromosome size with kmc - for lrge, some tests have shown similar (but not so bad) behaviour. This is not really solvable - it is simply a limitation of low quality data!**
 * You can set a minimum long-read depth with `--min_depth`. Hybracter will error out if your estimated long-reads coverage is lower than this.
 * If you are running hybracter on a Mac, please use `--mac` (or find a Linux machine). This will make sure Medaka v1.8.0 is installed, as newer versions don't work on Macs.
 * From v0.10.0, Hybracter will implement the `--bacteria` flag designed specifically for bacterial genomes. See Ryan Wick's [blogpost](https://rrwick.github.io/2024/10/17/medaka-v2.html) for some more explanation and benchmarking. If you do not want to use `--bactera`, please use `--medaka_override` to make sure hybracter uses your `--medakaModel`. This is likely most useful for R9 data.

diff --git a/hybracter/hybracter.VERSION b/hybracter/hybracter.VERSION
@@ -1 +1 @@
-0.10.1
+0.11.0
diff --git a/hybracter/util.py b/hybracter/util.py
@@ -216,6 +216,7 @@ def run_snakemake(
 
 all_medaka_models = [
 
+    'r1041_e82_400bps_bacterial_methylation', #https://github.com/gbouras13/hybracter/issues/108
     # default 1.12.1
     'r1041_e82_400bps_sup_v5.0.0',
     # r1041 e82 (kit14) consensus

diff --git a/hybracter/workflow/envs/lrge.yaml b/hybracter/workflow/envs/lrge.yaml
@@ -0,0 +1,8 @@
+name: kmc
+channels:
+    - conda-forge
+    - bioconda
+dependencies: 
+    - lrge <1
+    - gzip # for zcat
+
diff --git a/hybracter/workflow/hybrid.smk b/hybracter/workflow/hybrid.smk
@@ -72,23 +72,27 @@ MAC = config.args.mac
 MEDAKA_OVERRIDE = config.args.medaka_override
 EXTRA_PARAMS_FLYE = config.args.extra_params_flye
 # flag to add extra arguments to flye
-ADD_TO_FLYE =  False
+ADD_TO_FLYE = False
 
 # Only if user specifies - otherwise None
 if EXTRA_PARAMS_FLYE:
-    ADD_TO_FLYE =  True
-    print(f"As you have used --extra_params_flye the extra parameters {EXTRA_PARAMS_FLYE} will be used with Flye.")
+    ADD_TO_FLYE = True
+    print(
+        f"As you have used --extra_params_flye the extra parameters {EXTRA_PARAMS_FLYE} will be used with Flye."
+    )
 
-# By default, hybracter (linux) will use --bacteria from v0.10.0 onwards 
+# By default, hybracter (linux) will use --bacteria from v0.10.0 onwards
 # To take advantage of improvements with medaka v2 https://rrwick.github.io/2024/10/17/medaka-v2.html
-# not true if 1) --mac or 2) --medaka_override is specified 
-BACTERIA = True 
+# not true if 1) --mac or 2) --medaka_override is specified
+BACTERIA = True
 if MAC:
     BACTERIA = False
 else:
     if MEDAKA_OVERRIDE:
         BACTERIA = False
-        print(f"As you have selected --medaka_override, {MEDAKA_MODEL} will be used with Medaka, not --bacteria ")
+        print(
+            f"As you have selected --medaka_override, {MEDAKA_MODEL} will be used with Medaka, not --bacteria "
+        )
 
 # MAC medaka
 
@@ -112,7 +116,9 @@ if MAC:
 
 # for hybracter hybrid
 if config.args.single is False:
-    dictReads = parseSamples(INPUT, False, SUBSAMPLE_DEPTH, DATADIR, MIN_DEPTH, AUTO)  # long flag false
+    dictReads = parseSamples(
+        INPUT, False, SUBSAMPLE_DEPTH, DATADIR, MIN_DEPTH, AUTO
+    )  # long flag false
     SAMPLES = list(dictReads.keys())
 # for hybracter hybrid-single
 else:
@@ -163,10 +169,10 @@ include: os.path.join("rules", "completeness", "aggregate.smk")
 
 # for short read polishing --careful
 include: os.path.join("rules", "processing", "coverage.smk")
-
-# kmc - needs to be included due to the checkpointing
+# lrge - needs to be included due to the checkpointing
 include: os.path.join("rules", "processing", "estimate_chromosome.smk")
 
+
 ### medaka vs no medaka
 # default - medaka will be run
 if config.args.no_medaka is False:

diff --git a/hybracter/workflow/long.smk b/hybracter/workflow/long.smk
@@ -70,28 +70,26 @@ MAC = config.args.mac
 MEDAKA_OVERRIDE = config.args.medaka_override
 EXTRA_PARAMS_FLYE = config.args.extra_params_flye
 # flag to add extra arguments to flye
-ADD_TO_FLYE =  False
+ADD_TO_FLYE = False
 
 # Only if user specifies - otherwise None
 if EXTRA_PARAMS_FLYE:
-    ADD_TO_FLYE =  True
+    ADD_TO_FLYE = True
     print(f"The extra parameters {EXTRA_PARAMS_FLYE} will be used with Flye ")
 
 
-# By default, hybracter (linux) will use --bacteria from v0.10.0 onwards 
+# By default, hybracter (linux) will use --bacteria from v0.10.0 onwards
 # To take advantage of improvements with medaka v2 https://rrwick.github.io/2024/10/17/medaka-v2.html
-# not true if 1) --mac or 2) --medaka_override is specified 
-BACTERIA = True 
+# not true if 1) --mac or 2) --medaka_override is specified
+BACTERIA = True
 if MAC:
     BACTERIA = False
 else:
     if MEDAKA_OVERRIDE:
         BACTERIA = False
-        print(f"As you have selected --medaka_override, {MEDAKA_MODEL} will be used with Medaka, not --bacteria ")
-
-
-
-
+        print(
+            f"As you have selected --medaka_override, {MEDAKA_MODEL} will be used with Medaka, not --bacteria "
+        )
 
 
 # MAC medaka
@@ -115,7 +113,9 @@ if MAC:
 
 # for hybracter hybrid
 if config.args.single is False:
-    dictReads = parseSamples(INPUT, True, SUBSAMPLE_DEPTH, DATADIR, MIN_DEPTH, AUTO)  # long flag true
+    dictReads = parseSamples(
+        INPUT, True, SUBSAMPLE_DEPTH, DATADIR, MIN_DEPTH, AUTO
+    )  # long flag true
     SAMPLES = list(dictReads.keys())
 # for hybracter hybrid-single
 else:
@@ -132,9 +132,11 @@ else:
 # Import rules and functions
 ##############################
 
-# kmc - needs to be included due to the
+
+# lrge - needs to be included due to the
 include: os.path.join("rules", "processing", "estimate_chromosome.smk")
 
+
 # qc and host
 # depends on whehter --contaminants has been specified and --skip_qc flag activiated
 if config.args.contaminants != "none":  # where --contaminants specified

diff --git a/hybracter/workflow/rules/assembly/assemble.smk b/hybracter/workflow/rules/assembly/assemble.smk
@@ -5,7 +5,7 @@ rule assemble:
     """
     input:
         fastq=os.path.join(dir.out.qc, "{sample}_filt_trim.fastq.gz"),
-        lr_coverage=os.path.join(dir.out.coverage, "{sample}_lr.txt"), # make sure coverage is run before assembly in case below min_depth
+        lr_coverage=os.path.join(dir.out.coverage, "{sample}_lr.txt"),  # make sure coverage is run before assembly in case below min_depth
     output:
         fasta=os.path.join(dir.out.assemblies, "{sample}", "assembly.fasta"),
         info=os.path.join(dir.out.assemblies, "{sample}", "assembly_info.txt"),
@@ -20,8 +20,8 @@ rule assemble:
     params:
         model=FLYE_MODEL,
         dir=os.path.join(dir.out.assemblies, "{sample}"),
-        add_to_flye = ADD_TO_FLYE,
-        extra_params_flye = EXTRA_PARAMS_FLYE
+        add_to_flye=ADD_TO_FLYE,
+        extra_params_flye=EXTRA_PARAMS_FLYE,
     threads: config.resources.big.cpu
     benchmark:
         os.path.join(dir.out.bench, "assemble", "{sample}.txt")

diff --git a/hybracter/workflow/rules/assembly/plassembler.smk b/hybracter/workflow/rules/assembly/plassembler.smk
@@ -9,7 +9,9 @@ rule plassembler_hybrid:
         l=os.path.join(dir.out.qc, "{sample}_filt_trim.fastq.gz"),
         r1=os.path.join(dir.out.fastp, "{sample}_1.fastq.gz"),
         r2=os.path.join(dir.out.fastp, "{sample}_2.fastq.gz"),
-        kmc=os.path.join(dir.out.kmc,"{sample}", "{sample}_kmcLOG.txt")
+        lrge=os.path.join(
+            dir.out.chrom_size, "{sample}_lrge_estimated_chromosome_size.txt"
+        ),
     output:
         fasta=os.path.join(
             dir.out.plassembler, "{sample}", "plassembler_plasmids.fasta"
@@ -18,7 +20,16 @@ rule plassembler_hybrid:
         version=os.path.join(dir.out.versions, "{sample}", "plassembler.version"),
     params:
         db=dir.plassemblerdb,
-        chromlen = lambda wildcards: str(getMinChromLength(kmc_log_path=os.path.join(dir.out.kmc, f"{wildcards.sample}", f"{wildcards.sample}_kmcLOG.txt"), sample=wildcards.sample,auto=AUTO)),
+        chromlen=lambda wildcards: str(
+            getMinChromLength(
+                lrge_path=os.path.join(
+                    dir.out.chrom_size,
+                    f"{wildcards.sample}_lrge_estimated_chromosome_size.txt",
+                ),
+                sample=wildcards.sample,
+                auto=AUTO,
+            )
+        ),
         outdir=os.path.join(dir.out.plassembler, "{sample}"),
         flye_dir=os.path.join(dir.out.assemblies, "{sample}"),
         depth_filter=DEPTH_FILTER,

diff --git a/hybracter/workflow/rules/assembly/plassembler_long.smk b/hybracter/workflow/rules/assembly/plassembler_long.smk
@@ -3,8 +3,10 @@ rule plassembler_long:
     runs plassembler for long
     """
     input:
-        kmc=os.path.join(dir.out.kmc,"{sample}", "{sample}_kmcLOG.txt"),
-        l=os.path.join(dir.out.qc, "{sample}_filt_trim.fastq.gz")
+        lrge=os.path.join(
+            dir.out.chrom_size, "{sample}_lrge_estimated_chromosome_size.txt"
+        ),
+        l=os.path.join(dir.out.qc, "{sample}_filt_trim.fastq.gz"),
     output:
         fasta=os.path.join(
             dir.out.plassembler, "{sample}", "plassembler_plasmids.fasta"
@@ -13,7 +15,16 @@ rule plassembler_long:
         version=os.path.join(dir.out.versions, "{sample}", "plassembler.version"),
     params:
         db=dir.plassemblerdb,
-        chromlen = lambda wildcards: str(getMinChromLength(kmc_log_path=os.path.join(dir.out.kmc, f"{wildcards.sample}", f"{wildcards.sample}_kmcLOG.txt"), sample=wildcards.sample,auto=AUTO)),
+        chromlen=lambda wildcards: str(
+            getMinChromLength(
+                lrge_path=os.path.join(
+                    dir.out.chrom_size,
+                    f"{wildcards.sample}_lrge_estimated_chromosome_size.txt",
+                ),
+                sample=wildcards.sample,
+                auto=AUTO,
+            )
+        ),
         outdir=os.path.join(dir.out.plassembler, "{sample}"),
         flye_dir=os.path.join(dir.out.assemblies, "{sample}"),
         depth_filter=DEPTH_FILTER,

diff --git a/hybracter/workflow/rules/finalise/select_best_assembly_long.smk b/hybracter/workflow/rules/finalise/select_best_assembly_long.smk
@@ -59,7 +59,10 @@ rule aggregate_finalise_complete:
         chrom_pre_polish_fasta=os.path.join(
             dir.out.dnaapler, "{sample}_pre_chrom", "{sample}_reoriented.fasta"
         ),
-        medaka_rd_1_fasta=os.path.join(dir.out.dnaapler, "{sample}", "{sample}_reoriented.fasta"), # medaka round 1
+        medaka_rd_1_fasta=os.path.join(
+            dir.out.dnaapler, "{sample}", "{sample}_reoriented.fasta"
+        ),
+        # medaka round 1
         medaka_rd_2_fasta=os.path.join(
             dir.out.intermediate_assemblies, "{sample}", "{sample}_medaka_rd_2.fasta"
         ),