Merge branch 'snp_optim' into 'master'

Normalize variants database to optimize database reads while maintaining write performance. See merge request algorithm/megalodon!11
nanoporetech · Oct 3, 2019 · 2b2ab73 · 2b2ab73
2 parents fba9e63 + 118d832
commit 2b2ab73
Show file tree

Hide file tree

Showing 23 changed files with 1,333 additions and 997 deletions.
diff --git a/README.rst b/README.rst
@@ -6,10 +6,10 @@
 Megalodon
 """""""""
 
-Megalodon provides "basecalling augmentation" for raw nanopore sequencing reads, including direct, reference-guided SNP and modified base calling.
+Megalodon provides "basecalling augmentation" for raw nanopore sequencing reads, including direct, reference-guided sequence variant and modified base calling.
 
 Megalodon anchors the information rich neural network basecalling output to a reference genome.
-Variants, modified bases or alternative canonical bases, are then proposed and scored in order to produce highly-accurate reference anchored modified base or SNP calls.
+Variants, modified bases or alternative canonical bases, are then proposed and scored in order to produce highly-accurate reference anchored modified base or sequence variant calls.
 
 Detailed documentation for all ``megalodon`` arguments and algorithms can be found on the `megalodon documentation page <https://nanoporetech.github.io/megalodon/>`_.
 
@@ -48,11 +48,11 @@ Megalodon is accessed via the command line interface ``megalodon`` command.
     # Example command calling variants and CpG methylation
     #   Compute settings: GPU devices 0 and 1 with 8 CPU cores
     megalodon raw_fast5s/ \
-        --outputs basecalls mappings snps mods \
+        --outputs basecalls mappings variants mods \
         --reference reference.fa --variant-filename variants.vcf.gz \
         --mod-motif Z CG 0 --devices 0 1 --processes 8 --verbose-read-progress 3
 
-This command produces the ``megalodon_results`` output directory containing basecalls, mappings, SNP and modified base results.
+This command produces the ``megalodon_results`` output directory containing basecalls, mappings, sequence variants and modified base results.
 The format for each output is described below.
 
 .. note::
@@ -76,7 +76,7 @@ Inputs
   - Format: VCF or BCF
 
     - If not indexed, indexing will be performed
-  - Megalodon currently requires a set of candidate variants for ``--outputs snps``.
+  - Megalodon currently requires a set of candidate variants for ``--outputs variants``.
   - Only small indels (default less than ``50`` bases) are tested by default.
 
     - Specify the ``--max-indel-size`` argument to process larger indels
@@ -108,14 +108,14 @@ Outputs
   - Aggregated calls
 
     - Aggregated calls are output in either bedMethyl format (default; one file per modified base), a VCF variant format (including all modified bases) or wiggle format (one file per modified base/strand combination).
-- SNP Variant Calls
+- Sequence Variant Calls
 
-  - Per-read SNP Calls
+  - Per-read Variant Calls
 
-    - SQL DB containing scores at each tested reference location
+    - SQL DB containing scores for each tested variant
 
-      - Contains a single ``snps`` table indexed by reference position
-    - Tab-delimited output can be produced by adding the ``--write-snps-text`` flag
+      - Contains a single ``variants`` table indexed by reference position
+    - Tab-delimited output can be produced by adding the ``--write-variants-text`` flag
   - Aggregated calls
 
     - Format: VCF
@@ -138,13 +138,13 @@ Note that the model parameters must (currently) be loaded into each GPU process
 The ``--chunk-size`` and ``--chunk-overlap`` arguments allow users to specify read chunking, but signal normalization is always carried out over the entire read.
 
 A number of helper processes will be spawned in order to perform more minor tasks, which should take minimal compute resources.
-These include enumerating read ids and files, collecting and reporting progress information and getting data from read processing queues and writing outputs (basecalls, mappings, SNPs and modified bases).
+These include enumerating read ids and files, collecting and reporting progress information and getting data from read processing queues and writing outputs (basecalls, mappings, sequence variants and modified bases).
 
 Model Compatibility
 -------------------
 
 The model and calibration files included with megalodon are applicable only to MinION or GridION R9.4.1 flowcells.
-New models trained with taiyaki can be used with megalodon, but in order to obtain the highest performance the megalodon (SNP and modified base) calibration files should be reproduced for any new model (TODO provide walkthrough).
+New models trained with taiyaki can be used with megalodon, but in order to obtain the highest performance the megalodon (variant and modified base) calibration files should be reproduced for any new model (TODO provide walkthrough).
 
 The included model contains 5mC and 6mA capabilities.
 5mC was trained only in the human (CpG) and E. coli (CCWGG) contexts while the 6mA was trained only on the E. coli (GATC) context.

diff --git a/docs/advanced_arguments.rst b/docs/advanced_arguments.rst
@@ -17,34 +17,37 @@ Output Arguments
   - A file containing ``read_ids`` to process (one per line).
   - Used in the variant phasing pipeline.
 
--------------
-SNP Arguments
--------------
+--------------------------
+Sequence Variant Arguments
+--------------------------
 
-- ``--disable-snp-calibration``
+- ``--disable-variant-calibration``
 
-  - Use raw neural network SNP scores.
+  - Use raw neural network sequence variant scores.
   - This option should be set when calibrating a new model.
-  - Default: Calibrate scores as described in ``--snp-calibration-filename``
+  - Default: Calibrate scores as described in ``--variant-calibration-filename``
 - ``--heterozygous-factors``
 
   - Bayes factor used when computing heterozygous probabilities in diploid variant calling mode.
   - Two factors must be provided for single base subsititution variants and indels.
 - ``--max-indel-size``
 
   - Maximum indel size to include in testing. Default: 50
-- ``--snp-all-paths``
+- ``--variant-all-paths``
 
   - Compute the forward algorithm all paths score.
   - Default: Viterbi best-path score.
-- ``--snp-calibration-filename``
+- ``--variant-calibration-filename``
 
-  - File containing emperical calibration for SNP scores.
-  - As created by megalodon/scripts/calibrate_snp_llr_scores.py.
+  - File containing emperical calibration for sequence variant scores.
+  - As created by megalodon/scripts/calibrate_variant_llr_scores.py.
   - Default: Load default calibration file.
 - ``--variant-context-bases``
 
   - Context bases for single base SNP and indel calling. Default: [10, 30]
+- ``--variant-locations-on-disk``
+
+  - Force sequence variant locations to be stored only within on disk database table. This option will reduce the RAM memory requirement, but may drastically slow processing. Default: Store locations in memory and on disk.
 - ``--write-vcf-log-probs``
 
   - Write per-read alt log probabilities out in non-standard VCF field.
@@ -99,6 +102,9 @@ Modified Base Arguments
       - A genotype field ``VALID_DP`` indicates the number of reads included in the proportion modified calculation.
       - Modified base proportion estimates are stored in genotype fields specified by the single letter modified base encodings (definied in the model file).
 
+- ``--mod-positions-on-disk``
+
+  - Force modified base positions to be stored only within on disk database table. This option will reduce the RAM memory requirement, but may drastically slow processing. Default: Store positions in memory and on disk.
 - ``--write-mod-log-probs``
 
   - Write per-read modified base log probabilities out in non-standard VCF field.
@@ -120,9 +126,9 @@ This output category is intended for use in generating reference sequences for t
 - ``--refs-include-mods``
 
   - Include modified base calls in per-read reference output.
-- ``--refs-include-snps``
+- ``--refs-include-variants``
 
-  - Include SNP calls in per-read reference output.
+  - Include sequence variant calls in per-read reference output.
 - ``--refs-percent-identity-threshold``
 
   - Only include reads with higher percent identity in per-read reference output.

diff --git a/docs/algorithm_details.rst b/docs/algorithm_details.rst
@@ -2,7 +2,7 @@
 Megalodon Algorithm Details
 ***************************
 
-This page describes the details of how megalodon processes the raw nanopore signal to produce highly-accurate modified base and SNP calls.
+This page describes the details of how megalodon processes the raw nanopore signal to produce highly-accurate modified base and sequence variant calls.
 
 ------------
 Base Calling
@@ -23,11 +23,11 @@ The neural network output is anchored to the reference via standard read mapping
 If no reference mapping is produced (using ``minimap2`` via the ``mappy`` python interface) that read is not processed further (basecalls will be output if requested).
 This standard read mapping is processed to produce a matching of each basecall with a reference position.
 Reference positions within an insertion or deletion are assigned to the previous mapped read position (left justified).
-This constitutes the reference anchoring used for modified base and SNP calling steps.
+This constitutes the reference anchoring used for modified base and sequence variant calling steps.
 
------------
-SNP Calling
------------
+------------------------
+Sequence Variant Calling
+------------------------
 
 Megalodon currently filters alleles over a certain maximum size (default 50) as performance on larger indels has not currenty been validated.
 
@@ -43,14 +43,14 @@ The difference between these two scores is the assigned score for the proposed v
 Lower (negative) score are evidence for the alternative sequence and higher (positive) scores are evidence for the reference sequence.
 
 These raw scores are softmax values over potential states, to match characteristics of a probability distribution.
-In practice, these scores do not match emperical probabilities for a SNP given a truth dataset.
+In practice, these scores do not match emperical probabilities for a variant given a truth dataset.
 Thus a calibration step is applied to convert these scores to estimated emperical probabilities.
 This enables more accurate aggregation across reads.
 
 Finally, calls across reads at each reference location are aggregated in order make a sample-level call.
 These results will be output into a VCF format file.
 
-Currently ``diploid`` (default) and ``haploid`` SNP aggregation modes are available.
+Currently ``diploid`` (default) and ``haploid`` variant aggregation modes are available.
 In ``haploid`` mode the probability of the reference and alternative alleles are simply the normalized (via Bayes' theorem) product of the individual read probabilities.
 In ``diploid`` mode the probability of each genotype (homozygous reference, heterozygous and homozygous alternative) are computed.
 The probabilities for homozygous alleles are as in the ``haploid`` mode, while the heterozygous probability is given by the weighted sum of the maximal probabilities taken over the sampling distribution (binomial with ``p=0.5``) given a true diploid heterozygous allele.

diff --git a/docs/common_arguments.rst b/docs/common_arguments.rst
@@ -35,7 +35,7 @@ Output Arguments
 - ``--outputs``
 
   - Specify desired outputs.
-  - Options are ``basecalls``, ``mod_basecalls``, ``mappings``, ``whatshap_mappings``, ``per_read_snps``, ``per_read_mods``, ``snp``, and ``mods``.
+  - Options are ``basecalls``, ``mod_basecalls``, ``mappings``, ``whatshap_mappings``, ``per_read_variants``, ``per_read_mods``, ``variants``, and ``mods``.
 
     - ``mod_basecalls`` are currently output in an HDF5 file with a data set corresponding to each read (accessed via the ``read_id``).
     - ``whatshap_mappings`` are intended only for obtaining highly accurate phased variant genotypes.
@@ -77,11 +77,11 @@ Sequence Variant Arguments
     - Variants file must be sorted.
     - If variant file is not compressed and indexed this will be performed before further processing.
   - Variants must be matched to the ``--reference`` provided.
-- ``--write-snps-text``
+- ``--write-variants-text``
 
-  - Output per-read SNPs in text format.
+  - Output per-read variants in text format.
 
-    - Output includes columns: ``read_id``, ``chrm``, ``strand``, ``pos``, ``ref_log_prob``, ``alt_log_prob``, ``snp_ref_seq``, ``snp_alt_seq``, ``snp_id``
+    - Output includes columns: ``read_id``, ``chrm``, ``strand``, ``pos``, ``ref_log_prob``, ``alt_log_prob``, ``var_ref_seq``, ``var_alt_seq``, ``var_id``
     - Log probabilities are calibrated to match observed log-likelihood ratios from ground truth samples.
 
       - Reference log probabilities are included to make processing mutliple alternative allele sites easier to process.

diff --git a/docs/computing_considerations.rst b/docs/computing_considerations.rst
@@ -36,12 +36,12 @@ A separate thread is linked to each per-read processing worker in order to acces
 Thus users may notice threads opened for this processing.
 These threads will generally consume very little compute.
 
------------------------------
-SNP and Modified Base Calling
------------------------------
+---------------------------------
+Variant and Modified Base Calling
+---------------------------------
 
-SNP and modified base calling is computed within the per-read processing workers using CPU resources.
+Sequence variant and modified base calling is computed within the per-read processing workers using CPU resources.
 Generally, this portion of processing will comsume a minority of the compute resources.
-Proposing many SNPs (e.g. all possible 3+ base indels) may show a bottle neck at this portion of processing.
+Proposing many variants (e.g. all possible 3+ base indels) may show a bottle neck at this portion of processing.
 Internal testing shows that proposal of all possible single base substitutions shows minimal processing at this portion of per-read processing.
-Note that the data bases storing per-read SNP variants may show slower indexing with very large proposed SNP sets (performed at the end of per-read processing).
+Note that the database storing per-read variant score may show slower indexing with a very large number of proposed variant sets (performed at the end of per-read processing).
diff --git a/docs/conf.py b/docs/conf.py
@@ -265,7 +265,7 @@
 texinfo_documents = [
   ('index', project, u'{} Documentation'.format(project),
    u'Oxford Nanopore Technologies', project,
-   'Megalodon provides "basecalling augmentation" for raw nanopore sequencing reads, including direct, reference-guided SNP and modified base calling.',
+   'Megalodon provides "basecalling augmentation" for raw nanopore sequencing reads, including direct, reference-guided sequence variant and modified base calling.',
    'Miscellaneous'),
 ]
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -2,7 +2,7 @@
 Welcome to Megalodon's documentation!
 *************************************
 
-Megalodon provides "basecalling augmentation" for raw nanopore sequencing reads, including direct, reference-guided SNP and modified base calling.
+Megalodon provides "basecalling augmentation" for raw nanopore sequencing reads, including direct, reference-guided sequence variant and modified base calling.
 
 Megalodon anchors the information rich neural network basecalling output to a reference genome. Variants, either modified bases or alternative canonical bases, are then proposed and scored in order to produce highly-accurate reference anchored calls.
 
@@ -45,14 +45,14 @@ Megalodon is accessed via the command line interface, ``megalodon`` command.
    # Example command calling variants and CpG methylation
    #   Compute settings: GPU devices 0 and 1 with 8 CPU cores
    megalodon raw_fast5s/ \
-       --outputs basecalls mappings snps mods \
+       --outputs basecalls mappings variants mods \
        --reference reference.fa --variant-filename variants.vcf \
        --mod-motif Z CG 0 --devices 0 1 --processes 8 --verbose-read-progress 3
 
-This command produces the ``megalodon_results`` output directory containing basecalls, mappings, SNP and modified base results.
+This command produces the ``megalodon_results`` output directory containing basecalls, mappings, sequence variant and modified base results.
 
 The majority of megalodon's functionality is accessed via the ``megalodon`` command (exemplified above), though a small number of additional scripts are found in the ``scripts`` directory of the code repository.
-These include independent modified base or SNP aggregation (much faster than re-computing per-read calls), modified base result validation, and model statistic calibration.
+These include independent modified base or variant aggregation (much faster than re-computing per-read calls), modified base result validation, and model statistic calibration.
 Helper scripts to perform sequence variant phasing (details here :doc:`variant_phasing`) are also included in the ``scripts`` directory of the repository.
 
 --------

diff --git a/docs/variant_phasing.rst b/docs/variant_phasing.rst
@@ -21,7 +21,7 @@ Workflow
 
    # run megalodon to produce whatshap_mappings
    megalodon \
-       $reads_dir --outputs mappings snps whatshap_mappings \
+       $reads_dir --outputs mappings variants whatshap_mappings \
        --reference $ref --variant-filename $variants_vcf \
        --output-directory $out_dir \
        --processes $nproc --devices $gpu_devices \
@@ -58,12 +58,12 @@ Workflow
        $out_dir/whatshap_mappings
    python \
        megalodon/scripts/run_aggregation.py \
-       --outputs snps --haploid --output-suffix haplotype_1 \
+       --outputs variants --haploid --output-suffix haplotype_1 \
        --read-ids-filename $out_dir/whatshap_mappings.haplotype_1_read_ids.txt \
        --reference $ref --processes $nproc
    python \
        megalodon/scripts/run_aggregation.py \
-       --outputs snps --haploid --output-suffix haplotype_2 \
+       --outputs variants --haploid --output-suffix haplotype_2 \
        --read-ids-filename $out_dir/whatshap_mappings.haplotype_2_read_ids.txt \
        --reference $ref --processes $nproc