diff --git a/CHANGELOG.md b/CHANGELOG.md index bf9cf47..b24f37a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,15 @@ All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html). +## [v0.3.2] +### Fixes +- [thresholds] OOB panic fix, #244 +- [dmr, pair] Allow strand information in regions to be provided. #240 +- [dmr, multi] Fix problem when many pairs are provided (#229) +### Adds +- [sample-probs] Change output format of probabilities table to make it easier to parse, also change schema. Output HTML documents with nicer tables. +- [ci] Build in Ubuntu-16 due to Centos7 EOL. + ## [v0.3.1] ### Fixes - [call-mods] Always change model to "explicit", dropped base modification probabilities should not be interpreted as canonical. diff --git a/book/src/SUMMARY.md b/book/src/SUMMARY.md index 40f478b..358ef33 100644 --- a/book/src/SUMMARY.md +++ b/book/src/SUMMARY.md @@ -3,6 +3,7 @@ - [Quick Start guides](./quick_start.md) - [Constructing bedMethyl tables](./intro_bedmethyl.md) - [Updating and adjusting MM tags](./intro_adjust.md) + - [Inspecting base modification probabilities](./intro_sample_probs.md) - [Summarizing a modBAM](./intro_summary.md) - [Making a motif BED file](./intro_motif_bed.md) - [Extracting read information to a table](./intro_extract.md) @@ -17,6 +18,7 @@ - [Narrow output to specific positions](./intro_include_bed.md) - [Extended subcommand help](./advanced_usage.md) - [Troubleshooting](./troubleshooting.md) +- [Frequently asked questions](./faq.md) - [Current limitations](./limitations.md) - [Performance considerations](./perf_considerations.md) - [Algorithm details](./algo_details.md) diff --git a/book/src/advanced_usage.md b/book/src/advanced_usage.md index cdfaa03..8bdfd5f 100644 --- a/book/src/advanced_usage.md +++ b/book/src/advanced_usage.md @@ -281,8 +281,8 @@ Usage: modkit adjust-mods [OPTIONS] Arguments: - BAM file to collapse mod call from. Can be a path to a file or one of `-` or `stdin` to - specify a stream from standard input. + Input BAM file, can be a path to a file or one of `-` or `stdin` to specify a stream from + standard input. File path to new BAM file to be created. Can be a path to a file or one of `-` or `stdin` @@ -325,6 +325,58 @@ Options: --output-sam Output SAM format instead of BAM. + --filter-probs + Filter out the lowest confidence base modification probabilities. + + -n, --num-reads + Sample approximately this many reads when estimating the filtering threshold. If + alignments are present reads will be sampled evenly across aligned genome. If a region is + specified, either with the --region option or the --sample-region option, then reads will + be sampled evenly across the region given. This option is useful for large BAM files. In + practice, 10-50 thousand reads is sufficient to estimate the model output distribution and + determine the filtering threshold. + + [default: 10042] + + --sample-region + Specify a region for sampling reads from when estimating the threshold probability. If + this option is not provided, but --region is provided, the genomic interval passed to + --region will be used. Format should be :- or . + + --sampling-interval-size + Interval chunk size to process concurrently when estimating the threshold probability, can + be larger than the pileup processing interval. + + [default: 1000000] + + -p, --filter-percentile + Filter out modified base calls where the probability of the predicted variant is below + this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence + modification calls. + + [default: 0.1] + + --filter-threshold + Specify the filter threshold globally or per primary base. A global filter threshold can + be specified with by a decimal number (e.g. 0.75). Per-base thresholds can be specified by + colon-separated values, for example C:0.75 specifies a threshold value of 0.75 for + cytosine modification calls. Additional per-base thresholds can be specified by repeating + the option: for example --filter-threshold C:0.75 --filter-threshold A:0.70 or specify a + single base option and a default for all other bases with: --filter-threshold A:0.70 + --filter-threshold 0.9 will specify a threshold value of 0.70 for adenine and 0.9 for all + other base modification calls. + + --mod-threshold + Specify a passing threshold to use for a base modification, independent of the threshold + for the primary sequence base or the default. For example, to set the pass threshold for + 5hmC to 0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as + usual and used for canonical cytosine and other modifications unless the + `--filter-threshold` option is also passed. See the online documentation for more details. + + --only-mapped + Only use base modification probabilities from bases that are aligned when estimating the + filter threshold (i.e. ignore soft-clipped, and inserted bases). + --suppress-progress Hide the progress bar @@ -398,13 +450,9 @@ Options: -o, --out-dir Directory to deposit result tables into. Required for model probability histogram output. - Creates two files probabilities.tsv and probabilities.txt The .txt contains - ASCII-histograms and the .tsv contains tab-separated variable data represented by the - histograms. --prefix - Label to prefix output files with. E.g. 'foo' will output foo_thresholds.tsv, - foo_probabilities.tsv, and foo_probabilities.txt. + Label to prefix output files with. --force Overwrite results if present. @@ -431,11 +479,6 @@ Options: --hist Output histogram of base modification prediction probabilities. - --buckets - Number of buckets for the histogram, if used. - - [default: 128] - -n, --num-reads Approximate maximum number of reads to use, especially recommended when using a large BAM without an index. If an indexed BAM is provided, the reads will be sampled evenly over the @@ -1397,7 +1440,7 @@ sample is input as a bgzip pileup bedMethyl (produced by pileup, for example) th tabix index. Output is a BED file with the score column indicating the magnitude of the difference in methylation between the two samples. See the online documentation for additional details. -Usage: modkit dmr pair [OPTIONS] -a -b --ref +Usage: modkit dmr pair [OPTIONS] --ref Options: -a diff --git a/book/src/dmr_scoring_details.md b/book/src/dmr_scoring_details.md index 37ee8e6..dffa12e 100644 --- a/book/src/dmr_scoring_details.md +++ b/book/src/dmr_scoring_details.md @@ -118,7 +118,8 @@ The model is a simple 2-state hidden Markov model, shown below, where the two hi ![hmm](./images/hmm2.png "2-state segmenting HMM") -The model is run over the intersection of the modified positions in a [pileup](https://nanoporetech.github.io/modkit/intro_bedmethyl.html#description-of-bedmethyl-output) for which there is enough coverage, from one or more samples. + +The model is run over the intersection of the modified positions in a [pileup](./intro_bedmethyl.html#description-of-bedmethyl-output) for which there is enough coverage, from one or more samples. ## Transition parameters There are two transition probability parameters, \\(p\\) and \\(d\\). diff --git a/book/src/faq.md b/book/src/faq.md new file mode 100644 index 0000000..86e9c8e --- /dev/null +++ b/book/src/faq.md @@ -0,0 +1,40 @@ +# Frequently asked questions + +## How are base modification probabilities calculated? + +Base modifications are assigned a probability reflecting the confidence the base modification detection algorithm has in making a decision about the modification state of the molecule at a particular position. +The probabilities are parsed from the `ML` tag in the BAM record. These values reflect the probability of the base having a specific modification, `modkit` uses these values and calculates the probability for each modification as well as the probability the base is canonical: + +\\[ \ +P_{\text{canonical}} = 1 - \sum_{m \in \textbf{M}} P_{m} \ +\\] + +where \\(\textbf{M}\\) is the set of all of the potential modifications for the base. + +For example, consider using a m6A model that predicts m6A or canonical bases at adenine residues, if the \\( P_{\text{m6A}} = 0.9 \\) then the probability of canonical \\( \text{A} \\) is \\( P_{\text{canonical}} = 1 - P_{\text{m6A}} = 0.1 \\). +Or considering a typical case for cytosine modifications where the model predicts 5hmC, 5mC, and canonical cytosine: + +\\[ +P_{\text{5mC}} = 0.7, \\\\ +P_{\text{5hmC}} = 0.2, \\\\ +P_{\text{canonical}} = 1 - P_{\text{5mC}} + P_{\text{5hmC}} = 0.1, \\\\ +\\] + +A potential confusion is that `modkit` does not assume a base is canonical if the probability of modification is close to \\( \frac{1}{N_{\text{classes}}} \\), the lowest probability the algorithm may assign. + +## What value for `--filter-threshold` should I use? + +The same way that you may remove low quality data as a first step to any processing, `modkit` will filter out the lowest confidence base modification probabilities. +The filter threshold (or pass threshold) defines the minimum probability required for a read's base modification information at a particular position to be used in a downstream step. +This does not remove the whole read from consideration, just the base modification information attributed to a particular position in the read will be removed. +The most common place to encounter filtering is in `pileup`, where base modification probabilities falling below the pass threshold will be tabulated in the \\( \text{N}\_{\text{Fail}} \\) column instead of the \\( \text{N}\_{\text{valid}} \\) column. +For highest accuracy, the general recommendation is to let `modkit` estimate this value for you based on the input data. +The value is calculated by first taking a sample of the base modification probabilities from the input dataset and determining the \\(10^{\text{th}}\\) percentile probability value. +This percentile can be changed with the `--filter-percentile` option. +Passing a value to `--filter-threshold` and/or `--mod-threshold` that is higher or lower than the estimated value will have the effect of excluding or including more probabilities, respectively. +It may be a good idea to inspect the distribution of probability values in your data, the `modkit sample-probs` [command](./intro_sample_probs.md) is designed for this task. +Use the `--hist` and `--out-dir` options to collect a histogram of the prediction probabilities for each canonical base and modification. + + + + diff --git a/book/src/intro_sample_probs.md b/book/src/intro_sample_probs.md new file mode 100644 index 0000000..5b2f606 --- /dev/null +++ b/book/src/intro_sample_probs.md @@ -0,0 +1,28 @@ +# Inspecting base modification probabilities + +> For details on how base modification probabilities are calculated, see the [FAQ page](./faq.html#how-are-base-modification-probabilities-calculated) + +For most use cases the automatic filtering enabled in `modkit` will produce nearly ideal results. +However, in some cases such as exotic organisms or specialized assays, you may want to interrogate the base modification probabilities directly and tune the pass thresholds. +The `modkit sample-probs` command is designed for this task. +There are two ways to use this command, first by simply running `modkit sample-probs $mod_bam` to get a tab-separated file of threshold values for each modified base. +This can save time in downstream steps where you wish to re-use the threshold value by passing `--filter-threshold` and skip re-estimating the value. +To generate more advanced output, add `--hist --out-dir $output_dir` to the command and generate per-modification histograms of the output probabilities. +Using the command this way produces 3 files in the `$output_dir`: +1. An HTML document containing a histogram of the total counts of each probability emitted for each modification code (including canonical) in the sampled reads. +1. Another HTML document containing the proportions of each probability emitted. +1. A tab-separated table with the same information as the histograms and the percentile rank of each probability value. + +The schema of the table is as follows: + +| column | name | description | type | +|--------|-----------------|----------------------------------------------------------------------------------------------|--------| +| 1 | code | modification code or '-' for canonical | string | +| 2 | primary base | the primary DNA base for which the code applies | string | +| 3 | range_start | the inclusive start probability of the bin | float | +| 4 | range_end | the exclusive end probability of the bin | float | +| 5 | count | the total count of probabilities falling in this bin | int | +| 6 | frac | the fraction of the total calls for this code/primary base in this bin | float | +| 7 | percentile_rank | the [percentile rank](https://en.wikipedia.org/wiki/Percentile_rank) of this probability bin | float | + +From these plots and tables you can decide on a pass threshold per-modification code and use `--mod-threshold`/`--filter-threshold` [accordingly](./filtering.md). diff --git a/docs/404.html b/docs/404.html index 1640711..7d499f1 100644 --- a/docs/404.html +++ b/docs/404.html @@ -92,7 +92,7 @@