Commit

Fixes...
pfeiferd committed Feb 3, 2025
1 parent a635152 commit 63da0b1
Showing 4 changed files with 10 additions and 14 deletions.
3 changes: 1 addition & 2 deletions README.md
@@ -169,9 +169,8 @@ The meaning of the columns is as follows:
| `normalized kmers` | *k*-mer counts from column `kmers` but normalized with respect to the total number of *k*-mers per fastq file and the number of specific *k*-mers for the tax id in the database. The value allows for a less biased comparison of *k*-mer counts across fastq files and across species. It is computed as `normalizedKMersFactor` * `kmers` */ k<sub>f</sub> * u / u<sub>t</sub>*, where *k<sub>f</sub>* is the total number of *k*-mers in the fastq file, *u* is the total number of *k*-mers in the database and *u<sub>t</sub>* is the number of specific *k*-mers for the tax id in the database. `normalizedKMersFactor` is a configuration property; its default is 1000000000 (see also Section [Configuration parameters](#configuration-parameters)). |
| `exp. unique kmers` | The number of expected unique *k*-mers, which is *u<sub>t</sub> * (1 - (1 - 1/u<sub>t</sub>)*<sup>`kmers`</sup>), where *u<sub>t</sub>* is the number of specific *k*-mers for the tax id in the database. |
| `unique kmers / exp.` | The ratio `unique kmers` / `exp. unique kmers` for the tax id. This should be close to 1 for a consistent match of *k*-mers. ([This paper](https://arxiv.org/pdf/1602.05822.pdf) discusses the corresponding background distribution (of `unique kmers`).) |
| `quality prediction` | Computed as `normalized kmers` * `unique kmers / exp.`. It combines the normalized counts of *k*-mers with the valued consistency between *k*-mers and unique *k*-mers. |
| `max kmer counts` | The frequencies of the most frequent unique *k*-mers which are specific to the tax id's genome in descending order separated by `;`. This column is experimental and only added when the configuration property `matchWithKMerCounts` is set to `true`. The number of frequencies is determined via `maxKMerResCounts` (see also Section [Configuration parameters](#configuration-parameters)). |
| `reads >= 1 kmer` | Reads with at least one *k*-mer of the respective tax id. |
| `max kmer counts` | The frequencies of the most frequent unique *k*-mers which are specific to the tax id's genome in descending order separated by `;`. This column is experimental and only added when the configuration property `matchWithKMerCounts` is set to `true`. The number of frequencies is determined via `maxKMerResCounts` (see also Section [Configuration parameters](#configuration-parameters)). |
The frequencies from `max kmer counts` can be used to build frequency graphs for *k*-mers as shown below. The frequency graphs help to further assess the validity of analysis results.
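
The formulas for `normalized kmers`, `exp. unique kmers` and `unique kmers / exp.` given in the table can be sketched in Java as follows. This is a minimal illustration and not Genestrip's own code: the class name, method names and parameter names are hypothetical, and only the arithmetic is taken from the column descriptions.

```java
// Hypothetical helper; only the formulas come from the README table.
final class KMerStatsSketch {
    // normalizedKMersFactor * kmers / k_f * u / u_t
    static double normalizedKMers(double normalizedKMersFactor, long kmers,
            long totalKMersInFastq, long totalKMersInDb, long specificKMersForTaxId) {
        return normalizedKMersFactor * kmers / totalKMersInFastq
                * totalKMersInDb / specificKMersForTaxId;
    }

    // u_t * (1 - (1 - 1/u_t)^kmers)
    static double expectedUniqueKMers(long kmers, long specificKMersForTaxId) {
        double ut = specificKMersForTaxId;
        return ut * (1 - Math.pow(1 - 1.0 / ut, kmers));
    }

    // unique kmers / exp. unique kmers; close to 1 for a consistent match
    static double uniqueToExpectedRatio(long uniqueKMers, long kmers,
            long specificKMersForTaxId) {
        return uniqueKMers / expectedUniqueKMers(kmers, specificKMersForTaxId);
    }

    public static void main(String[] args) {
        // Made-up input values, for illustration only.
        System.out.println(normalizedKMers(1_000_000_000d, 10, 1000, 500, 50));
        System.out.println(expectedUniqueKMers(2, 2));
        System.out.println(uniqueToExpectedRatio(3, 2, 2));
    }
}
```

With this commit the `quality prediction` column is dropped, so the ratio is the last derived value in this group of columns.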

<p align="center">
7 changes: 4 additions & 3 deletions src/main/java/org/metagene/genestrip/GSConfigKey.java
@@ -146,12 +146,13 @@ public enum GSConfigKey implements ConfigKey {
+ "Using the bloom filter tends to shorten matching time, if the most part of the reads cannot be classified because they contain *no* *k*-mers from the database. "
+ "Otherwise, using the bloom filter might increase matching time by up to 30%. It also requires more main memory.")
USE_BLOOM_FILTER_FOR_MATCH("useBloomFilterForMatch", new BooleanConfigParamInfo(true), GSGoalKey.MATCH, GSGoalKey.MATCHLR),
@MDDescription("The absolute or relative maximum number of *k*-mers that do not have to be in the database for a read to be classified. "
@MDDescription("The absolute or relative maximum number of *k*-mers that do not need to be in the database for a read to be classified (read error count). "
+ "If the number is above `maxReadTaxErrorCount`, then the read will not be classified. "
+ "Otherwise the read will be classified in the same way as [done by Kraken](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-3-r46/figures/1). "
+ "If `maxReadTaxErrorCount` is >= 1, then it is interpreted as an absolute number of *k*-mers. "
+ "Otherwise (and so, if >= 0 and < 1), it is interpreted as the ratio between the *k*-mers not in the database and all *k*-mers of the read.")
MAX_READ_TAX_ERROR_COUNT("maxReadTaxErrorCount", new DoubleConfigParamInfo(0, Double.MAX_VALUE, 0.5),
+ "Otherwise (and so, if >= 0 and < 1), it is interpreted as the ratio between the *k*-mers not in the database and all *k*-mers of the read. "
+ "If `maxReadTaxErrorCount` < 0, then the read error count is disregarded, which means that even a single matching *k*-mer will lead to the read's classification.")
MAX_READ_TAX_ERROR_COUNT("maxReadTaxErrorCount", new DoubleConfigParamInfo(-1, Double.MAX_VALUE, -1),
GSGoalKey.MATCH, GSGoalKey.MATCHLR),
@MDDescription("If > 0, the corresponding number of frequencies of the most frequent *k*-mers per tax id will be reported.")
MAX_KMER_RES_COUNTS("maxKMerResCounts", new IntConfigParamInfo(0, 65536, 0), GSGoalKey.MATCH, GSGoalKey.MATCHLR),
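
The three ranges of `maxReadTaxErrorCount` described above can be read as the following decision rule. This sketch is an interpretation of the description text only; the class, method and parameter names are hypothetical, and the actual matching code in Genestrip may differ. In particular, requiring at least one *k*-mer from the database in the negative case is an assumption derived from the phrase "even a single matching *k*-mer".

```java
// Hypothetical sketch of the threshold semantics; not Genestrip's matcher.
final class ReadErrorThresholdSketch {
    /**
     * @param kmersNotInDb         read error count: k-mers of the read missing from the database
     * @param totalKMersInRead     all k-mers of the read
     * @param maxReadTaxErrorCount configured threshold (new default in this commit: -1)
     * @return true if the read may still be classified
     */
    static boolean mayClassify(int kmersNotInDb, int totalKMersInRead,
            double maxReadTaxErrorCount) {
        if (totalKMersInRead <= 0) {
            return false; // nothing to classify
        }
        if (maxReadTaxErrorCount < 0) {
            // Error count disregarded: one matching k-mer suffices (assumption).
            return kmersNotInDb < totalKMersInRead;
        }
        if (maxReadTaxErrorCount >= 1) {
            // Interpreted as an absolute number of k-mers.
            return kmersNotInDb <= maxReadTaxErrorCount;
        }
        // 0 <= threshold < 1: ratio of missing k-mers to all k-mers of the read.
        return (double) kmersNotInDb / totalKMersInRead <= maxReadTaxErrorCount;
    }
}
```

Under this reading, the previous default of 0.5 rejected reads with more than half of their *k*-mers missing from the database, while the new default of -1 accepts any read with at least one database *k*-mer.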
Original file line number Diff line number Diff line change
@@ -87,7 +87,7 @@ protected Integer fromString(String s) {
public boolean isValueInRange(Object value) {
if (value instanceof Integer) {
Integer i = (Integer) value;
return i >= min || i <= max;
return i >= min && i <= max;
}
return false;
}
@@ -126,7 +126,7 @@ protected Long fromString(String s) {
public boolean isValueInRange(Object value) {
if (value instanceof Long) {
Long i = (Long) value;
return i >= min || i <= max;
return i >= min && i <= max;
}
return false;
}
@@ -165,7 +165,7 @@ protected Double fromString(String s) {
public boolean isValueInRange(Object value) {
if (value instanceof Double) {
Double d = (Double) value;
return d >= min || d <= max;
return d >= min && d <= max;
}
return false;
}
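
Why the change from `||` to `&&` matters: whenever `min <= max`, every number is either `>= min` or `<= max`, so the old check accepted any value. The standalone sketch below contrasts the two conditions; the class and method names are hypothetical and exist only for this illustration.

```java
// Hypothetical demo of the range-check fix; not part of the Genestrip sources.
final class RangeCheckSketch {
    // Old condition: true for every value as long as min <= max.
    static boolean inRangeOld(int value, int min, int max) {
        return value >= min || value <= max;
    }

    // Fixed condition: true only for min <= value <= max.
    static boolean inRangeFixed(int value, int min, int max) {
        return value >= min && value <= max;
    }

    public static void main(String[] args) {
        // Bounds of maxKMerResCounts from the diff above: 0 to 65536.
        int min = 0, max = 65536;
        for (int v : new int[] { -5, 0, 65536, 65537 }) {
            System.out.println(v + ": old=" + inRangeOld(v, min, max)
                    + ", fixed=" + inRangeFixed(v, min, max));
        }
    }
}
```

The same reasoning applies to the `Long` and `Double` variants changed in this commit.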
Original file line number Diff line number Diff line change
@@ -129,7 +129,7 @@ public void printMatchResult(MatchingResult res, PrintStream out, Database wrapp
out.print(
"name;rank;taxid;reads;kmers from reads;kmers;unique kmers;contigs;average contig length;max contig length;max contig desc.;");
if (estimator != null) {
out.print("db coverage;normalized kmers;exp. unique kmers;unique kmers / exp.;quality prediction;");
out.print("db coverage;normalized kmers;exp. unique kmers;unique kmers / exp.;");
}
out.print("normalized reads; reads >= 1 kmer; normalized reads >= 1 kmer; reads >= 1 kmer bps; avg read >= 1 kmer len; normalized reads * avg len;");
if (res.isWithMaxKMerCounts()) {
@@ -204,11 +204,7 @@ public void printMatchResult(MatchingResult res, PrintStream out, Database wrapp
out.print(DF.format(expUnique));
out.print(';');

double cScore1 = stats.getUniqueKMers() / expUnique;
out.print(DF.format(cScore1));
out.print(';');

out.print(DF.format(normalizedKMers * cScore1));
out.print(DF.format(stats.getUniqueKMers() / expUnique));
out.print(';');

/*