Merge branch 'main' into main
GavinHuttley authored Nov 8, 2024
2 parents b6f7524 + 64ec9d7 commit 04e1c40
Showing 5 changed files with 231 additions and 20 deletions.
227 changes: 218 additions & 9 deletions README.md
@@ -37,8 +37,7 @@ Options:
 -o, --outpath PATH write processed seqs to this filename [required]
 -np, --numprocs INTEGER number of processes [default: 1]
 -F, --force_overwrite Overwrite existing file if it exists
--m, --moltype [dna|rna] Molecular type of sequences, defaults to DNA
-[default: dna]
+-m, --moltype [dna|rna] Molecular type of sequences [default: dna]
 -L, --limit INTEGER number of sequences to process
 -hp, --hide_progress hide progress bars
 --help Show this message and exit.
@@ -75,7 +74,7 @@ Usage: dvs nmost [OPTIONS]
 Identify n seqs that maximise average delta JSD
 Options:
--s, --seqfile PATH path to .dvtgseqs file [required]
+-s, --seqfile PATH path to .dvseqs file [required]
 -o, --outpath PATH the input string will be cast to Path instance
 -n, --number INTEGER number of seqs in divergent set [required]
 -k INTEGER k-mer size [default: 6]
@@ -150,11 +149,11 @@ named sequences are added to the final result.
 Input type
 ----------
-SequenceCollection, ArrayAlignment, Alignment
+ArrayAlignment, SequenceCollection, Alignment
 Output type
 -----------
-SequenceCollection, ArrayAlignment, Alignment
+ArrayAlignment, SequenceCollection, Alignment
```
<!-- [[[end]]] -->
@@ -188,7 +187,7 @@ Usage: dvs max [OPTIONS]
 Identify the seqs that maximise average delta JSD
 Options:
--s, --seqfile PATH path to .dvtgseqs file [required]
+-s, --seqfile PATH path to .dvseqs file [required]
 -o, --outpath PATH the input string will be cast to Path instance
 -z, --min_size INTEGER minimum size of divergent set [default: 7]
 -zp, --max_size INTEGER maximum size of divergent set
@@ -273,12 +272,222 @@ named sequences are added to the final result.
 Input type
 ----------
-SequenceCollection, ArrayAlignment, Alignment
+ArrayAlignment, SequenceCollection, Alignment
 Output type
 -----------
-SequenceCollection, ArrayAlignment, Alignment
+ArrayAlignment, SequenceCollection, Alignment
```
<!-- [[[end]]] -->
</details>
</details>

### `dvs ctree`: build a phylogeny using k-mers

The result of the `ctree` command is a newick formatted tree string without distances.

> **Note**
> A fuller explanation is coming soon!
<details>
<summary>Options for command line dvs ctree</summary>

<!-- [[[cog
import cog
from diverse_seq.cli import main
from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(main, ["ctree", "--help"])
help = result.output.replace("Usage: main", "Usage: dvs")
cog.out(
"```\n{}\n```".format(help)
)
]]] -->
```
Usage: dvs ctree [OPTIONS]
Quickly compute a cluster tree based on kmers for a collection of sequences.
Options:
-s, --seqfile PATH path to .dvseqs file [required]
-o, --outpath PATH the input string will be cast to Path instance
-m, --moltype [dna|rna] Molecular type of sequences [default: dna]
-k INTEGER k-mer size [default: 6]
--sketch-size INTEGER sketch size for mash distance
-d, --distance [mash|euclidean]
distance measure for tree construction
[default: mash]
-c, --canonical-kmers consider kmers identical to their reverse
complement
-L, --limit INTEGER number of sequences to process
-np, --numprocs INTEGER number of processes [default: 1]
-hp, --hide_progress hide progress bars
--help Show this message and exit.
```
<!-- [[[end]]] -->

</details>

<details>
<summary>Options for cogent3 app dvs_ctree</summary>

The `dvs ctree` is also available as the [cogent3 app](https://cogent3.org/doc/app/index.html) `dvs_ctree` or `dvs_par_ctree`. The latter is not composable, but can run the analysis for a single collection in parallel.

<!-- [[[cog
import cog
import contextlib
import io
from cogent3 import app_help
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
app_help("dvs_ctree")
cog.out(
"```\n{}\n```".format(buffer.getvalue())
)
]]] -->
```
Overview
--------
Create a cluster tree from kmer distances.
Options for making the app
--------------------------
dvs_ctree_app = get_app(
'dvs_ctree',
k=12,
sketch_size=3000,
moltype='dna',
distance_mode='mash',
mash_canonical_kmers=None,
show_progress=False,
)
Initialise parameters for generating a kmer cluster tree.
Parameters
----------
k
kmer size
sketch_size
size of sketches, only applies to mash distance
moltype
seq collection molecular type
distance_mode
mash distance or euclidean distance between kmer freqs
mash_canonical_kmers
whether to use mash canonical kmers for mash distance
show_progress
whether to show progress bars
Notes
-----
This app is composable.
If mash_canonical_kmers is enabled when using the mash distance,
kmers are considered identical to their reverse complement.
References
----------
.. [1] Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B.,
Bergman, N. H., Koren, S., & Phillippy, A. M. (2016).
Mash: fast genome and metagenome distance estimation using MinHash.
Genome biology, 17, 1-14.
Input type
----------
ArrayAlignment, SequenceCollection, Alignment
Output type
-----------
PhyloNode
```
<!-- [[[end]]] -->


<!-- [[[cog
import cog
import contextlib
import io
from cogent3 import app_help
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
app_help("dvs_par_ctree")
cog.out(
"```\n{}\n```".format(buffer.getvalue())
)
]]] -->
```
Overview
--------
Create a cluster tree from kmer distances in parallel.
Options for making the app
--------------------------
dvs_par_ctree_app = get_app(
'dvs_par_ctree',
k=12,
sketch_size=3000,
moltype='dna',
distance_mode='mash',
mash_canonical_kmers=None,
show_progress=False,
max_workers=None,
parallel=True,
)
Initialise parameters for generating a kmer cluster tree.
Parameters
----------
k
kmer size
sketch_size
size of sketches, only applies to mash distance
moltype
seq collection molecular type
distance_mode
mash distance or euclidean distance between kmer freqs
mash_canonical_kmers
whether to use mash canonical kmers for mash distance
show_progress
whether to show progress bars
numprocs
number of workers, defaults to running serial
Notes
-----
This app is not composable but can run in parallel. It is
best suited to a single large sequence collection.
If mash_canonical_kmers is enabled when using the mash distance,
kmers are considered identical to their reverse complement.
References
----------
.. [1] Ondov, B. D., Treangen, T. J., Melsted, P., Mallonee, A. B.,
Bergman, N. H., Koren, S., & Phillippy, A. M. (2016).
Mash: fast genome and metagenome distance estimation using MinHash.
Genome biology, 17, 1-14.
Input type
----------
ArrayAlignment, SequenceCollection, Alignment
Output type
-----------
PhyloNode
```
<!-- [[[end]]] -->

</details>
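
The help text above mentions a sketch size, a mash distance and canonical k-mers without showing how they fit together. The following is a minimal pure-Python sketch of the general MinHash idea described by Ondov et al. (2016), not the `diverse_seq` implementation. All function names are hypothetical, and the use of SHA-1 as the hash function is an arbitrary assumption made for illustration.

```python
import hashlib
import math

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_set(seq: str, k: int, canonical: bool = False) -> set[str]:
    """Collect the distinct k-mers of seq.

    With canonical=True a k-mer and its reverse complement are stored
    under one key (the lexicographically smaller of the two).
    """
    found = set()
    for start in range(len(seq) - k + 1):
        kmer = seq[start : start + k]
        if canonical:
            kmer = min(kmer, reverse_complement(kmer))
        found.add(kmer)
    return found


def bottom_sketch(kmers: set[str], sketch_size: int) -> list[int]:
    """Hash every k-mer and keep the sketch_size smallest hash values."""
    hashes = {
        int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        for kmer in kmers
    }
    return sorted(hashes)[:sketch_size]


def mash_distance(sketch_a, sketch_b, k: int, sketch_size: int) -> float:
    """Estimate the mash distance from two bottom sketches."""
    a, b = set(sketch_a), set(sketch_b)
    # Bottom sketch of the union of both hash sets.
    merged = sorted(a | b)[:sketch_size]
    if not merged:
        raise ValueError("both sketches are empty")
    # Jaccard estimate: fraction of merged hashes present in both sketches.
    shared = sum(1 for h in merged if h in a and h in b)
    jaccard = shared / len(merged)
    if jaccard == 0:
        return 1.0
    distance = -math.log(2 * jaccard / (1 + jaccard)) / k
    return min(1.0, max(0.0, distance))


seq = "ACGTTGCAAGGCTA"
forward = bottom_sketch(kmer_set(seq, 3, canonical=True), 8)
reverse = bottom_sketch(kmer_set(reverse_complement(seq), 3, canonical=True), 8)
print(mash_distance(forward, reverse, k=3, sketch_size=8))  # 0.0
```

With canonical k-mers a sequence and its reverse complement yield the same k-mer set, so their estimated distance is zero. Without the `canonical` flag the two strands would generally be treated as different sequences.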
9 changes: 6 additions & 3 deletions src/diverse_seq/cli.py
@@ -457,7 +457,6 @@ def ctree(
     hide_progress: bool,
 ):
     """Quickly compute a cluster tree based on kmers for a collection of sequences."""
-
     if seqfile.suffix != ".dvseqs":
         dvs_util.print_colour(
             "Sequence data needs to be preprocessed, use 'dvs prep'",
@@ -479,7 +478,7 @@ def ctree(
         )
         sys.exit(1)

-    seqids = dvs_data_store.get_seqids_from_store(seqfile)[:limit]
+    seqids = dvs_data_store.get_seqids_from_store(seqfile)
     if limit is not None:
         seqids = seqids[:limit]

@@ -496,7 +495,11 @@ def ctree(
         show_progress=not hide_progress,
     )
     tree = app(seqids)  # pylint: disable=not-callable
-    tree.write(outpath)
+    if not tree:
+        dvs_util.print_colour(tree, "red")
+        sys.exit(1)
+
+    tree.write(outpath)


if __name__ == "__main__":
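
The new `if not tree:` guard in `ctree` only works if a failed app call returns an object that evaluates as false. The sketch below illustrates that pattern in isolation, assuming a failure object of that kind (cogent3 apps signal failure with a `NotCompleted` instance). `FailedResult` and `write_tree` are hypothetical names, and printing to stderr stands in for `dvs_util.print_colour`.

```python
import sys


class FailedResult:
    """Hypothetical stand-in for a failure object; evaluates as False."""

    def __init__(self, message: str) -> None:
        self.message = message

    def __bool__(self) -> bool:
        return False

    def __str__(self) -> str:
        return self.message


def write_tree(tree, write) -> None:
    """Report a failed result and exit, otherwise hand the tree to write."""
    if not tree:
        # The failure object carries its own message, so printing it is enough.
        print(tree, file=sys.stderr)
        sys.exit(1)
    write(tree)
```

Checking the result before calling `write` means a failed run exits with status 1 and a readable message, rather than raising an attribute error on an object that has no `write` method.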
10 changes: 6 additions & 4 deletions src/diverse_seq/cluster.py
@@ -132,6 +132,8 @@ def __init__(
         Notes
         -----
         This app is composable.
+        If mash_canonical_kmers is enabled when using the mash distance,
+        kmers are considered identical to their reverse complement.
@@ -233,14 +235,14 @@ def make_cluster_tree(
                 tree_dict.pop(right_index),
             )
             node_index += 1

-    tree = make_tree(str(tree_dict[node_index - 1]))

+    # use string representation and then remove quotes
+    treestring = str(tree_dict[node_index - 1]).replace("'", "")
+    tree = make_tree(treestring=treestring, underscore_unmunge=True)
     progress.update(tree_task, completed=1, total=1)

     return tree
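
The string step in the change above can be tried on its own. This is a small sketch assuming, as the surrounding lines suggest, that the clusters are held as nested tuples of sequence names. `nested_tuple_to_treestring` is a hypothetical helper name and the sequence names are made up.

```python
def nested_tuple_to_treestring(node: tuple) -> str:
    """Turn nested tuples of names into a parenthesised tree string.

    str() of a tuple quotes every name with single quotes; removing
    them leaves bare names separated by commas.
    """
    return str(node).replace("'", "")


clusters = (("seq_1", "seq_2"), "seq_3")
print(str(clusters))                         # (('seq_1', 'seq_2'), 'seq_3')
print(nested_tuple_to_treestring(clusters))  # ((seq_1, seq_2), seq_3)
```

One limitation of this approach is that a name containing an apostrophe would lose it, because every single quote in the string is removed.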



class DvsParCtreeMixin:
def _mash_dist(self, seq_arrays: Sequence[SeqArray]) -> numpy.ndarray:
"""Calculates pairwise mash distances between sequences in parallel.
1 change: 0 additions & 1 deletion src/diverse_seq/distance.py
@@ -159,7 +159,6 @@ def mash_distances(
     numpy.ndarray
         Pairwise mash distances between sequences.
     """
-
     if progress is None:
         progress = Progress(disable=True)

4 changes: 1 addition & 3 deletions tests/test_cli.py
@@ -253,9 +253,7 @@ def test_ctree(
 ):
     outpath = tmp_dir / "out.tre"

-    args = (
-        f"-s {processed_seq_path} -o {outpath} -d {distance} -k {k} -np {max_workers}"
-    )
+    args = f"-s {processed_seq_path} -o {outpath} -d {distance} -k {k} -np {max_workers} -hp"
     if sketch_size is not None:
         args += f" --sketch-size {sketch_size}"
     args = args.split()