Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Optimise CLI Usage of ctree #71

Merged
merged 8 commits into from
Nov 8, 2024
Merged

Conversation

rmcar17
Copy link
Collaborator

@rmcar17 rmcar17 commented Nov 4, 2024

Summary by Sourcery

Enhancements:

  • Refactor the dvs_par_ctree class to use a mixin for shared functionality, improving code modularity and reusability.

@rmcar17 rmcar17 requested a review from GavinHuttley November 4, 2024 05:22
Copy link

sourcery-ai bot commented Nov 4, 2024

Reviewer's Guide by Sourcery

This PR optimizes the CLI usage of the cluster tree functionality by introducing a new class dvs_cli_par_ctree that directly reads sequences from HDF5 storage, eliminating unnecessary sequence format conversions. It also includes performance improvements in the Mash sketch calculation by optimizing data structures and using NumPy arrays instead of Python lists.

Sequence diagram for optimized CLI usage of cluster tree

sequenceDiagram
    participant User
    participant CLI
    participant dvs_cli_par_ctree
    participant HDF5DataStore
    participant ClusterTree

    User->>CLI: Run ctree command
    CLI->>dvs_cli_par_ctree: Initialize with parameters
    dvs_cli_par_ctree->>HDF5DataStore: Read sequences from HDF5
    HDF5DataStore-->>dvs_cli_par_ctree: Return sequence data
    dvs_cli_par_ctree->>ClusterTree: Construct cluster tree
    ClusterTree-->>dvs_cli_par_ctree: Return PhyloNode
    dvs_cli_par_ctree-->>CLI: Return cluster tree
    CLI-->>User: Output tree to file
Loading

Updated class diagram for cluster tree classes

classDiagram
    class ClusterTreeBase {
        <<abstract>>
    }

    class DvsParCtreeMixin {
        +_mash_dist(seq_arrays: Sequence[SeqArray])
    }

    class dvs_par_ctree {
        +__init__(k: int, sketch_size: int | None, moltype: str, distance_mode: Literal["mash", "euclidean"], mash_canonical_kmers: bool | None, show_progress: bool, max_workers: int | None, parallel: bool)
        +main(seqs: c3_types.SeqsCollectionType) : PhyloNode
    }

    class dvs_cli_par_ctree {
        +__init__(seq_store: str | Path, limit: int | None, k: int, sketch_size: int | None, moltype: str, distance_mode: Literal["mash", "euclidean"], mash_canonical_kmers: bool | None, show_progress: bool, max_workers: int | None, parallel: bool)
        +main(seq_names: list[str]) : PhyloNode
    }

    ClusterTreeBase <|-- dvs_par_ctree
    ClusterTreeBase <|-- dvs_cli_par_ctree
    DvsParCtreeMixin <|.. dvs_par_ctree
    DvsParCtreeMixin <|.. dvs_cli_par_ctree

    note for dvs_cli_par_ctree "New class introduced to optimize CLI usage by reading sequences directly from HDF5 storage."
Loading

File-Level Changes

Change Details Files
Refactored cluster tree implementation by extracting common functionality into a mixin class
  • Created DvsParCtreeMixin class to share common functionality between cluster tree implementations
  • Moved _mash_dist and related methods to the mixin class
src/diverse_seq/cluster.py
Added new CLI-optimized parallel cluster tree implementation
  • Created dvs_cli_par_ctree class that directly reads from HDF5 storage
  • Implemented direct sequence array handling without intermediate conversions
  • Added support for limiting the number of sequences processed
src/diverse_seq/cluster.py
src/diverse_seq/cli.py
Optimized Mash sketch calculation performance
  • Replaced Python set with NumPy unique function for kmer hash deduplication
  • Changed kmer hash storage from list to NumPy array
  • Optimized heap initialization in mash_sketch function
  • Added numba.njit decorator to mash_sketch for improved performance
src/diverse_seq/distance.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @rmcar17 - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

src/diverse_seq/cluster.py Show resolved Hide resolved
@coveralls
Copy link

coveralls commented Nov 4, 2024

Pull Request Test Coverage Report for Build 11734163625

Details

  • 45 of 55 (81.82%) changed or added relevant lines in 3 files are covered.
  • 6 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-1.0%) to 90.826%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/diverse_seq/cli.py 4 5 80.0%
src/diverse_seq/cluster.py 40 43 93.02%
src/diverse_seq/distance.py 1 7 14.29%
Files with Coverage Reduction New Missed Lines %
src/diverse_seq/data_store.py 2 87.82%
src/diverse_seq/distance.py 4 83.85%
Totals Coverage Status
Change from base Build 11720616749: -1.0%
Covered Lines: 1188
Relevant Lines: 1308

💛 - Coveralls

Copy link
Collaborator

@GavinHuttley GavinHuttley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the issue caught by codacy should have been caught by a test, can you add one that executes that line of code

src/diverse_seq/distance.py Show resolved Hide resolved
@GavinHuttley GavinHuttley merged commit 87b87ce into HuttleyLab:main Nov 8, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants