Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linted paper scripts and notebooks #81

Merged
merged 2 commits into from
Nov 11, 2024
Merged

Conversation

GavinHuttley
Copy link
Collaborator

@GavinHuttley GavinHuttley commented Nov 11, 2024

Summary by Sourcery

Lint and refactor paper scripts and notebooks to enhance code quality and maintainability, and add comprehensive documentation for the 'diverse-seq' application.

Enhancements:

  • Lint and refactor paper scripts and notebooks to improve code quality and maintainability.

Documentation:

  • Add comprehensive documentation for the 'diverse-seq' application, detailing its features, usage, and performance.

This was written collaboartively by myself, Katherine Caley
and Robert McArthur in a separate repo located
here https://github.com/GavinHuttley/DiverseSeqPaper

That repo is now archived and all subsequent edits will go via
the main DiverseSeq repository.
Copy link

sourcery-ai bot commented Nov 11, 2024

Reviewer's Guide by Sourcery

This PR adds Python scripts and Jupyter notebooks for generating figures and analyzing results for the paper. The changes include scripts for benchmarking performance, evaluating clustering algorithms, and generating visualizations of the results.

Class diagram for new classes in jsd_v_dist.py

classDiagram
    class min_dist {
        -dists
        -num
        +__init__(dists)
        +__call__(names: list[str]) -> float
    }

    class compare_sets {
        -dist_size: int
        -dvgt
        +__init__(app, dist_size: int)
        +main(aln: c3_types.AlignedSeqsType) -> dict
    }

    class make_sample {
        -pool_sizes: dict
        -seq_len: int
        +main(num: int) -> c3_types.UnalignedSeqsType
    }

    class seqcoll_to_records {
        -k: int
        +main(seqs: c3_types.UnalignedSeqsType) -> list[KmerSeq]
    }

    class true_positive {
        -expected: set[str]
        -label2pool: callable
        -size: int
        -stat: str
        +main(records: list[KmerSeq]) -> bool
    }
Loading

Class diagram for new classes in synthetic_known.py

classDiagram
    class make_sample {
        -pool_sizes: dict
        -seq_len: int
        +main(num: int) -> c3_types.UnalignedSeqsType
    }

    class seqcoll_to_records {
        -k: int
        +main(seqs: c3_types.UnalignedSeqsType) -> list[KmerSeq]
    }

    class true_positive {
        -expected: set[str]
        -label2pool: callable
        -size: int
        -stat: str
        +main(records: list[KmerSeq]) -> bool
    }

    class eval_condition {
        -num_reps
        -k
        -repeats
        -pools
        +main(settings: tuple[str, str, int]) -> c3_types.TabularType
    }
Loading

File-Level Changes

Change Details Files
Added scripts for benchmarking and performance evaluation
  • Added benchmark.py for measuring execution time of different commands
  • Added benchmark_ctree.py specifically for evaluating clustering tree performance
  • Implemented TimeIt and TempWorkingDir utility classes for benchmarking
paper/nbks/benchmark.py
paper/nbks/benchmark_ctree.py
Added scripts and notebooks for analyzing and visualizing results
  • Added Jupyter notebooks for analyzing JSD vs distance metrics
  • Added notebook for demonstrating plugin usage
  • Added notebook for analyzing benchmark results
  • Added notebook for analyzing clustering tree results
paper/nbks/jsd_v_dist.ipynb
paper/nbks/plugin_demo.ipynb
paper/nbks/benchmark.ipynb
paper/nbks/ctree.ipynb
Added supporting infrastructure for experiments
  • Added script for downloading required datasets
  • Added project path utilities for managing file locations
  • Added LaTeX files for generating algorithm figures
  • Created directory structure for results and figures
paper/nbks/get_data_sets.py
paper/nbks/project_path.py
paper/figs/max.tex
paper/figs/nmost.tex
Added scripts for clustering tree experiments
  • Added experiment.py for running clustering tree analysis
  • Added iq_experiment.py for IQ-TREE comparisons
  • Added likelihoods.py for computing tree likelihoods
paper/nbks/ctree/experiment.py
paper/nbks/ctree/iq_experiment.py
paper/nbks/ctree/likelihoods.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time. You can also use
    this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@GavinHuttley GavinHuttley merged commit ffd9a15 into HuttleyLab:JOSS Nov 11, 2024
10 checks passed
Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @GavinHuttley - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟡 Documentation: 1 issue found

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

return results


def run_max(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Refactor duplicate code between run_max() and run_nmost()

Extract the common functionality into a helper function that both run_max() and run_nmost() can use to reduce code duplication.

def _run_common(store_path: Path, stat, **kwargs):
    """Helper function for shared functionality between run_max and run_nmost"""
    results = {}
    # Common implementation here
    return results

def run_max(
    store_path: Path,
    stat,


For selecting representative sequences for large-scale analyses, we recommend use of the `nmost` command line tool. The choice of $k$ should be guided by the maximum number of unique $k$-mers in a DNA sequence of length $L$, indicated as the result of the expression $log(L/4)$. For instance, $k\approx 12$ for bacterial genomes (which are of the order $10^6$bp). For individual protein coding genes, as \autoref{fig:jsd-v-dist} indicates, $k=6$ for `nmost` gives a reasonable accuracy. For estimating a phylogeny, we recommend $k=12$ and a sketch size of $3000$.

# Availability
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (documentation): Consider adding specific installation command

It would be helpful to include the exact pip install command for users

Suggested change
# Availability
# Availability
`diverse-seq` can be installed from the Python Package Index using:
```pip install diverse-seq```
The GitHub repository for `diverse-seq` contains all the scripts used to generate the results in this work. A script for downloading the data sets used in this work, which are available from Zenodo **10.5281/zenodo.14052787**, is also included.

@coveralls
Copy link

Pull Request Test Coverage Report for Build 11785936111

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 91.892%

Totals Coverage Status
Change from base Build 11770223146: 0.0%
Covered Lines: 1190
Relevant Lines: 1295

💛 - Coveralls

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants