Linted paper scripts and notebooks #81

GavinHuttley · 2024-11-11T20:46:56Z

Summary by Sourcery

Lint and refactor paper scripts and notebooks to enhance code quality and maintainability, and add comprehensive documentation for the 'diverse-seq' application.

Enhancements:

Lint and refactor paper scripts and notebooks to improve code quality and maintainability.

Documentation:

Add comprehensive documentation for the 'diverse-seq' application, detailing its features, usage, and performance.

This was written collaboartively by myself, Katherine Caley and Robert McArthur in a separate repo located here https://github.com/GavinHuttley/DiverseSeqPaper That repo is now archived and all subsequent edits will go via the main DiverseSeq repository.

sourcery-ai · 2024-11-11T20:46:59Z

Reviewer's Guide by Sourcery

This PR adds Python scripts and Jupyter notebooks for generating figures and analyzing results for the paper. The changes include scripts for benchmarking performance, evaluating clustering algorithms, and generating visualizations of the results.

Class diagram for new classes in jsd_v_dist.py

classDiagram
    class min_dist {
        -dists
        -num
        +__init__(dists)
        +__call__(names: list[str]) -> float
    }

    class compare_sets {
        -dist_size: int
        -dvgt
        +__init__(app, dist_size: int)
        +main(aln: c3_types.AlignedSeqsType) -> dict
    }

    class make_sample {
        -pool_sizes: dict
        -seq_len: int
        +main(num: int) -> c3_types.UnalignedSeqsType
    }

    class seqcoll_to_records {
        -k: int
        +main(seqs: c3_types.UnalignedSeqsType) -> list[KmerSeq]
    }

    class true_positive {
        -expected: set[str]
        -label2pool: callable
        -size: int
        -stat: str
        +main(records: list[KmerSeq]) -> bool
    }

Class diagram for new classes in synthetic_known.py

classDiagram
    class make_sample {
        -pool_sizes: dict
        -seq_len: int
        +main(num: int) -> c3_types.UnalignedSeqsType
    }

    class seqcoll_to_records {
        -k: int
        +main(seqs: c3_types.UnalignedSeqsType) -> list[KmerSeq]
    }

    class true_positive {
        -expected: set[str]
        -label2pool: callable
        -size: int
        -stat: str
        +main(records: list[KmerSeq]) -> bool
    }

    class eval_condition {
        -num_reps
        -k
        -repeats
        -pools
        +main(settings: tuple[str, str, int]) -> c3_types.TabularType
    }

File-Level Changes

Change	Details	Files
Added scripts for benchmarking and performance evaluation	Added benchmark.py for measuring execution time of different commands Added benchmark_ctree.py specifically for evaluating clustering tree performance Implemented TimeIt and TempWorkingDir utility classes for benchmarking	`paper/nbks/benchmark.py` `paper/nbks/benchmark_ctree.py`
Added scripts and notebooks for analyzing and visualizing results	Added Jupyter notebooks for analyzing JSD vs distance metrics Added notebook for demonstrating plugin usage Added notebook for analyzing benchmark results Added notebook for analyzing clustering tree results	`paper/nbks/jsd_v_dist.ipynb` `paper/nbks/plugin_demo.ipynb` `paper/nbks/benchmark.ipynb` `paper/nbks/ctree.ipynb`
Added supporting infrastructure for experiments	Added script for downloading required datasets Added project path utilities for managing file locations Added LaTeX files for generating algorithm figures Created directory structure for results and figures	`paper/nbks/get_data_sets.py` `paper/nbks/project_path.py` `paper/figs/max.tex` `paper/figs/nmost.tex`
Added scripts for clustering tree experiments	Added experiment.py for running clustering tree analysis Added iq_experiment.py for IQ-TREE comparisons Added likelihoods.py for computing tree likelihoods	`paper/nbks/ctree/experiment.py` `paper/nbks/ctree/iq_experiment.py` `paper/nbks/ctree/likelihoods.py`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time. You can also use
this command to specify where the summary should be inserted.

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey @GavinHuttley - I've reviewed your changes and they look great!

Here's what I looked at during the review

🟡 General issues: 1 issue found
🟢 Security: all looks good
🟢 Testing: all looks good
🟢 Complexity: all looks good
🟡 Documentation: 1 issue found

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2024-11-11T20:48:06Z

paper/nbks/jsd_v_dist.py

+    return results
+
+
+def run_max(


suggestion: Refactor duplicate code between run_max() and run_nmost()

Extract the common functionality into a helper function that both run_max() and run_nmost() can use to reduce code duplication.

def _run_common(store_path: Path, stat, **kwargs): """Helper function for shared functionality between run_max and run_nmost""" results = {} # Common implementation here return results def run_max( store_path: Path, stat,

sourcery-ai · 2024-11-11T20:48:06Z

paper/paper.md

+
+For selecting representative sequences for large-scale analyses, we recommend use of the `nmost` command line tool. The choice of $k$ should be guided by the maximum number of unique $k$-mers in a DNA sequence of length $L$, indicated as the result of the expression $log(L/4)$. For instance, $k\approx 12$ for bacterial genomes (which are of the order $10^6$bp). For individual protein coding genes, as \autoref{fig:jsd-v-dist} indicates, $k=6$ for `nmost` gives a reasonable accuracy. For estimating a phylogeny, we recommend $k=12$ and a sketch size of $3000$.
+
+# Availability


suggestion (documentation): Consider adding specific installation command

It would be helpful to include the exact pip install command for users

Suggested change

# Availability

# Availability

`diverse-seq` can be installed from the Python Package Index using:

```pip install diverse-seq```

The GitHub repository for `diverse-seq` contains all the scripts used to generate the results in this work. A script for downloading the data sets used in this work, which are available from Zenodo **10.5281/zenodo.14052787**, is also included.

coveralls · 2024-11-11T20:58:32Z

Pull Request Test Coverage Report for Build 11785936111

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 91.892%

Totals
Change from base Build 11770223146:	0.0%
Covered Lines:	1190
Relevant Lines:	1295

💛 - Coveralls

GavinHuttley added 2 commits November 11, 2024 12:31

STY: ruff applied to paper scripts and notebooks

4bbb3f9

GavinHuttley merged commit ffd9a15 into HuttleyLab:JOSS Nov 11, 2024
10 checks passed

sourcery-ai bot reviewed Nov 11, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Linted paper scripts and notebooks #81

Linted paper scripts and notebooks #81

GavinHuttley commented Nov 11, 2024 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Nov 11, 2024 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

sourcery-ai bot left a comment

sourcery-ai bot Nov 11, 2024

sourcery-ai bot Nov 11, 2024

coveralls commented Nov 11, 2024


		For selecting representative sequences for large-scale analyses, we recommend use of the `nmost` command line tool. The choice of $k$ should be guided by the maximum number of unique $k$-mers in a DNA sequence of length $L$, indicated as the result of the expression $log(L/4)$. For instance, $k\approx 12$ for bacterial genomes (which are of the order $10^6$bp). For individual protein coding genes, as \autoref{fig:jsd-v-dist} indicates, $k=6$ for `nmost` gives a reasonable accuracy. For estimating a phylogeny, we recommend $k=12$ and a sketch size of $3000$.

		# Availability

-# Availability
+# Availability
+`diverse-seq` can be installed from the Python Package Index using:
+```pip install diverse-seq```
+The GitHub repository for `diverse-seq` contains all the scripts used to generate the results in this work. A script for downloading the data sets used in this work, which are available from Zenodo **10.5281/zenodo.14052787**, is also included.

Linted paper scripts and notebooks #81

Linted paper scripts and notebooks #81

Conversation

GavinHuttley commented Nov 11, 2024 • edited by sourcery-ai bot Loading

Summary by Sourcery

sourcery-ai bot commented Nov 11, 2024 • edited Loading

Reviewer's Guide by Sourcery

Class diagram for new classes in jsd_v_dist.py

Class diagram for new classes in synthetic_known.py

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

sourcery-ai bot left a comment

Choose a reason for hiding this comment

sourcery-ai bot Nov 11, 2024

Choose a reason for hiding this comment

sourcery-ai bot Nov 11, 2024

Choose a reason for hiding this comment

coveralls commented Nov 11, 2024

Pull Request Test Coverage Report for Build 11785936111

Warning: This coverage report may be inaccurate.

Details

💛 - Coveralls

GavinHuttley commented Nov 11, 2024 •

edited by sourcery-ai bot

Loading

sourcery-ai bot commented Nov 11, 2024 •

edited

Loading