-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Linted paper scripts and notebooks #81
Conversation
This was written collaboartively by myself, Katherine Caley and Robert McArthur in a separate repo located here https://github.com/GavinHuttley/DiverseSeqPaper That repo is now archived and all subsequent edits will go via the main DiverseSeq repository.
Reviewer's Guide by SourceryThis PR adds Python scripts and Jupyter notebooks for generating figures and analyzing results for the paper. The changes include scripts for benchmarking performance, evaluating clustering algorithms, and generating visualizations of the results. Class diagram for new classes in jsd_v_dist.pyclassDiagram
class min_dist {
-dists
-num
+__init__(dists)
+__call__(names: list[str]) -> float
}
class compare_sets {
-dist_size: int
-dvgt
+__init__(app, dist_size: int)
+main(aln: c3_types.AlignedSeqsType) -> dict
}
class make_sample {
-pool_sizes: dict
-seq_len: int
+main(num: int) -> c3_types.UnalignedSeqsType
}
class seqcoll_to_records {
-k: int
+main(seqs: c3_types.UnalignedSeqsType) -> list[KmerSeq]
}
class true_positive {
-expected: set[str]
-label2pool: callable
-size: int
-stat: str
+main(records: list[KmerSeq]) -> bool
}
Class diagram for new classes in synthetic_known.pyclassDiagram
class make_sample {
-pool_sizes: dict
-seq_len: int
+main(num: int) -> c3_types.UnalignedSeqsType
}
class seqcoll_to_records {
-k: int
+main(seqs: c3_types.UnalignedSeqsType) -> list[KmerSeq]
}
class true_positive {
-expected: set[str]
-label2pool: callable
-size: int
-stat: str
+main(records: list[KmerSeq]) -> bool
}
class eval_condition {
-num_reps
-k
-repeats
-pools
+main(settings: tuple[str, str, int]) -> c3_types.TabularType
}
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @GavinHuttley - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟡 Documentation: 1 issue found
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
return results | ||
|
||
|
||
def run_max( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Refactor duplicate code between run_max() and run_nmost()
Extract the common functionality into a helper function that both run_max() and run_nmost() can use to reduce code duplication.
def _run_common(store_path: Path, stat, **kwargs):
"""Helper function for shared functionality between run_max and run_nmost"""
results = {}
# Common implementation here
return results
def run_max(
store_path: Path,
stat,
|
||
For selecting representative sequences for large-scale analyses, we recommend use of the `nmost` command line tool. The choice of $k$ should be guided by the maximum number of unique $k$-mers in a DNA sequence of length $L$, indicated as the result of the expression $log(L/4)$. For instance, $k\approx 12$ for bacterial genomes (which are of the order $10^6$bp). For individual protein coding genes, as \autoref{fig:jsd-v-dist} indicates, $k=6$ for `nmost` gives a reasonable accuracy. For estimating a phylogeny, we recommend $k=12$ and a sketch size of $3000$. | ||
|
||
# Availability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (documentation): Consider adding specific installation command
It would be helpful to include the exact pip install command for users
# Availability | |
# Availability | |
`diverse-seq` can be installed from the Python Package Index using: | |
```pip install diverse-seq``` | |
The GitHub repository for `diverse-seq` contains all the scripts used to generate the results in this work. A script for downloading the data sets used in this work, which are available from Zenodo **10.5281/zenodo.14052787**, is also included. |
Pull Request Test Coverage Report for Build 11785936111Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
Summary by Sourcery
Lint and refactor paper scripts and notebooks to enhance code quality and maintainability, and add comprehensive documentation for the 'diverse-seq' application.
Enhancements:
Documentation: