Fix JC similarity #51

Merged 5 commits on Mar 24, 2024
119 changes: 119 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,119 @@
# Changelog

All notable changes to this project will be documented in this file.


## [0.8.2] - 2024-03-09

### Data

- Update to HPO 2024-03-09

### Refactor

- Update dependencies


## [0.8.1] - 2023-06-25

### Feature

- Derive `Clone` for `Ontology`


## [0.8.0] - 2023-05-22

### Feature

- Add method to calculate hypergeometric enrichment of genes and diseases in HpoSets
- Add method to create dendrogram clusters based on similarity

### Refactor

- Allow custom Similarity implementations to use Matrix


## [0.7.1] - 2023-04-27

### Refactor

- Derive `Debug` trait on more public structs


## [0.7.0] - 2023-04-22

### Feature

- New method to retrieve the shortest path between two HpoTerm
- Add modifier flag and categories of HpoTerm

### Refactor

- Use SmallVec for HpoGroup with default size 30
- Add more benchmarks
- Improve performance for adding, or-ing and comparing HpoGroups


## [0.6.3] - 2023-04-11

### Bugfix

- Fix issue parsing new HPO masterdata format


## [0.6.2] - 2023-04-05

### Bugfix

- Fix Subontology to not include all parents or children

### Refactor

- Add benchmark tests for Criterion


## [0.6.1] - 2023-03-30

### Documentation

- Add plenty of documentation


## [0.6.0] - 2023-03-18

### Feature

- Replace obsolete terms in an HpoSet
- Allow different versions of binary masterdata

### Refactor

- Add stricter clippy rules
- Switch from `log` to `tracing`


## [0.5.0] - 2023-03-07

### Refactor

- Clean up Similarity methods
- Simplify iterators across the full crate and add new ones


## [0.4.2] - 2023-02-11

### Feature

- New similarity method: Mutation


## [0.4.0] - 2023-02-04

### Feature

- Create a sub-ontology
- Calculate hypergeometric enrichment

### Bugfix

- Collecting into a HpoGroup will maintain order of the IDs internally
32 changes: 32 additions & 0 deletions RELEASE_CHECKLIST.md
@@ -0,0 +1,32 @@
# Release checklist

This document contains the workflows to follow for all changes and releases to `hpo`.
The workflow ensures that the `main` branch always holds a functional version of `hpo` with all tests passing. The `main` branch can be ahead of the official `crates.io` release. New versions for `crates.io` releases are created independently of the regular updates and will contain all changes present in the `main` branch at that point. My goal is to automate the version bump and release process using GitHub Actions at some point.

This procedure is only a suggestion at this point and can be modified if the need arises.


## Regular updates / Normal development

- [ ] Develop in a dedicated branch (or your own fork): `git checkout -b <MY_FEATURE_NAME>`
- [ ] Rebase onto `main`: `git rebase main <MY_FEATURE_NAME>`
- [ ] Double check for good code, sensible API and well-explained docs
- [ ] Run format, clippy, tests and doc-generation: `cargo fmt --check && cargo clippy && cargo test && cargo doc`
- [ ] Push to remote: `git push -u origin <MY_FEATURE_NAME>`
- [ ] Create merge/pull request to `main` branch
- [ ] Once CI/CD passes, changes are merged to `main`


## Version bumps

- [ ] Make dedicated branch named after version: `git checkout main && git pull && git checkout -b release/<MAJOR>.<MINOR>.<PATCH>`
- [ ] Update Cargo.toml with new version
- [ ] Update dependencies if needed and possible
- [ ] Check if README or docs need update
- [ ] Add Changelog summary of changes
- [ ] Run format, clippy, tests and doc-generation: `cargo fmt --check && cargo clippy && cargo test && cargo doc`
- [ ] Add git tag with version: `git tag v<MAJOR>.<MINOR>.<PATCH>`
- [ ] Push to remote, also push tags: `git push -u origin release/<MAJOR>.<MINOR>.<PATCH> && git push --tags`
- [ ] Merge into main
- [ ] Update main branch locally: `git checkout main && git pull`
- [ ] Release to cargo: `cargo release`
9 changes: 9 additions & 0 deletions examples/compare_similarities.rs
@@ -40,8 +40,17 @@ fn main() {
     let sim1 = Builtins::new(&sim1_name, ic_kind).expect("invalid algoritm 1");
     let sim2 = Builtins::new(&sim2_name, ic_kind).expect("invalid algoritm 2");

+    let mut n_terms = 100_000;
+
+    if let Some(n) = args.next() {
+        if let Ok(items) = n.parse::<usize>() {
+            n_terms = items;
+        }
+    }
+
     let scores: Vec<String> = ontology
         .into_iter()
+        .take(n_terms)
         .par_bridge()
         .map(|term1| {
             let mut inner_score = Vec::new();
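The optional term-count argument introduced in this hunk falls back to `100_000` whenever the argument is missing or does not parse as a `usize`. A minimal sketch of the same fallback written as one expression, assuming the argument arrives as an `Option<String>`; the helper name `term_limit` is hypothetical and not part of the example file:

```rust
/// Hypothetical helper (not in the PR): turns an optional CLI argument
/// into a term limit.
///
/// Returns `default` when the argument is absent or is not a valid `usize`.
fn term_limit(arg: Option<String>, default: usize) -> usize {
    // `and_then` drops a missing argument, `ok()` drops a parse error,
    // and `unwrap_or` supplies the fallback for both cases.
    arg.and_then(|n| n.parse::<usize>().ok()).unwrap_or(default)
}
```

In the example it would presumably be called as `term_limit(args.next(), 100_000)`, assuming `args` yields `String` values. The behavior matches the nested `if let` version: invalid input is silently ignored rather than reported.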
4 changes: 2 additions & 2 deletions examples/search_by_name.rs
@@ -4,9 +4,9 @@ use hpo::Ontology;

 fn main() {
     let ontology = Ontology::from_binary("tests/ontology.hpo").unwrap();
-    let cystinosis = ontology.disease_by_name("Cystinosis").unwrap();
+    let cystinosis = ontology.omim_disease_by_name("Cystinosis").unwrap();
     println!("first match: {:?}", cystinosis.name());
-    for result in ontology.diseases_by_name("Cystinosis") {
+    for result in ontology.omim_diseases_by_name("Cystinosis") {
         println!("{:?}", result.name());
     }
 }
67 changes: 39 additions & 28 deletions src/ontology.rs
@@ -3,7 +3,7 @@ use std::collections::hash_map::Values;
 use std::collections::{HashMap, HashSet};
 use std::fs::File;
 use std::io::Read;
-use std::iter::Filter;
+
 use std::ops::BitOr;
 use std::path::Path;

@@ -322,14 +322,27 @@ impl Debug for Ontology {
     }
 }

-pub struct DiseaseIter<'a, F> {
-    inner: Filter<Values<'a, OmimDiseaseId, OmimDisease>, F>,
+/// Iterates [`OmimDisease`] that match the query string
+///
+/// This struct is returned by [`Ontology::omim_diseases_by_name`]
+pub struct OmimDiseaseFilter<'a> {
+    iter: Values<'a, OmimDiseaseId, OmimDisease>,
+    query: &'a str,
 }
+
+impl<'a> OmimDiseaseFilter<'a> {
+    fn new(iter: Values<'a, OmimDiseaseId, OmimDisease>, query: &'a str) -> Self {
+        OmimDiseaseFilter { iter, query }
+    }
+}

-impl<'a, F: FnMut(&&'a OmimDisease) -> bool + 'a> Iterator for DiseaseIter<'a, F> {
+impl<'a> Iterator for OmimDiseaseFilter<'a> {
     type Item = &'a OmimDisease;
+
     fn next(&mut self) -> Option<Self::Item> {
-        self.inner.next()
+        self.iter
+            .by_ref()
+            .find(|&item| item.name().contains(self.query))
     }
 }
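This hunk swaps a generic `Filter<_, F>` wrapper for a named struct that stores the query itself, so the iterator type no longer carries a closure type parameter. The sketch below shows the same pattern in isolation. `Disease`, `NameFilter`, `by_name`, `sample` and the `u32` keys are stand-ins invented for this illustration and are not types from the crate:

```rust
use std::collections::hash_map::Values;
use std::collections::HashMap;

/// Stand-in for `OmimDisease`; only the name matters in this sketch.
struct Disease {
    name: String,
}

/// Named iterator that yields every disease whose name contains `query`.
struct NameFilter<'a> {
    iter: Values<'a, u32, Disease>,
    query: &'a str,
}

impl<'a> Iterator for NameFilter<'a> {
    type Item = &'a Disease;

    fn next(&mut self) -> Option<Self::Item> {
        // Copy the `&str` out first so the closure does not need to
        // borrow `self` while `self.iter` is mutably borrowed.
        let query = self.query;
        self.iter.find(|item| item.name.contains(query))
    }
}

/// Hypothetical counterpart of `omim_diseases_by_name` for the stand-in types.
fn by_name<'a>(diseases: &'a HashMap<u32, Disease>, query: &'a str) -> NameFilter<'a> {
    NameFilter {
        iter: diseases.values(),
        query,
    }
}

/// Small in-memory map with placeholder IDs, for demonstration only.
fn sample() -> HashMap<u32, Disease> {
    vec![
        (1, "Cystinosis, nephropathic"),
        (2, "Cystinosis, adult nonnephropathic"),
        (3, "Macdermot-Winter syndrome"),
    ]
    .into_iter()
    .map(|(id, name)| (id, Disease { name: name.to_string() }))
    .collect()
}
```

Because the struct has a concrete name, callers can store it in a field or name it in a signature, which is not possible with an `impl FnMut` type parameter.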

@@ -767,20 +780,12 @@ impl Ontology {
     /// use hpo::Ontology;
     /// let ontology = Ontology::from_binary("tests/example.hpo").unwrap();
     ///
-    /// for result in ontology.diseases_by_name("Cystinosis") {
+    /// for result in ontology.omim_diseases_by_name("Cystinosis") {
     /// println!("{:?}", result.name());
     /// }
     /// ```
-    pub fn diseases_by_name<'a>(
-        &'a self,
-        substring: &'a str,
-    ) -> DiseaseIter<impl FnMut(&&'a OmimDisease) -> bool + 'a> {
-        DiseaseIter {
-            inner: self
-                .omim_diseases
-                .values()
-                .filter(move |disease| disease.name().contains(substring)),
-        }
+    pub fn omim_diseases_by_name<'a>(&'a self, substring: &'a str) -> OmimDiseaseFilter {
+        OmimDiseaseFilter::new(self.omim_diseases.values(), substring)
     }

     /// Returns the first matching [`OmimDisease`] whose name contains the provided
@@ -794,9 +799,9 @@ impl Ontology {
     /// use hpo::Ontology;
     /// let ontology = Ontology::from_binary("tests/example.hpo").unwrap();
     ///
-    /// let cystinosis = ontology.disease_by_name("Cystinosis");
+    /// let cystinosis = ontology.omim_disease_by_name("Cystinosis");
     /// ```
-    pub fn disease_by_name(&self, substring: &str) -> Option<&OmimDisease> {
+    pub fn omim_disease_by_name(&self, substring: &str) -> Option<&OmimDisease> {
         self.omim_diseases
             .values()
             .find(|&disease| disease.name().contains(substring))
@@ -1815,25 +1820,31 @@ mod test {
     #[test]
     fn diseases_by_name() {
         let ont = Ontology::from_binary("tests/example.hpo").unwrap();
-        assert_eq!(ont.diseases_by_name("Cystinosis").count(), 3);
-        assert_eq!(ont.diseases_by_name("Macdermot-Winter syndrome").count(), 1);
-        assert_eq!(ont.diseases_by_name("anergictcell syndrome").count(), 0);
+        assert_eq!(ont.omim_diseases_by_name("Cystinosis").count(), 3);
+        assert_eq!(
+            ont.omim_diseases_by_name("Macdermot-Winter syndrome")
+                .count(),
+            1
+        );
+        assert_eq!(
+            ont.omim_diseases_by_name("anergictcell syndrome").count(),
+            0
+        );

-        let cystinosis = vec![
+        let cystinosis = [
             "Cystinosis, adult nonnephropathic",
             "Cystinosis, late-onset juvenile or adolescent nephropathic",
             "Cystinosis, nephropathic",
         ];
+        assert!(cystinosis.contains(&ont.omim_disease_by_name("Cystinosis").unwrap().name()));
+
         assert_eq!(
-            cystinosis.contains(&ont.disease_by_name("Cystinosis").unwrap().name()),
-            true
-        );
-        assert_eq!(
-            ont.disease_by_name("Macdermot-Winter syndrome")
+            ont.omim_disease_by_name("Macdermot-Winter syndrome")
                 .unwrap()
                 .name(),
             "Macdermot-Winter syndrome"
         );
-        assert_eq!(ont.disease_by_name("anergictcell syndrome").is_none(), true);
+
+        assert!(ont.omim_disease_by_name("anergictcell syndrome").is_none());
     }
 }
24 changes: 17 additions & 7 deletions src/similarity/defaults.rs
@@ -142,13 +142,19 @@ impl Similarity for Lin {

 /// Similarity score from Jiang & Conrath
 ///
-/// For a detailed description see [Jiang J, Conrath D, ROCLING X, (1997)](https://aclanthology.org/O97-1002.pdf)
+/// For a detailed description see [Jiang J, Conrath D, Rocling X, (1997)](https://aclanthology.org/O97-1002.pdf)
 ///
 /// # Note
 ///
-/// This algorithm is an implementation as described in the paper cited above. It is different
-/// from the `JC` implementation in the `HPOSim` R library. It is identical to the `JC2`
-/// implementation in [`PyHPO`](https://pypi.org/project/pyhpo/)
+/// This algorithm is an implementation as described in the paper cited above, with minor
+/// modifications. It is different from the `JC` implementation in the `HPOSim` R library.
+/// For a discussion on the correct implementation see
+/// [this issue from pyhpo](https://github.com/anergictcell/pyhpo/issues/20).
+///
+/// # Note
+///
+/// The logic of the JC similarity was changed in version `0.8.3`. Ensure you update
+/// to at least `0.8.3` before using it.
 #[derive(Debug)]
 pub struct Jc {
     kind: InformationContentKind,
@@ -179,12 +185,16 @@ impl Similarity for Jc {
             return 1.0;
         }

-        let ic_combined = a.information_content().get_kind(&self.kind)
-            + b.information_content().get_kind(&self.kind);
+        let ic1 = a.information_content().get_kind(&self.kind);
+        let ic2 = b.information_content().get_kind(&self.kind);
+
+        if ic1 == 0.0 || ic2 == 0.0 {
+            return 0.0;
+        }

         let resnik = Resnik::new(self.kind).calculate(a, b);

-        1.0 - (ic_combined - 2.0 * resnik)
+        1.0 / (ic1 + ic2 - 2.0 * resnik + 1.0)
     }
 }
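The effect of the formula change is easiest to see with plain numbers. The sketch below reproduces both expressions from this hunk as free functions; it assumes the information content and the Resnik score are already available as `f32` values, and the names `jc_old` and `jc_new` are made up for this illustration. The early return for identical terms is left out.

```rust
/// JC similarity as computed before this PR:
/// `1 - (IC(a) + IC(b) - 2 * resnik)`.
/// The distance term is unbounded, so the result can drop below zero.
fn jc_old(ic1: f32, ic2: f32, resnik: f32) -> f32 {
    1.0 - (ic1 + ic2 - 2.0 * resnik)
}

/// JC similarity as computed after this PR:
/// `1 / (IC(a) + IC(b) - 2 * resnik + 1)`,
/// with an early `0.0` when either term carries no information.
fn jc_new(ic1: f32, ic2: f32, resnik: f32) -> f32 {
    if ic1 == 0.0 || ic2 == 0.0 {
        return 0.0;
    }
    1.0 / (ic1 + ic2 - 2.0 * resnik + 1.0)
}
```

With information contents of 4.0 and 6.0 and a Resnik score of 2.0, the old expression evaluates to -5.0, while the new one evaluates to 1/7 (about 0.143). Assuming the Resnik score never exceeds either term's information content, the new denominator is at least 1, so the result stays within (0, 1].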
20 changes: 10 additions & 10 deletions src/stats.rs
@@ -219,6 +219,16 @@ impl<K: Clone> Iterator for Counts<'_, K> {
     }
 }

+/// We have to frequently do divisions starting with u64 values
+/// and need to return f64 values. To ensure some kind of safety
+/// we use this method to panic in case of overflows.
+fn f64_from_u64(n: u64) -> f64 {
+    let intermediate: u32 = n
+        .try_into()
+        .expect("cannot safely create f64 from large u64");
+    intermediate.into()
+}
+
 #[cfg(test)]
 mod test {
     use super::*;
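`f64_from_u64` routes the value through `u32`, so any input above `u32::MAX` panics. For comparison, here is a sketch of a non-panicking variant with the same bound; `checked_f64_from_u64` is a hypothetical name and not part of this PR:

```rust
use std::convert::TryFrom;

/// Hypothetical non-panicking counterpart of `f64_from_u64`.
///
/// Returns `None` when `n` does not fit into a `u32`, which is the same
/// condition under which the original function panics.
fn checked_f64_from_u64(n: u64) -> Option<f64> {
    // `u32 -> f64` is lossless, so `f64::from` is available for it.
    u32::try_from(n).ok().map(f64::from)
}
```

The `u32` bound is conservative: an `f64` represents every integer up to 2^53 exactly, but the standard library only offers a lossless `From` conversion for integer types of at most 32 bits.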
@@ -263,13 +273,3 @@ mod test {
         assert!(iter.next().is_none());
     }
 }
-
-/// We have to frequently do divisions starting with u64 values
-/// and need to return f64 values. To ensure some kind of safety
-/// we use this method to panic in case of overflows.
-fn f64_from_u64(n: u64) -> f64 {
-    let intermediate: u32 = n
-        .try_into()
-        .expect("cannot safely create f64 from large u64");
-    intermediate.into()
-}