Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "taxon_by" param to *_frequencies_advanced() functions and allow the "period_by" param to specify a column name #694

Draft
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

leehart
Copy link
Collaborator

@leehart leehart commented Dec 10, 2024

Re: parent issue #391

Re: issues #722 and #723

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@leehart leehart changed the title Add taxon_by param to snp_allele_frequencies_advanced(). Add example … Add params to snp_allele_frequencies_advanced Dec 10, 2024
@leehart
Copy link
Collaborator Author

leehart commented Dec 13, 2024

Hi @alimanfoo . Do we know how we want to allow any value in the period_by column? Looking at the current code, it's not clear to me what we want to do here.

util.py currently has:

def prep_samples_for_cohort_grouping(
    *, df_samples, area_by, period_by, taxon_by="taxon"
):
    # Take a copy, as we will modify the dataframe.
    df_samples = df_samples.copy()

    # Fix "intermediate" or "unassigned" taxon values - we only want to build
    # cohorts with clean taxon calls, so we set other values to None.
    loc_intermediate_taxon = (
        df_samples[taxon_by].str.startswith("intermediate").fillna(False)
    )
    df_samples.loc[loc_intermediate_taxon, taxon_by] = None
    loc_unassigned_taxon = (
        df_samples[taxon_by].str.startswith("unassigned").fillna(False)
    )
    df_samples.loc[loc_unassigned_taxon, taxon_by] = None

    # Add period column.
    if period_by == "year":
        make_period = _make_sample_period_year
    elif period_by == "quarter":
        make_period = _make_sample_period_quarter
    elif period_by == "month":
        make_period = _make_sample_period_month
    else:  # pragma: no cover
        raise ValueError(
            f"Value for period_by parameter must be one of 'year', 'quarter', 'month'; found {period_by!r}."
        )
    sample_period = df_samples.apply(make_period, axis="columns")
    df_samples["period"] = sample_period

    # Add area column for consistent output.
    df_samples["area"] = df_samples[area_by]

    return df_samples

with

def _make_sample_period_month(row):
    year = row.year
    month = row.month
    if year > 0 and month > 0:
        return pd.Period(freq="M", year=year, month=month)
    else:
        return pd.NaT


def _make_sample_period_quarter(row):
    year = row.year
    month = row.month
    if year > 0 and month > 0:
        return pd.Period(freq="Q", year=year, month=month)
    else:
        return pd.NaT


def _make_sample_period_year(row):
    year = row.year
    if year > 0:
        return pd.Period(freq="Y", year=year)
    else:
        return pd.NaT

@leehart
Copy link
Collaborator Author

leehart commented Jan 17, 2025

@cclarkson @ahernank @jonbrenas Any thoughts or knowledge regarding the above question for this issue?

@jonbrenas
Copy link
Collaborator

Your guess is probably as good as mine. Maybe allow to use a column that already exists as "period" or a dictionary of queries? It might be worth asking for details at the next tools meeting.

@leehart leehart changed the title Add params to snp_allele_frequencies_advanced Add "taxon_by" param to snp_allele_frequencies_advanced() Feb 4, 2025
@leehart
Copy link
Collaborator Author

leehart commented Feb 4, 2025

Restricting this PR to just adding the "taxon_by" param.

I've split the original issue into two sub-issues, in order to try to address the period_by feature request separately later, when we have a clearer idea of a path forward for that.

@leehart
Copy link
Collaborator Author

leehart commented Feb 4, 2025

To do: add to other _frequencies_advanced functions.

@leehart leehart marked this pull request as draft February 4, 2025 14:26
Copy link
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @leehart. Would be good to add this parameter to the other ..._frequencies_advanced() functions.

@leehart leehart changed the title Add "taxon_by" param to snp_allele_frequencies_advanced() Add "taxon_by" param to *_allele_frequencies_advanced() Feb 4, 2025
@leehart
Copy link
Collaborator Author

leehart commented Feb 4, 2025

To do: investigate and resolve potential merge conflict issue around _karyotype_tags_n_alt().

@leehart leehart changed the title Add "taxon_by" param to *_allele_frequencies_advanced() Add "taxon_by" param to *_frequencies_advanced() functions Feb 20, 2025
@leehart
Copy link
Collaborator Author

leehart commented Feb 21, 2025

Now that I have some clarity on the sibling issue #723 I'm tempted to include it under this PR, and then update the PR title accordingly.

…uping(). Add nb test. Remove old _make_sample_period_...() funcs from util.py.
@leehart leehart changed the title Add "taxon_by" param to *_frequencies_advanced() functions Add "taxon_by" param to *_frequencies_advanced() functions and allow the "period_by" param to specify a column name Feb 21, 2025
Copy link

codecov bot commented Feb 21, 2025

Codecov Report

Attention: Patch coverage is 75.00000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 95.97%. Comparing base (0197957) to head (3c2ce1d).

Files with missing lines Patch % Lines
malariagen_data/anoph/frq_base.py 61.53% 5 Missing ⚠️
malariagen_data/anoph/snp_frq.py 80.00% 2 Missing ⚠️
malariagen_data/anoph/cnv_frq.py 80.00% 1 Missing ⚠️
malariagen_data/anoph/hap_frq.py 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #694      +/-   ##
==========================================
- Coverage   96.13%   95.97%   -0.16%     
==========================================
  Files          47       47              
  Lines        4683     4699      +16     
==========================================
+ Hits         4502     4510       +8     
- Misses        181      189       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@leehart
Copy link
Collaborator Author

leehart commented Feb 21, 2025

To do: increase test coverage, in light of codecov test failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants