Two small bugfixes to 02-EvidenceQC #666

Merged
merged 1 commit into from
Apr 29, 2024

RCollins13
Contributor

When running EvidenceQC.wdl on ~30k samples from the NIH AllOfUs cohort, I encountered two unrelated issues with the MakeQcTable task in EvidenceQC.wdl:

  1. EvidenceQC.wdl supports optionally disabling VCF QC, but the read_all_outlier() function in make_evidence_qc_table.py exits with an error when there are exactly zero outlier samples:
Traceback (most recent call last):
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 269, in <module>
    main()
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 253, in main
    merge_evidence_qc_table(
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 178, in merge_evidence_qc_table
    df_total_high_outliers = read_all_outlier(df_manta_high_outlier, df_melt_high_outlier, df_wham_high_outlier, "high")
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 150, in read_all_outlier
    all_outliers_df.columns = [ID_COL, outlier_type + "_overall_outliers"]
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/generic.py", line 5588, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/generic.py", line 769, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 214, in set_axis
    self._validate_set_axis(axis, new_labels)
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/internals/base.py", line 69, in _validate_set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements

I solved this by adding a conditional statement to check if there are zero outliers, in which case the function returns an empty dataframe with the expected headers (and this allows the rest of the script to run successfully).
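
For illustration, the guard described above could look roughly like the sketch below. This is not the code from the merged commit: count_overall_outliers() is a hypothetical stand-in for read_all_outlier(), and the value of ID_COL is assumed, since neither the function body nor the constant's value appears in this thread.

```python
import pandas as pd

ID_COL = "sample_id"  # assumed value; the real constant is defined in make_evidence_qc_table.py


def count_overall_outliers(outlier_dfs, outlier_type):
    """Hypothetical stand-in for read_all_outlier(): count outlier calls per sample."""
    count_col = outlier_type + "_overall_outliers"
    # Stack the ID column of every per-caller outlier table into one Series.
    ids = pd.concat([df[ID_COL] for df in outlier_dfs], ignore_index=True)
    if ids.empty:
        # Zero outliers: return an empty frame that still carries both expected headers,
        # so downstream merges on ID_COL keep working.
        return pd.DataFrame(columns=[ID_COL, count_col])
    # One row per sample, with the number of callers that flagged it.
    return ids.value_counts().rename_axis(ID_COL).reset_index(name=count_col)
```

Calling it with three empty tables returns a zero-row dataframe whose columns are ID_COL and "high_overall_outliers" rather than raising a length-mismatch error.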

  2. Dataframe merging in merge_evidence_qc_table() fails for cohorts where every sample has an integer ID. This seems to be because pandas reads some of the ID columns as dtype object and others as dtype int64, which leads to this error:
Traceback (most recent call last):
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 269, in <module>
    main()
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 253, in main
    merge_evidence_qc_table(
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 190, in merge_evidence_qc_table
    output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
  File "/opt/sv-pipeline/scripts/make_evidence_qc_table.py", line 190, in <lambda>
    output_df = reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), dfs)
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 107, in merge
    op = _MergeOperation(
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 704, in __init__
    self._maybe_coerce_merge_keys()
  File "/opt/conda/envs/gatk-sv/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 1257, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat

I fixed this by forcing all ID columns to dtype object prior to merging, which resolves this error.
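
A minimal sketch of normalizing the merge key before the reduce() call is shown below. It is an illustration under assumptions, not the merged change: merge_on_id() is a hypothetical helper, the value of ID_COL is assumed, and the sketch casts IDs to str rather than plain object so that the integer 101 and the string "101" end up as the same key.

```python
from functools import reduce

import pandas as pd

ID_COL = "sample_id"  # assumed value; the real constant is defined in make_evidence_qc_table.py


def merge_on_id(dfs):
    """Hypothetical helper: outer-merge tables on ID_COL after unifying the key type."""
    # Cast every ID column to strings so all merge keys share one type.
    # (Assumption: the PR text says "dtype object"; str is used here so that
    # an int ID and its string form compare equal.)
    casted = [df.astype({ID_COL: str}) for df in dfs]
    return reduce(lambda left, right: pd.merge(left, right, on=ID_COL, how="outer"), casted)


# One table parsed its IDs as integers, the other as strings.
table_a = pd.DataFrame({ID_COL: [101, 102], "x": [1.0, 2.0]})
table_b = pd.DataFrame({ID_COL: ["101", "103"], "y": ["a", "b"]})
merged = merge_on_id([table_a, table_b])
```

Here merged has one row per distinct ID across both tables, with missing values where a sample is absent from one of them.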

(Both of these were encountered when using Docker image us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2024-03-04-v0.28.4-beta-f0ad3f0f, but based on the edit history of make_evidence_qc_table.py my impression is these should reflect the current main branch)

Thanks!
Ryan

Collaborator

@mwalker174 mwalker174 left a comment

Thanks @RCollins13! Just a reminder that sample IDs shouldn't be numeric, so make sure you change those before running GatherBatchEvidence. See https://github.com/broadinstitute/gatk-sv?tab=readme-ov-file#sampleids

@mwalker174 mwalker174 merged commit 10c8a22 into broadinstitute:main Apr 29, 2024
4 checks passed