make sure that every gene base exists once in gene id file (addressin… (#155)

* make sure that every gene base exists once in gene id file (addressing #154)

* fixup! Format Python code with psf/black pull_request

---------

Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
Co-authored-by: PMBio <PMBio@users.noreply.github.com>
3 people authored Feb 12, 2025
1 parent a85ee57 commit 0d91fc9
Showing 4 changed files with 9 additions and 0 deletions.
4 changes: 4 additions & 0 deletions deeprvat/annotations/annotations.py
@@ -2022,6 +2022,10 @@ def create_gene_id_file(gtf_filepath: str, out_file: str):
         .reset_index()
         .rename(columns={"gene_id": "gene", "index": "id"})
     )
+    cols = gtf.columns
+    gtf[["gene_base", "feature"]] = gtf["gene"].str.split(".", expand=True)
+    gtf.drop_duplicates(subset=["gene_base"], inplace=True)
+    gtf = gtf[cols]
     gtf.to_parquet(out_file)
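
The four added lines can be tried in isolation on a toy frame. This is a minimal sketch, not the project's code path: the gene IDs below are made up for illustration, and the frame only assumes the `gene` and `id` columns produced by the `rename` call in the hunk.

```python
import pandas as pd

# Toy stand-in for the frame built in create_gene_id_file.
# The IDs are invented for illustration and do not come from a real GTF file.
gtf = pd.DataFrame(
    {
        "gene": [
            "ENSG00000000001.5",
            "ENSG00000000001.5_PAR_Y",
            "ENSG00000000002.3",
        ],
        "id": [0, 1, 2],
    }
)

cols = gtf.columns  # remember the original columns
# Split each ID at the dot; the part before the dot is the gene base.
gtf[["gene_base", "feature"]] = gtf["gene"].str.split(".", expand=True)
# Keep only the first row for each gene base.
gtf = gtf.drop_duplicates(subset=["gene_base"])
gtf = gtf[cols]  # drop the two helper columns again
deduplicated = gtf.reset_index(drop=True)
print(deduplicated)
```

Because `drop_duplicates` keeps the first occurrence by default, the surviving row for each gene base is whichever one appears first in the frame. Note also that the two-column assignment relies on every ID containing exactly one dot.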


5 changes: 5 additions & 0 deletions tests/annotations/test_annotations.py
@@ -552,6 +552,11 @@ def test_calculate_maf(test_data_name_dir, annotations, expected, tmp_path):
             "gencode.v44.annotation.gtf.gz",
             "protein_coding_genes.parquet",
         ),
+        (
+            "create_gene_id_file_GRCh37_47",
+            "gencode.v47lift37.basic.annotation.gtf.gz",
+            "protein_coding_genes.parquet",
+        ),
     ],
 )
 def test_create_gene_id_file(test_data_name_dir, gtf_file, expected, tmp_path):
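
The body of `test_create_gene_id_file` is not part of this diff, so how it compares the output against `protein_coding_genes.parquet` is not shown here. As a rough sketch of the invariant named in the commit message (every gene base exists once), a check could look like the following. `duplicated_gene_bases` is a hypothetical helper, not a function in the repository, and the IDs are invented.

```python
import pandas as pd


def duplicated_gene_bases(gene_ids: pd.DataFrame) -> list:
    """Return the gene bases that occur more than once (hypothetical helper)."""
    # The gene base is the part of the ID before the first dot.
    bases = gene_ids["gene"].str.split(".").str[0]
    return sorted(bases[bases.duplicated()].unique())


# Made-up IDs: "G1" appears with two different suffixes, "G2" only once.
with_duplicates = pd.DataFrame({"gene": ["G1.1", "G1.2", "G2.7"]})
clean = pd.DataFrame({"gene": ["G1.1", "G2.7"]})

print(duplicated_gene_bases(with_duplicates))
print(duplicated_gene_bases(clean))
```

An empty list from such a check would mean the gene id file satisfies the property this commit is meant to enforce.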
Binary file not shown.
Binary file not shown.
