diff --git a/.gitignore b/.gitignore index 53176ba..f50a47e 100644 --- a/.gitignore +++ b/.gitignore @@ -30,3 +30,5 @@ databases/pharmGKB/CREATED_* databases/string/protein.links.full.txt.gz databases/gene_ontology/logs.json databases/string/protein.info.txt.gz +databases/gene_info/gene_info_grch37.csv +databases/gene_info/gene_info_grch38.csv diff --git a/DEPLOYING.md b/DEPLOYING.md index e8c75d3..65e308a 100644 --- a/DEPLOYING.md +++ b/DEPLOYING.md @@ -2,22 +2,22 @@ Below are the steps to perform a production deploy of BioAPI. - ## Requirements - BioAPI deployment was configured to be simple from the Docker Compose tool. So you need to install: - - [docker](https://docs.docker.com/desktop/#download-and-install) - - [docker-compose](https://docs.docker.com/compose/install/) - + - [docker](https://docs.docker.com/desktop/#download-and-install) + - [docker-compose](https://docs.docker.com/compose/install/) ## Instructions 1. Create MongoDB Docker volumes: + ```bash docker volume create --name=bio_api_mongo_data docker volume create --name=bio_api_mongo_config docker volume create --name=bio_api_logs ``` + 2. Make a copy of `docker-compose_dist.yml` with the name `docker-compose.yml` and set the environment variables that are empty with data. They are listed below by category: - MongoDB Server: - `MONGO_INITDB_ROOT_USERNAME` and `MONGO_INITDB_ROOT_PASSWORD`: These variables are the username and password that will be created in MongoDB for the "*admin*" database. @@ -28,16 +28,14 @@ Below are the steps to perform a production deploy of BioAPI. 3. (Optional) Optimize Mongo by changing the configuration in the `config/mongo/mongod.conf` file and uncommenting the reference in the `docker-compose.yml` and/or `docker-compose.dev.yml`. 4. Start up all the services with Docker Compose running `docker compose up -d` to check that It's all working, and read the instructions in the following section to import the genomics databases. 
- ## Importing genomic data BioAPI uses three genomic databases for its operation. These databases must be loaded in MongoDB. You can import all the databases in two ways: - ### Import using public DB backup (recommended) To import all databases in MongoDB: - + 1. Download the "bioapi_db.gz" from **[here](https://drive.google.com/file/d/1oBdhC-XoJn-VNEIEpfMWB2Gna--WZ1Wa/view?usp=sharing)** 2. Shutdown all the services running `docker compose down` 3. Edit the `docker-compose.dev.yml` file to include the downloaded file inside the container: @@ -51,24 +49,26 @@ To import all databases in MongoDB: - /path/to/bioapi_db.gz:/bioapi_db.gz # ... ``` - Where "/path/to/" is the absolute path of the "bioapi_db.gz" file downloaded on step 1. -4. Start up the DB service running `docker compose -f docker-compose.dev.yml up -d mongo_bioapi` + Where "/path/to/" is the absolute path of the "bioapi_db.gz" file downloaded on step 1. +4. Start up the services again running `docker compose up -d` 5. Go inside the container `docker container exec -it bio_api_mongo_db bash` 6. Use Mongorestore to import it into MongoDB: - ```bash + +```bash mongorestore --username --password --authenticationDatabase admin --gzip --archive=/bioapi_db.gz - ``` - Where *\*, *\* are the preconfigured credentials to MongoDB in the `docker-compose.dev.yml` file. *bioapi_db.gz* is the file downloaded in the previous step. **Keep in mind that this loading process will import approximately *47 GB* of information into MongoDB, so it may take a while**. +``` + + Where *\*, *\* are the preconfigured credentials to MongoDB in the `docker-compose.yml` file. *bioapi_db.gz* is the file downloaded in the previous step. **Keep in mind that this loading process will import approximately *47 GB* of information into MongoDB, so it may take a while**. + 7. Stop services with the command `docker compose -f docker-compose.dev.yml down` 8. 
Rollup the changes in `docker-compose.dev.yml` file to remove the backup file from the `volumes` section. Restart all the services again. - ### Manually import the different databases Alternatively (but **not recommended** due to high computational demands) you can run a separate ETL process to download from source, process and import the databases into MongoDB. 1. Install the necessary requirements: - - [R language](https://www.r-project.org/). Version 4.1.2 or later (Only necessary if you want to update the Gene information database from Ensembl) + - [R language](https://www.r-project.org/). Version 4.3.2 (Only necessary if you want to update the Gene information database from Ensembl and CiVIC) - Some python packages. They can be installed using: `pip install -r config/genomic_db_conf/requirements.txt` 2. The ETL process is programmed in a single bash script for each database. Edit in the bash file of the database that you want to update the **user** and **password** parameters, using the same values that you set in the `docker-compose.yml` file. Bash files can be found in the *'databases'* folder, within the corresponding directories for each database: @@ -77,19 +77,18 @@ Alternatively (but **not recommended** due to high computational demands) you ca - For Gene expression ([Genotype-Tissue Expression (GTEx)](https://gtexportal.org/home/)) use "databases/gtex" directory and the *gtex2mongodb.sh* file. - For Gene information ([Ensembl genomic data](https://www.ensembl.org/biomart/martview/), [RefSeq gene summaries](https://www.ncbi.nlm.nih.gov/refseq/), and [CiVIC gene descriptions](https://civicdb.org/welcome)) use "databases/gene_info" directory and the *geneinfo2mongodb.sh* file. 
- For Oncokb cancer genes and drug information, it is necessary to download 3 datasets from their [official site](https://www.oncokb.org/) (**registration required**) and place it within the directory "databases/oncokb": - - *Therapeutic, Diagnostic, and Prognostic dataset*: Download this dataset from [Actionable Genes page](https://www.oncokb.org/actionableGenes) by clicking the _Association_ button. Save it with the name "oncokb_biomarker_drug_associations.tsv". - - *Cancer Genes dataset*: Download the dataset from the [Cancer Genes](https://www.oncokb.org/cancerGenes) by clicking the _Cancer Gene List_ button. Save it with the name "cancer_gene_list.tsv". - - *Precision Oncology Therapies dataset*: Download this dataset from [Precision Oncology Therapies page](https://www.oncokb.org/precision-oncology-therapies) by clicking the _Download Table_ button. Save it with the name "oncokb_precision_oncology_therapies.tsv". - - To import all this dataset to MongoDB, execute the oncokb2mongodb.sh script. - - For cancer related drugs ([Pharmacogenomics Knowledge Base (PharmGKB) ](https://www.pharmgkb.org/)) use "databases\pharmGKB" directory and the *pharmgkb2mongodb.sh* file. - - For Gene ontology ([Gene Ontology (GO)](http://geneontology.org/)) use "databases\gene_ontology" directory and the *go2mongodb.sh* file. **NOTE:** This import needs the "Gene nomenclature" databases (2) already imported to properly process the gene ontology databases + - *Therapeutic, Diagnostic, and Prognostic dataset*: Download this dataset from [Actionable Genes page](https://www.oncokb.org/actionableGenes) by clicking the *Association* button. Save it with the name "oncokb_biomarker_drug_associations.tsv". + - *Cancer Genes dataset*: Download the dataset from the [Cancer Genes](https://www.oncokb.org/cancerGenes) by clicking the *Cancer Gene List* button. Save it with the name "cancer_gene_list.tsv". 
+ - *Precision Oncology Therapies dataset*: Download this dataset from [Precision Oncology Therapies page](https://www.oncokb.org/precision-oncology-therapies) by clicking the *Download Table* button. Save it with the name "oncokb_precision_oncology_therapies.tsv". To import all this dataset to MongoDB, execute the oncokb2mongodb.sh script. + - For cancer related drugs ([Pharmacogenomics Knowledge Base (PharmGKB)](https://www.pharmgkb.org/)) use "databases\pharmGKB" directory and the *pharmgkb2mongodb.sh* file. + - For Gene ontology ([Gene Ontology (GO)](http://geneontology.org/)) use "databases\gene_ontology" directory and the *go2mongodb.sh* file. **NOTE:** This import needs the "Gene nomenclature" databases (2) already imported to properly process the gene ontology databases - For predicted functional associations network (String) it is necessary to download some datasets from their [official site](https://string-db.org/cgi/download), make sure that the **selected organism is Homo Sapiens** (the file sizes should be in Mb), from "INTERACTION DATA" download "protein network data (full network, incl. distinction: direct vs. interologs)" and rename it to "protein.links.full.txt.gz" then from "ACCESSORY DATA" download "list of STRING proteins incl. their display names and descriptions" and rename it to "protein.aliases.txt.gz", place the 2 files in the "databases\string". 3. Run bash files. `./` where file.sh can be *cpdb2mongodb.sh*, *hgnc2mongodb.sh*, *gtex2mongodb.sh*, *go2mongodb.sh*, *string2mongodb.sh*, *pharmgkb2mongodb.sh*, or *ensembl_gene2mongodb.sh*, as appropriate. ## Run BioAPI + Use docker compose to get the BioAPI up: ```bash @@ -104,7 +103,6 @@ If you want to stop all services, you can execute: docker-compose down ``` - ### See container status To check the different services' status you can run: @@ -115,13 +113,14 @@ docker-compose logs Where *\* could be `nginx_bioapi`, `web_bioapi` or `mongo_bioapi`. 
- ## Update genomic databases + If new versions are released for the genomic databases included in BioAPI, you can update them by following the instructions below: + - For the "Metabolic pathways (ConsensusPathDB)", "Gene nomenclature (HUGO Gene Nomenclature Committee)", "Gene ontology (GO)", "Cancer related drugs (PharmGKB)","Gene information (from Ensembl and CiVIC)" and "Cancer and Actionable genes (OncoKB)" databases, it is not necessary to make any modifications to any script. This is because the datasets are automatically downloaded in their most up-to-date versions when the bash file for each database is executed as described in the **Manually import the different databases** section of this file. -**Important notes**: +**Important notes**: - For OncoKB the download is not automatic since it requires registration, but the steps to download them manually are explained in the same section mentioned above. - - For RefSeq gene summaries, the R package [GeneSummary](https://bioconductor.org/packages/release/data/annotation/html/GeneSummary.html) is used. The update of the database will depend on the version that the package includes. + - For RefSeq gene summaries, the R package [GeneSummary](https://bioconductor.org/packages/release/data/annotation/html/GeneSummary.html) is used. The update of the database will depend on the version that the package includes. - For String the download is not automatic, but the steps to download them manually are explained in the same section mentioned above. - If you need to update the "Gene expression (Genotype-Tissue Expression)" database, you should also follow the procedures in the section named above, but first you should edit the bash file as follows: 1. Modify the **gtex2mongodb.sh** file. Edit the variables *"expression_url"* and *"annotation_url"*. @@ -129,4 +128,35 @@ If new versions are released for the genomic databases included in BioAPI, you c 1. 
In the *"annotation_url"* variable, set the url corresponding to the file that contains the annotated samples and allows finding the corresponding tissue type for each sample in the database. By default, GTEx is being used in its version [GTEx Analysis V8 (dbGaP Accession phs000424.v8.p2)](https://gtexportal.org/home/datasets#datasetDiv1) -**NOTE:** It is NOT necessary to drop the MongoDB database before upgrading (this applies to all databases). +**NOTE:** It is NOT necessary to drop the MongoDB database before upgrading (this applies to all databases). + +### Export image file from database + +Finally, if you want to create a new image of MongoDB data, you can follow these steps: + +1. Shutdown all the services running `docker compose down` +2. Edit the `docker-compose.dev.yml` file to link a container folder to a folder on your computer: + + ```yml + # ... + mongo: + image: mongo:6.0.2 + # ... + volumes: + # ... + - /path/in/your/computer:/app + # ... + ``` + + Where "/path/in/your/computer" is the absolute path to the directory on your computer where the mongodb image will be created +3. Start up the services of BioAPI running `docker compose up -d` +4. Go inside the container `docker container exec -it bio_api_mongo_db bash` +5. Use mongodump to export the data to a file: + +```bash + mongodump --username --password --authenticationDatabase admin --host localhost --port 27017 --gzip --db bio_api --archive=/app/bioapi_db.gz +``` + +**NOTE**: The process can take a few hours + +The new image can be found at *"/path/in/your/computer/bioapi_db.gz"* diff --git a/README.md b/README.md index 7b274bc..86176a8 100644 --- a/README.md +++ b/README.md @@ -4,25 +4,27 @@ A powerful abstraction of genomics databases. Bioapi is a REST API that provides This document is focused on the **development** of the system. If you are looking for documentation for a production deployment see [DEPLOYING.md](DEPLOYING.md).
- ## Integrated databases BioAPI obtains information from different bioinformatics databases. These databases were installed locally to reduce data search time. The databases currently integrated to BioAPI are: + 1. Gene nomenclature: [HUGO Gene Nomenclature Committee](https://www.genenames.org/). HGNC is the resource for approved human gene nomenclature. Downloaded from its official website in September 2022. 2. Gene information: - - [ENSEMBL](http://www.ensembl.org/biomart/martview): BioMart data mining tool was used to obtain a gene-related dataset from Ensembl. Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data. Downloaded using *BioMart data mining tool* in September 2022. - - [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/): the summary of each human gene was obtained from the RefSeq database. RefSeq (Reference Sequence) is the public database of annotated and curated nucleic acid (DNA and RNA) and protein sequences from the National Center for Biotechnology Information (NCBI). To obtain the summaries, the R package called [GeneSummary](https://bioconductor.org/packages/release/data/annotation/html/GeneSummary.html) was used, which obtains the abstracts from version 214 of RefSeq. - - [CiVIC](https://civicdb.org/welcome): a description of the genes oriented to clinical interpretation in cancer was obtained from the CiVIC database, an open-source platform supporting crowd sourced and expert-moderated cancer variant curation. The database was downloaded from its official website in April 2023. + + - [ENSEMBL](http://www.ensembl.org/biomart/martview): BioMart data mining tool was used to obtain a gene-related dataset from Ensembl. 
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data. Downloaded using *BioMart data mining tool* in September 2022. + - [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/): the summary of each human gene was obtained from the RefSeq database. RefSeq (Reference Sequence) is the public database of annotated and curated nucleic acid (DNA and RNA) and protein sequences from the National Center for Biotechnology Information (NCBI). To obtain the summaries, the R package called [GeneSummary](https://bioconductor.org/packages/release/data/annotation/html/GeneSummary.html) was used, which obtains the abstracts from version 214 of RefSeq. + - [CiVIC](https://civicdb.org/welcome): a description of the genes oriented to clinical interpretation in cancer was obtained from the CiVIC database, an open-source platform supporting crowd sourced and expert-moderated cancer variant curation. The database was downloaded from its official website in April 2023. + 3. Metabolic pathways: [ConsensusPathDB](http://cpdb.molgen.mpg.de/). -ConsensusPathDB-human integrates interaction networks in Homo sapiens including binary and complex protein-protein, genetic, metabolic, signaling, gene regulatory and drug-target interactions, as well as biochemical pathways. Data originate from currently 31 public resources for interactions (listed below) and interactions that we have curated from the literature. The interaction data are integrated in a complementary manner (avoiding redundancies), resulting in a seamless interaction network containing different types of interactions. Downloaded from its official website in September 2022. 
+ConsensusPathDB-human integrates interaction networks in Homo sapiens including binary and complex protein-protein, genetic, metabolic, signaling, gene regulatory and drug-target interactions, as well as biochemical pathways. Data originate from currently 31 public resources for interactions (listed below) and interactions that we have curated from the literature. The interaction data are integrated in a complementary manner (avoiding redundancies), resulting in a seamless interaction network containing different types of interactions. Downloaded from its official website in September 2022. 4. Gene expression: [Genotype-Tissue Expression (GTEx)](https://gtexportal.org/home/). The Genotype-Tissue Expression (GTEx) project is an ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation. Samples were collected from 54 non-diseased tissue sites across nearly 1000 individuals, primarily for molecular assays including WGS, WES, and RNA-Seq. GTEx is being used in its version [GTEx Analysis V8 (dbGaP Accession phs000424.v8.p2)](https://gtexportal.org/home/datasets#datasetDiv1) and was downloaded from its official website in September 2022. 5. Therapies and actionable genes in cancer: [OncoKB](https://www.oncokb.org/). OncoKB™ is a precision oncology knowledge base developed at Memorial Sloan Kettering Cancer Center that contains biological and clinical information about genomic alterations in cancer. Alteration- and tumor type-specific therapeutic implications are classified using the OncoKB™ [Levels of Evidence system](https://www.oncokb.org/levels), which assigns clinical actionability to individual mutational events. Downloaded from its official website in November 2023. 6. Gene Ontology [Gene Ontology (GO)](http://geneontology.org/). It is a project to develop an up-to-date, comprehensive, computational model of biological systems, from the molecular level to larger pathways, cellular and organism-level systems. 
It provides structured and standardized annotations of gene products, in a hierarchical system of terms and relationships that describes the molecular functions, biological processes, and cellular components associated with genes and gene products. Downloaded from its official website in June 2023 -7. Cancer related drugs [Pharmacogenomics Knowledge Base (PharmGKB) ](https://www.pharmgkb.org/). +7. Cancer related drugs [Pharmacogenomics Knowledge Base (PharmGKB)](https://www.pharmgkb.org/). It is a resource that provides information about how human genetic variation affects response to medications. PharmGKB collects, curates and disseminates knowledge about clinically actionable gene-drug associations and genotype-phenotype relationships. Downloaded from its official website in June 2023 8. Predicted functional associations network [STRING](https://string-db.org/) It is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases. @@ -38,14 +40,14 @@ Searches the identifier of a list of genes of different genomics databases and r - URL: /gene-symbols - Method: POST - Params: A body in Json format with the following content - - `gene_ids` : list of identifiers that you want to get your approved symbols + - `gene_ids` : list of identifiers that you want to get your approved symbols - Success Response: - - Code: 200 - - Content: - - ``: Returns a Json with as many keys as there are genes in the body. For each gene, the value is a list with the valid symbols. - - Example: - - URL: http://localhost:8000/gene-symbols - - body: + - Code: 200 + - Content: + - ``: Returns a Json with as many keys as there are genes in the body. For each gene, the value is a list with the valid symbols. 
+ - Example: + - URL: + - body: `{ "gene_ids":[ "BRCA1", @@ -54,7 +56,8 @@ Searches the identifier of a list of genes of different genomics databases and r "FANCS" ] }` - - Response: + - Response: + ```json { "BRCA1":[ @@ -73,22 +76,22 @@ Searches the identifier of a list of genes of different genomics databases and r } ``` - ### Genes symbols finder Service that takes a string of any length and returns a list of genes that contain that search criteria. - URL: /gene-symbols-finder - Method: GET -- Params: - - `query` : gene search string - - `limit`: number of elements returned by the service. Default 50. +- Params: + - `query` : gene search string + - `limit`: number of elements returned by the service. Default 50. - Success Response: - - Code: 200 - - Content: a list of gene symbols matching the search criteria. - - Example: - - URL: http://localhost:8000/gene-symbols-finder/?limit=50&query=BRC - - Response: + - Code: 200 + - Content: a list of gene symbols matching the search criteria. + - Example: + - URL: + - Response: + ```json [ "BRCA1", @@ -99,7 +102,6 @@ Service that takes a string of any length and returns a list of genes that conta ] ``` - ### Genes information From a list of valid genes, it obtains different information for the human reference genomes GRCh38 and GRCh37. @@ -107,41 +109,42 @@ From a list of valid genes, it obtains different information for the human refer - URL: /information-of-genes - Method: POST - Params: A body in Json format with the following content - - `gene_ids` : list of valid genes identifiers + - `gene_ids` : list of valid genes identifiers - Success Response: - - Code: 200 - - Content: - - ``: Returns a Json with as many keys as there are valid genes in the body. 
For each gene, the value is a Json with the following format: - - `alias_symbol`: alternative symbols for a known gene - - `percentage_gene_gc_content`: Ratio of guanine and cytosine nucleotides in the DNA sequence of the gene - - `oncokb_cancer_gene`: return "Oncogene" or "Tumor Suppressor Gene" only if the gene has this information in the OncoKB database - - `name`: gene name according to the HGNC database - - `band`: cytoband or specific location in the genome - - `chromosome`: chromosome where the gene is located - - `start_position`: chromosomal position of gene starts for the reference genome GRCh38 - - `end_position`: chromosomal position of gene ends for the reference genome GRCh38 - - `start_GRCh37`: chromosomal position of gene starts for the reference genome GRCh37 - - `end_GRCh37`: chromosomal position of gene ends for the reference genome GRCh37 - - `strand`: DNA strand containing the coding sequence for the gene - - `gene_biotype`: gene type (examples: protein_coding or miRNA) - - `refseq_summary`: complete description of the gene according to the RefSeq database (RefSeq : NCBI Reference Sequences) - - `civic_description`: Description of the clinical relevance of the gene according to the CiVIC (Clinical Interpretation of Variants in Cancer) database - - `hgnc_id`: Gene identifier in the HGNC database - - `uniprot_ids`: Gene identifier in the Uniprot database - - `omim_id`: Gene identifier in the OMIM database - - `ensembl_gene_id` : Gene identifier in the Ensembl database - - `entrez_id` : Gene identifier in the NCBI Entrez database - - Example: - - URL: http://localhost:8000/information-of-genes - - body: + - Code: 200 + - Content: + - ``: Returns a Json with as many keys as there are valid genes in the body. 
For each gene, the value is a Json with the following format: + - `alias_symbol`: alternative symbols for a known gene + - `percentage_gene_gc_content`: Ratio of guanine and cytosine nucleotides in the DNA sequence of the gene + - `oncokb_cancer_gene`: return "Oncogene" or "Tumor Suppressor Gene" only if the gene has this information in the OncoKB database + - `name`: gene name according to the HGNC database + - `band`: cytoband or specific location in the genome + - `chromosome`: chromosome where the gene is located + - `start_position`: chromosomal position of gene starts for the reference genome GRCh38 + - `end_position`: chromosomal position of gene ends for the reference genome GRCh38 + - `start_GRCh37`: chromosomal position of gene starts for the reference genome GRCh37 + - `end_GRCh37`: chromosomal position of gene ends for the reference genome GRCh37 + - `strand`: DNA strand containing the coding sequence for the gene + - `gene_biotype`: gene type (examples: protein_coding or miRNA) + - `refseq_summary`: complete description of the gene according to the RefSeq database (RefSeq : NCBI Reference Sequences) + - `civic_description`: Description of the clinical relevance of the gene according to the CiVIC (Clinical Interpretation of Variants in Cancer) database + - `hgnc_id`: Gene identifier in the HGNC database + - `uniprot_ids`: Gene identifier in the Uniprot database + - `omim_id`: Gene identifier in the OMIM database + - `ensembl_gene_id` : Gene identifier in the Ensembl database + - `entrez_id` : Gene identifier in the NCBI Entrez database + - Example: + - URL: + - body: `{ "gene_ids":[ "INVALIDGENE", - "MIR365A", + "MC1R", "ALK" ] }` - - Response: + - Response: + ```json { "ALK":{ @@ -151,7 +154,7 @@ From a list of valid genes, it obtains different information for the human refer ], "band":"p23.1", "chromosome":"2", - "civic_description":"ALK amplifications, fusions and mutations have been shown to be driving events in non-small cell lung cancer...", + 
"civic_description":"ALK amplifications, fusions and mutations have been shown to be driving ...", "end_GRCh37":30144432, "end_position":29921586, "ensembl_gene_id":"ENSG00000171094", @@ -168,52 +171,55 @@ From a list of valid genes, it obtains different information for the human refer "strand":-1, "uniprot_ids":"Q9UM73" }, - "MIR365A":{ - "alias_symbol":"hsa-mir-365-1", - "band":"p13.12", + "MC1R":{ + "alias_symbol":"MSH-R", + "band":"q24.3", "chromosome":"16", - "end_GRCh37":14403228, - "end_position":14309371, - "ensembl_gene_id":"ENSG00000199130", - "entrez_id":"100126355", - "gene_biotype":"miRNA", - "hgnc_id":"HGNC:33692", - "name":"microRNA 365a", - "omim_id":"614735", - "percentage_gene_gc_content":44.83, - "refseq_summary":"microRNAs (miRNAs) are short (20-24 nt) non-coding RNAs that are involved in post-transcriptional regulation ...", - "start_GRCh37":14403142, - "start_position":14309285, - "strand":1 + "civic_description":"", + "end_GRCh37":89987385, + "end_position":89920973, + "ensembl_gene_id":"ENSG00000258839", + "entrez_id":"4157", + "gene_biotype":"protein_coding", + "hgnc_id":"HGNC:6929", + "name":"melanocortin 1 receptor", + "omim_id":"155555", + "percentage_gene_gc_content":58.17, + "refseq_summary":"This intronless gene encodes the receptor protein for melanocyte-stimulating hormone (MSH). The encoded protein...", + "start_GRCh37":89978527, + "start_position":89912119, + "strand":1, + "uniprot_ids":"Q01726" } } ``` - Keep in mind: + + Keep in mind: - If a gene passed in the body is not found in the database (invalid gene symbol), it will not appear in the response. - If one of the fields for a gene has no value, it will not appear in the response. - ### Gene Groups Gets the identifier of a gene, validates it and then returns the group of genes to which it belongs according to HGNC, and all the other genes that belong to the same group. 
- URL: /genes-of-its-group/<*gene_id*> - - <*gene_id*> is the identifier of the gene for any database + - <*gene_id*> is the identifier of the gene for any database - Method: GET - Params: - - Success Response: - - Code: 200 - - Content: - - `gene_id`: HGNC approved gene symbol. - - `locus_group`: - - `locus_type`: - - `groups`: - - `gene_group`: - - `gene_group_id`: - - `genes`: - - Example: - - URL: http://localhost:8000/genes-of-its-group/ENSG00000146648 - - Response: + - Code: 200 + - Content: + - `gene_id`: HGNC approved gene symbol. + - `locus_group`: + - `locus_type`: + - `groups`: + - `gene_group`: + - `gene_group_id`: + - `genes`: + - Example: + - URL: + - Response: + ```json { "gene_id":"EGFR", @@ -234,35 +240,35 @@ Gets the identifier of a gene, validates it and then returns the group of genes } ``` - ### Genes of a metabolic pathway Get the list of genes that are involved in a pathway for a given database. - URL: /pathway-genes/<*source*>/<*external_id*> - - <*source*>: Database to query. Use lowercase. Valid Options: - - kegg ([link](https://www.genome.jp/kegg/)) - - biocarta ([link](https://maayanlab.cloud/Harmonizome/resource/Biocarta)) - - ehmn ([link](http://allie.dbcls.jp/pair/EHMN;Edinburgh+Human+Metabolic+Network.html)) - - humancyc ([link](https://humancyc.org/)) - - inoh ([link](https://dbarchive.biosciencedbc.jp/en/inoh/desc.html)) - - netpath ([link](https://www.wikipathways.org/index.php/Portal:NetPath)) - - pid ([link](https://github.com/NCIP/pathway-interaction-database)) - - reactome ([link](https://reactome.org/)) - - smpdb ([link](https://www.smpdb.ca/)) - - signalink ([link](http://signalink.org/)) - - wikipathways ([link](https://www.wikipathways.org/index.php/WikiPathways)) + - <*source*>: Database to query. Use lowercase. 
Valid Options: + - kegg ([link](https://www.genome.jp/kegg/)) + - biocarta ([link](https://maayanlab.cloud/Harmonizome/resource/Biocarta)) + - ehmn ([link](http://allie.dbcls.jp/pair/EHMN;Edinburgh+Human+Metabolic+Network.html)) + - humancyc ([link](https://humancyc.org/)) + - inoh ([link](https://dbarchive.biosciencedbc.jp/en/inoh/desc.html)) + - netpath ([link](https://www.wikipathways.org/index.php/Portal:NetPath)) + - pid ([link](https://github.com/NCIP/pathway-interaction-database)) + - reactome ([link](https://reactome.org/)) + - smpdb ([link](https://www.smpdb.ca/)) + - signalink ([link](http://signalink.org/)) + - wikipathways ([link](https://www.wikipathways.org/index.php/WikiPathways)) Using an invalid option returns an empty list of genes. - - <*external_id*>: Pathway identifier in the source database. + - <*external_id*>: Pathway identifier in the source database. - Method: GET - Params: - - Success Response: - - Code: 200 - - Content: - - `genes`: a list of genes involved in the metabolic pathway. - - Example: - - URL: http://localhost:8000/pathway-genes/kegg/hsa00740 - - Response: + - Code: 200 + - Content: + - `genes`: a list of genes involved in the metabolic pathway. + - Example: + - URL: + - Response: + ```json { "genes":[ @@ -278,7 +284,6 @@ Get the list of genes that are involved in a pathway for a given database. } ``` - ### Metabolic pathways from different genes Gets the common pathways for a list of genes. @@ -286,24 +291,25 @@ Gets the common pathways for a list of genes. - URL: /pathways-in-common - Method: POST - Params: A body in Json format with the following content - - `gene_ids`: list of genes for which you want to get the common metabolic pathways. If you use a list with a single gene, then you will get all the pathways for that gene + - `gene_ids`: list of genes for which you want to get the common metabolic pathways. 
If you use a list with a single gene, then you will get all the pathways for that gene - Success Response: - - Code: 200 - - Content: - - `pathways`: list of elements of type Json. Each element corresponds to a different metabolic pathway. - - `source`: database of the metabolic pathway found. - - `external_id`: pathway identifier in the source. - - `pathway`: name of the pathway. - - Example: - - URL: http://localhost:8000/pathways-in-common - - body: + - Code: 200 + - Content: + - `pathways`: list of elements of type Json. Each element corresponds to a different metabolic pathway. + - `source`: database of the metabolic pathway found. + - `external_id`: pathway identifier in the source. + - `pathway`: name of the pathway. + - Example: + - URL: + - body: `{ "gene_ids":[ "HLA-B", "BRAF" ] }` - - Response: + - Response: + ```json { "pathways":[ @@ -316,7 +322,6 @@ Gets the common pathways for a list of genes. } ``` - ### Gene expression This service gets gene expression in healthy tissue @@ -324,16 +329,16 @@ This service gets gene expression in healthy tissue - URL: /expression-of-genes - Method: POST - Params: A body in Json format with the following content - - `gene_ids`: list of genes for which you want to get the expression. - - `tissue`: healthy tissue from which you want to get the expression values. + - `gene_ids`: list of genes for which you want to get the expression. + - `tissue`: healthy tissue from which you want to get the expression values. - Success Response: - - Code: 200 - - Content: + - Code: 200 + - Content: The response you get is a list. Each element of the list is a new list containing the expression values for each gene in the same sample from the GTEx database. - - ``: expression value for the gene_id. - - Example: - - URL: http://localhost:8000/expression-of-genes - - body: + - ``: expression value for the gene_id. 
+ - Example: + - URL: + - body: `{ "tissue":"Skin", "gene_ids":[ @@ -341,8 +346,9 @@ This service gets gene expression in healthy tissue "BRCA2" ] }` - - Response: - ```json + - Response: + + ```json [ [ { @@ -369,55 +375,60 @@ This service gets gene expression in healthy tissue } ] ] - ``` - keep in mind: - - As an example only three samples are shown. Note that in the GTEx database there may be more than 2500 samples for a given healthy tissue. - - If one of the genes entered as a parameter corresponds to an invalid symbol, the response will omit the values for that gene. It is recommended to use the *"Genes symbols validator"* service to validate your genes before using this functionality. + ``` + keep in mind: + - As an example only three samples are shown. Note that in the GTEx database there may be more than 2500 samples for a given healthy tissue. + - If one of the genes entered as a parameter corresponds to an invalid symbol, the response will omit the values for that gene. It is recommended to use the *"Genes symbols validator"* service to validate your genes before using this functionality. ### Therapies and actionable genes in cancer (OncoKB) -This service retrieves information of actionable genes and drugs obtained from the OncoKB database, at a therapeutic, diagnostic and prognostic level. + +This service retrieves information of FDA-Approved precision oncology therapies and +actionable genes and drugs obtained from the OncoKB database, at a therapeutic, diagnostic and prognostic level. - URL: /information-of-oncokb - Method: POST - Params: A body in Json format with the following content - - `gene_ids`: list of genes for which you want to get the information from OncoKB database. + - `gene_ids`: list of genes for which you want to get the information from OncoKB database. + - `query`: Optional. Parameter used to show only the results that match it. 
The query is used to find matches within the information offered by OncoKB for different "precision oncology therapies", "types of cancer", "biomarker detection methods" or "drugs". - Success Response: - - Code: 200 - - Content: - - ``: Returns a Json with as many keys as there are valid genes in the body. For each gene, the value is a Json with the following format: - - ``: Evidence of the gene for therapeutic. The value is a list of elements of the Json type, where each element is a different evidence with the following structure: - - ``: therapeutic drug. - - ``: level of therapeutic evidence. - - ``: specific cancer gene alterations. - - ``: type of cancer. - - ``: Evidence of the gene for diagnosis (for hematologic malignancies only). The value is a list of elements of the Json type, where each element is a different evidence with the following structure: - - ``: level of diagnostic evidence. - - ``: specific cancer gene alterations. - - ``: type of cancer. - - ``: Evidence of the gene for prognostic (for hematologic malignancies only). The value is a list of elements of the Json type, where each element is a different evidence with the following structure: - - ``: level of prognostic evidence. - - ``: specific cancer gene alterations. - - ``: type of cancer. - - ``: type of cancer gene. Oncogene and/or Tumor Suppressor Gene. - - ``: gene transcript according to the RefSeq database. - - ``: list of sources where there is evidence of the relationship of the gene with cancer. These may be different sequencing panels, the [Sanger Cancer Gene Census](https://www.sanger.ac.uk/data/cancer-gene-census/), or [Vogelstein et al. (2013)](http://science.sciencemag.org/content/339/6127/1546.full). - - ``: FDA-approved therapies that are considered precision oncology therapies by OncoKB™. 
The value is a list of elements of the Json type, where each element is a different precision oncology therapy with the following structure: - - ``: A drug that is most effective in a molecularly defined subset of patients and for which pre-treatment molecular profiling is required for optimal patient selection. - - ``: Year of drug’s first FDA-approval. The first year the drug received FDA-approval in any indication, irrespective of whether the biomarker was included in the FDA-drug at that time. - - ``: Possible classifications are first-in-class, mechanistically-distinct, follow-on, or resistance based on [Suehnholz et al. Cancer Discovery 2023](https://aacrjournals.org/cancerdiscovery/article/doi/10.1158/2159-8290.CD-23-0467/729589/Quantifying-the-Expanding-Landscape-of-Clinical). Only drugs with an FDA-specified biomarker that can be detected by a DNA/NGS-based detection method are classified. - - ``: Biomarkers related to therapy according to the FDA. Includes pathognomonic and indication-specific biomarkers, that while not specifically listed in the Indications and Usage section of the FDA drug label, are targeted by the precision oncology drug. - - ``: Biomarker detection method. If there is a corresponding FDA-cleared or -approved companion diagnostic device for biomarker identification, the detection method associated with this device is listed; if the biomarker can be detected by a DNA/NGS-based detection method this is listed first. - - - Example: - - URL: http://localhost:8000/information-of-oncokb - - body: + - Code: 200 + - Content: + - ``: Returns a Json with as many keys as there are valid genes in the body. For each gene, the value is a Json with the following format: + - ``: Evidence of the gene for therapeutic. The value is a list of elements of the Json type, where each element is a different evidence with the following structure: + - ``: therapeutic drug. + - ``: level of therapeutic evidence. + - ``: specific cancer gene alterations. 
+ - ``: type of cancer. + - ``: Evidence of the gene for diagnosis (for hematologic malignancies only). The value is a list of elements of the Json type, where each element is a different evidence with the following structure: + - ``: level of diagnostic evidence. + - ``: specific cancer gene alterations. + - ``: type of cancer. + - ``: Evidence of the gene for prognostic (for hematologic malignancies only). The value is a list of elements of the Json type, where each element is a different evidence with the following structure: + - ``: level of prognostic evidence. + - ``: specific cancer gene alterations. + - ``: type of cancer. + - ``: type of cancer gene. Oncogene and/or Tumor Suppressor Gene. + - ``: gene transcript according to the RefSeq database. + - ``: list of sources where there is evidence of the relationship of the gene with cancer. These may be different sequencing panels, the [Sanger Cancer Gene Census](https://www.sanger.ac.uk/data/cancer-gene-census/), or [Vogelstein et al. (2013)](http://science.sciencemag.org/content/339/6127/1546.full). + - ``: FDA-approved therapies that are considered precision oncology therapies by OncoKB™. The value is a list of elements of the Json type, where each element is a different precision oncology therapy with the following structure: + - ``: A drug that is most effective in a molecularly defined subset of patients and for which pre-treatment molecular profiling is required for optimal patient selection. + - ``: Year of drug’s first FDA-approval. The first year the drug received FDA-approval in any indication, irrespective of whether the biomarker was included in the FDA-drug at that time. + - ``: Possible classifications are first-in-class, mechanistically-distinct, follow-on, or resistance based on [Suehnholz et al. Cancer Discovery 2023](https://aacrjournals.org/cancerdiscovery/article/doi/10.1158/2159-8290.CD-23-0467/729589/Quantifying-the-Expanding-Landscape-of-Clinical). 
Only drugs with an FDA-specified biomarker that can be detected by a DNA/NGS-based detection method are classified. + - ``: Biomarkers related to therapy according to the FDA. Includes pathognomonic and indication-specific biomarkers, that while not specifically listed in the Indications and Usage section of the FDA drug label, are targeted by the precision oncology drug. + - ``: Biomarker detection method. If there is a corresponding FDA-cleared or -approved companion diagnostic device for biomarker identification, the detection method associated with this device is listed; if the biomarker can be detected by a DNA/NGS-based detection method this is listed first. + + - Example: + - URL: + - body: `{ "gene_ids":[ "ATM" - ] + ], + "query": "Olaparib" }` - - Response: + - Response: + ```json { "ATM":{ @@ -431,13 +442,6 @@ This service retrieves information of actionable genes and drugs obtained from t "fda_recognized_biomarkers":"ATM, BARD1, BRCA1/2, BRIP1, CDK12, CHEK1/2, FANCL, PALB2, RAD51B, RAD51C, RAD51D, RAD54 Oncogenic Mutations", "method_of_biomarker_detection":"DNA/NGS-based detection", "precision_oncology_therapy":"Olaparib" - }, - { - "drug_classification":"Follow-on", - "fda_first_approval":"2018", - "fda_recognized_biomarkers":"ATM, ATR, BRCA1/2, CDK12, CHEK2, FANCA, MLH1, MRE11, NBN, PALB2, RAD51C Oncogenic Mutations", - "method_of_biomarker_detection":"DNA/NGS-based detection", - "precision_oncology_therapy":"Talazoparib" } ], "refseq_transcript":"NM_000051.3", @@ -459,14 +463,14 @@ This service retrieves information of actionable genes and drugs obtained from t } ] } - } + +} ``` - Keep in mind: + Keep in mind: - If a gene passed in the body is not found in the database, it will not appear in the response. - If one of the fields for a gene has no value in the database, it will not appear in the response. - Values for cancer types use the [OncoTree nomenclature](http://oncotree.mskcc.org/). 
- ### Gene Ontology terms related to a list of genes Gets the list of related terms for a list of genes. @@ -474,40 +478,40 @@ Gets the list of related terms for a list of genes. - URL: /genes-to-terms - Method: POST - Params: A body in Json format with the following content - - `gene_ids`: list of genes for which you want to get the terms in common (they must be a list, and have to be in HGNC gene symbol format) - - `filter_type`: by default "intersection", in which case it bring all the terms that are related to all the genes, another option is "union" which brings all the terms that are related to **at least** on gene. The third option is "enrichment", it does a gene enrichment analysis on the input of genes with [g:Profiler library](https://biit.cs.ut.ee/gprofiler/page/docs). This filter type has 2 extra parameters: - - `p_value_threshold`: 0.05 by default. It's the p-value threshold for significance, results with smaller p-value are tagged + - `gene_ids`: list of genes for which you want to get the terms in common (they must be a list, and have to be in HGNC gene symbol format) + - `filter_type`: by default "intersection", in which case it bring all the terms that are related to all the genes, another option is "union" which brings all the terms that are related to **at least** on gene. The third option is "enrichment", it does a gene enrichment analysis on the input of genes with [g:Profiler library](https://biit.cs.ut.ee/gprofiler/page/docs). This filter type has 2 extra parameters: + - `p_value_threshold`: 0.05 by default. It's the p-value threshold for significance, results with smaller p-value are tagged as significant. Must be a float. Not recommended to set it higher than 0.05. - - `correction_method`: The enrichment default correction method is "analytical" which uses multiple testing correction and applies g:Profiler tailor-made algorithm [g:SCS](https://biit.cs.ut.ee/gprofiler/page/docs#significance_threhshold) for reducing significance scores. 
Alternatively, one may select "bonferroni" correction or "false_discovery_rate" (Benjamini-Hochberg FDR). - - `relation_type`: filters the relation between genes and terms. By default it's ["enables","involved_in","part_of","located_in"]. It should always be a list containing any permutation of the allowed relations. Only valid on `filter_type` intersection and union. should not be present on "enrichment" - - `ontology_type`: filters the ontology type of the terms in the response. By default it's ["biological_process", "molecular_function", "cellular_component"]. It should always be a list containing any permutation of the 3 ontologies. + - `correction_method`: The enrichment default correction method is "analytical" which uses multiple testing correction and applies g:Profiler tailor-made algorithm [g:SCS](https://biit.cs.ut.ee/gprofiler/page/docs#significance_threhshold) for reducing significance scores. Alternatively, one may select "bonferroni" correction or "false_discovery_rate" (Benjamini-Hochberg FDR). + - `relation_type`: filters the relation between genes and terms. By default it's ["enables","involved_in","part_of","located_in"]. It should always be a list containing any permutation of the allowed relations. Only valid on `filter_type` intersection and union. should not be present on "enrichment" + - `ontology_type`: filters the ontology type of the terms in the response. By default it's ["biological_process", "molecular_function", "cellular_component"]. It should always be a list containing any permutation of the 3 ontologies. - Success Response: - - Code: 200 - - Content: - The response you get is a list. Each element of the list is a GO term that fulfills the conditions of the query. GO terms can contain name, definition, relations to other terms, etc. - - ``: Unique identifier. - - ``: human-readable term name. - - ``: Denotes which of the three sub-ontologies (cellular component, biological process or molecular function) the term belongs to. 
- - ``: A textual description of what the term represents, plus reference(s) to the source of the information. - - relations to other terms: Each go term can be related to many other terms with a [variety of relations](http://geneontology.org/docs/ontology-relations/). - - ``: Alternative words or phrases closely related in meaning to the term name, with indication of the relationship between the name and synonym given by the synonym scope. - - ``: Indicates that the term belongs to a designated subset of terms. - - ``: list of elements of type Json. Each element corresponds to a gene and how it's related to the term. - - ``: name of the gene. - - ``: the type of relation between the gene and the GO term. When `filter_type` is enrichment, extra relation will be gather from g:Profiler database. These relations will be shown as "relation obtained from gProfiler". - - ``: evidence code to indicate how the annotation to a particular term is supported. - - ``: . - - ``: Hypergeometric p-value after correction for multiple testing. - - ``: The number of genes in the query that are annotated to the corresponding term. - - ``: The total number of genes "in the universe " which is used as one of the four parameters for the hypergeometric probability function of statistical significance. - - ``: The number of genes that were included in the query. - - ``: The number of genes that are annotated to the term. - - ``: The proportion of genes in the input list that are annotated to the function. Defined as intersection_size/query_size. - - ``: The proportion of functionally annotated genes that the query recovers. Defined as intersection_size/term_size. - - Example: - - URL: http://localhost:8000/genes-to-terms - - body: + - Code: 200 + - Content: + The response you get is a list. Each element of the list is a GO term that fulfills the conditions of the query. GO terms can contain name, definition, relations to other terms, etc. + - ``: Unique identifier. 
+ - ``: human-readable term name. + - ``: Denotes which of the three sub-ontologies (cellular component, biological process or molecular function) the term belongs to. + - ``: A textual description of what the term represents, plus reference(s) to the source of the information. + - relations to other terms: Each go term can be related to many other terms with a [variety of relations](http://geneontology.org/docs/ontology-relations/). + - ``: Alternative words or phrases closely related in meaning to the term name, with indication of the relationship between the name and synonym given by the synonym scope. + - ``: Indicates that the term belongs to a designated subset of terms. + - ``: list of elements of type Json. Each element corresponds to a gene and how it's related to the term. + - ``: name of the gene. + - ``: the type of relation between the gene and the GO term. When `filter_type` is enrichment, extra relation will be gather from g:Profiler database. These relations will be shown as "relation obtained from gProfiler". + - ``: evidence code to indicate how the annotation to a particular term is supported. + - ``: . + - ``: Hypergeometric p-value after correction for multiple testing. + - ``: The number of genes in the query that are annotated to the corresponding term. + - ``: The total number of genes "in the universe " which is used as one of the four parameters for the hypergeometric probability function of statistical significance. + - ``: The number of genes that were included in the query. + - ``: The number of genes that are annotated to the term. + - ``: The proportion of genes in the input list that are annotated to the function. Defined as intersection_size/query_size. + - ``: The proportion of functionally annotated genes that the query recovers. Defined as intersection_size/term_size. + - Example: + - URL: + - body: `{ "gene_ids":[ "TMCO4" @@ -519,8 +523,9 @@ as significant. Must be a float. Not recommended to set it higher than 0.05. 
"molecular_function" ] }` - - Response: - ```json + - Response: + + ```json [ { "alt_id":[ @@ -553,7 +558,8 @@ as significant. Must be a float. Not recommended to set it higher than 0.05. ] } ] - ``` + ``` + ### Gene Ontology terms related to another specific term Gets the list of related terms to a term. @@ -561,29 +567,30 @@ Gets the list of related terms to a term. - URL: /related-terms - Method: POST - Params: A body in Json format with the following content - - `term_id`: The term ID of the term you want to search - - `relations`: Filters the non-hierarchical relations between terms. By default it's ["part_of","regulates","has_part"]. It should always be a list - - `ontology_type`: Filters the ontology type of the terms in the response. By default it's ["biological_process", "molecular_function", "cellular_component"]It should always be a list containing any permutation of the 3 ontologies - - `general_depth`: The search depth for the non-hierarchical relations - - `hierarchical_depth_to_children`: The search depth for the hierarchical relations in the direction of the children - - `to_root`: 0 for false 1 fot true. If true get all the terms in the hierarchical relations in the direction of the root + - `term_id`: The term ID of the term you want to search + - `relations`: Filters the non-hierarchical relations between terms. By default it's ["part_of","regulates","has_part"]. It should always be a list + - `ontology_type`: Filters the ontology type of the terms in the response. By default it's ["biological_process", "molecular_function", "cellular_component"]It should always be a list containing any permutation of the 3 ontologies + - `general_depth`: The search depth for the non-hierarchical relations + - `hierarchical_depth_to_children`: The search depth for the hierarchical relations in the direction of the children + - `to_root`: 0 for false 1 fot true. 
If true get all the terms in the hierarchical relations in the direction of the root - Success Response: - - Code: 200 - - Content: The response you get is a list of GO terms related to the searched term that fulfills the conditions of the query. Each term has: - - ``: ID of the GO term - - ``: Name of the GO term - - ``: Denotes which of the three sub-ontologies (cellular component, biological process or molecular function) the term belongs to. - - ``: Dictionary of relations - - ``: List of terms related by that relation type to the term - - Example: - - URL: http://localhost:8000/related-terms - - body: + - Code: 200 + - Content: The response you get is a list of GO terms related to the searched term that fulfills the conditions of the query. Each term has: + - ``: ID of the GO term + - ``: Name of the GO term + - ``: Denotes which of the three sub-ontologies (cellular component, biological process or molecular function) the term belongs to. + - ``: Dictionary of relations + - ``: List of terms related by that relation type to the term + - Example: + - URL: + - body: `{ "term_id":"0000079", "general_depth":5, "to_root":0 }` - - Response: + - Response: + ```json [ { @@ -606,7 +613,7 @@ Gets the list of related terms to a term. } ] ``` - + ### Cancer related drugs (PharmGKB) Gets a list of related drugs to a list of genes. @@ -614,27 +621,28 @@ Gets a list of related drugs to a list of genes. - URL: /drugs-pharm-gkb - Method: POST - Params: A body in Json format with the following content - - `gene_ids`: list of genes for which the related drugs + - `gene_ids`: list of genes for which the related drugs - Success Response: - - Code: 200 - - Content: The response you get is a dictionary where the genes are the keys and the values are a list of all the related drug information - - ``: Identifier assigned to this drug label by PharmGKB - - ``: Name assigned to the label by PharmGKB - - ``: The source that originally authored the label (e.g. 
FDA, EMA) - - ``: "On" if drug in this label appears on the FDA Biomarker list; "Off (Formerly On)" if the label was on the FDA Biomarker list at one time; "Off (Never On)" if the label was never listed on the FDA Biomarker list (to PharmGKB's knowledge) - - ``: PGx testing level as annotated by PharmGKB based on definitions at https://www.pharmgkb.org/page/drugLabelLegend - - ``: Related chemicals - - ``: List of related genes - - ``: Related variants and/or haplotypes - - Example: - - URL: http://localhost:8000/drugs-pharm-gkb - - body: + - Code: 200 + - Content: The response you get is a dictionary where the genes are the keys and the values are a list of all the related drug information + - ``: Identifier assigned to this drug label by PharmGKB + - ``: Name assigned to the label by PharmGKB + - ``: The source that originally authored the label (e.g. FDA, EMA) + - ``: "On" if drug in this label appears on the FDA Biomarker list; "Off (Formerly On)" if the label was on the FDA Biomarker list at one time; "Off (Never On)" if the label was never listed on the FDA Biomarker list (to PharmGKB's knowledge) + - ``: PGx testing level as annotated by PharmGKB based on definitions at + - ``: Related chemicals + - ``: List of related genes + - ``: Related variants and/or haplotypes + - Example: + - URL: + - body: `{ "gene_ids":[ "JAK2" ] }` - - Response: + - Response: + ```json { "JAK2":[ @@ -652,44 +660,46 @@ Gets a list of related drugs to a list of genes. } ] } - ``` + ``` ### Predicted functional associations network (String) Gets a list of genes and relations related to a gene. + - URL: /string-relations - Method: POST - Params: A body in Json format with the following content - - `gene_id`: target gene - - `min_combined_score`: the minimum combined scored allowed int the relations. Possible scores go from 1 to 1000 + - `gene_id`: target gene + - `min_combined_score`: the minimum combined scored allowed int the relations. 
Possible scores go from 1 to 1000 - Success Response: - - Code: 200 - - Content: The response you get is a list of relations containing the targeted gene - - ``: Gene 1 in the bidirectional relationship - - ``: Gene 2 in the bidirectional relationship - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Optional. Values range from 1 to 1000 - - `: Values range from 1 to 1000 - - - Example: - - URL: http://localhost:8000/string-relations - - body: + - Code: 200 + - Content: The response you get is a list of relations containing the targeted gene + - ``: Gene 1 in the bidirectional relationship + - ``: Gene 2 in the bidirectional relationship + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Optional. Values range from 1 to 1000 + - `: Values range from 1 to 1000 + + - Example: + - URL: + - body: `{ "gene_id":"MX2", "min_combined_score":996 }` - - Response: + - Response: + ```json [ { @@ -704,23 +714,24 @@ Gets a list of genes and relations related to a gene. 
"textmining_transferred":257 } ] - ``` + ``` ### Drugs that regulate a gene -Service that takes gene symbol and returns a link to https://go.drugbank.com with all the drugs that upregulate and down regulate its expresion. Useful for embeding. +Service that takes gene symbol and returns a link to with all the drugs that upregulate and down regulate its expresion. Useful for embeding. - URL: drugs-regulating-gene/<*gene_id*> - - <*gene_id*> is the identifier of the gene + - <*gene_id*> is the identifier of the gene - Method: GET - Params: - - Success Response: - - Code: 200 - - Content: The response you get is a dictionary with a single key called 'link' where its value is a URL that points to the information on the DrugBank website. - - ``: Link to DrugBank website - - Example: - - URL: http://localhost:8000/drugs-regulating-gene/TP53 - - Response: + - Code: 200 + - Content: The response you get is a dictionary with a single key called 'link' where its value is a URL that points to the information on the DrugBank website. 
+ - ``: Link to DrugBank website + - Example: + - URL: + - Response: + ```json { "link": "https://go.drugbank.com/pharmaco/transcriptomics?q%5Bg%5B0%5D%5D%5Bm%5D=or&q%5Bg%5B0%5D%5D%5Bdrug_approved_true%5D=all&q%5Bg%5B0%5D%5D%5Bdrug_nutraceutical_true%5D=all&q%5Bg%5B0%5D%5D%5Bdrug_illicit_true%5D=all&q%5Bg%5B0%5D%5D%5Bdrug_investigational_true%5D=all&q%5Bg%5B0%5D%5D%5Bdrug_withdrawn_true%5D=all&q%5Bg%5B0%5D%5D%5Bdrug_experimental_true%5D=all&q%5Bg%5B1%5D%5D%5Bm%5D=or&q%5Bg%5B1%5D%5D%5Bdrug_available_in_us_true%5D=all&q%5Bg%5B1%5D%5D%5Bdrug_available_in_ca_true%5D=all&q%5Bg%5B1%5D%5D%5Bdrug_available_in_eu_true%5D=all&commit=Apply+Filter&q%5Bdrug_precise_names_name_cont%5D=&q%5Bgene_symbol_eq%5D=TP53&q%5Bgene_id_eq%5D=&q%5Bchange_eq%5D=&q%5Binteraction_cont%5D=&q%5Bchromosome_location_cont%5D=" @@ -730,11 +741,12 @@ Service that takes gene symbol and returns a link to https://go.drugbank.com wit ## Error Responses The possible error codes are 400, 404 and 500. The content of each of them is a Json with a unique key called "error" where its value is a description of the problem that produces the error. For example: -```json -{ - "error": "404 Not Found: invalid gene identifier" -} -``` + + ```json + { + "error": "404 Not Found: invalid gene identifier" + } + ``` ## Contributing @@ -744,7 +756,6 @@ All kind of contribution is welcome! If you want to contribute just: 2. Create a new branch and introduce there your new changes. 3. Make a Pull Request! - ### Run Flask dev server 1. Start up Docker services like MongoDB: `docker compose -f docker-compose.dev.yml up -d` @@ -753,11 +764,9 @@ All kind of contribution is welcome! If you want to contribute just: **NOTE:** If you are looking for documentation for a production deployment see [DEPLOYING.md](DEPLOYING.md). - ### Tests To run all the tests: 1. Go to the `bioapi` folder. 2. Run the `pytest` command. 
- diff --git a/bio-api/bioapi.py b/bio-api/bioapi.py index 0a36039..ce68b1f 100755 --- a/bio-api/bioapi.py +++ b/bio-api/bioapi.py @@ -19,7 +19,7 @@ PROCESS_POOL_WORKERS: int = int(os.getenv('PROCESS_POOL_WORKERS', 4)) # BioAPI version -VERSION = '1.1.1' +VERSION = '1.2.0' # Valid pathways sources PATHWAYS_SOURCES = ["kegg", "biocarta", "ehmn", "humancyc", "inoh", "netpath", "pid", "reactome", @@ -525,7 +525,7 @@ def cancer_drugs_related_to_gene(gene: str) -> List: return list(collection_pharm.find({"genes": gene}, {"_id": 0})) -def get_data_from_oncokb(genes: List[str]) -> Dict[str, Dict[str, Any]]: +def get_data_from_oncokb(genes: List[str], query: str) -> Dict[str, Dict[str, Any]]: """ Gets all data from OncoKB database associated with a gene list. @@ -543,6 +543,8 @@ def get_data_from_oncokb(genes: List[str]) -> Dict[str, Dict[str, Any]]: docs_oncology_therapies = collection_precision_oncology_therapies.find( query2, projection) res = {} + patron = re.compile(query, re.IGNORECASE) + for doc_a in docs_actionability: gen = doc_a.pop("gene") classification = doc_a.pop("classification").lower() @@ -550,7 +552,18 @@ def get_data_from_oncokb(genes: List[str]) -> Dict[str, Dict[str, Any]]: res[gen] = {} if classification not in res[gen]: res[gen][classification] = [] - res[gen][classification].append(doc_a) + + if query == "": + res[gen][classification].append(doc_a) + + else: + # Search for query in predefined keys + for key, value in doc_a.items(): + if key in ["cancer_types", "drugs"]: + if re.search(patron, value): + res[gen][classification].append(doc_a) + if len(res[gen][classification]) == 0: + res[gen].pop(classification) for doc_c in docs_cancer: gen = doc_c.pop("hgnc_symbol") @@ -583,12 +596,18 @@ def get_data_from_oncokb(genes: List[str]) -> Dict[str, Dict[str, Any]]: res[gen_t] = {} if "precision_therapies" not in res[gen_t]: res[gen_t]["precision_therapies"] = [] - res[gen_t]["precision_therapies"].append(doc_t) - + if query == "": + 
res[gen_t]["precision_therapies"].append(doc_t) + else: + # Search for query in predefinided keys + for key, value in doc_t.items(): + if key in ["precision_oncology_therapy", "method_of_biomarker_detection"]: + if re.search(patron, value): + res[gen_t]["precision_therapies"].append(doc_t) + if len(res[gen_t]["precision_therapies"]) == 0: + res[gen_t].pop("precision_therapies") return res -# string - def associated_string_genes(gene_symbol: str, min_combined_score: int = 400) -> List: """Given a gene returns all the related genes and all the relations between them @@ -941,13 +960,15 @@ def oncokb_data(): abort(400, "gene_ids is mandatory") gene_ids = body['gene_ids'] - if type(gene_ids) != list: + query = "" if "query" not in body else body['query'] + + if not isinstance(gene_ids, list): abort(400, "gene_ids must be a list") if len(gene_ids) == 0: abort(400, "gene_ids must contain at least one gene symbol") - data = get_data_from_oncokb(gene_ids) + data = get_data_from_oncokb(gene_ids, query) return jsonify(data) diff --git a/databases/gene_info/geneinfo2mongodb.sh b/databases/gene_info/geneinfo2mongodb.sh index 94d22ec..1c45a12 100755 --- a/databases/gene_info/geneinfo2mongodb.sh +++ b/databases/gene_info/geneinfo2mongodb.sh @@ -16,8 +16,8 @@ R -f get_datasets.R echo "INFO OK." date echo "INFO Importing to MongoDB..." 
-cat gene_info_grch37.csv | sudo docker container exec -i bio_api_mongo_db mongoimport --verbose=1 --host $ip_mongo --port $port_mongo --username $user --password $password --drop --stopOnError --db $db --collection gene_grch37 --authenticationDatabase admin --type csv --headerline --ignoreBlanks -cat gene_info_grch38.csv | sudo docker container exec -i bio_api_mongo_db mongoimport --verbose=1 --host $ip_mongo --port $port_mongo --username $user --password $password --drop --stopOnError --db $db --collection gene_grch38 --authenticationDatabase admin --type csv --headerline --ignoreBlanks +cat gene_info_grch37.csv | sudo docker container exec -i bio_api_mongo_db mongoimport --verbose=1 --host $ip_mongo --port $port_mongo --username $user --password $password --drop --stopOnError --db $db --collection gene_grch37 --authenticationDatabase admin --type csv --headerline +cat gene_info_grch38.csv | sudo docker container exec -i bio_api_mongo_db mongoimport --verbose=1 --host $ip_mongo --port $port_mongo --username $user --password $password --drop --stopOnError --db $db --collection gene_grch38 --authenticationDatabase admin --type csv --headerline echo "INFO OK." date echo "INFO Creating indexes..." 
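Note on the `query` filter added to `get_data_from_oncokb` in `bio-api/bioapi.py` above: as written in the diff, a document appears to be appended once per matching key, so a record whose `cancer_types` and `drugs` both match would show up twice. The sketch below shows the same case-insensitive regex filtering with each document kept at most once. `filter_docs_by_query` is a hypothetical helper and the sample records are made up for illustration; neither is part of this change set.

```python
import re
from typing import Any, Dict, Iterable, List


def filter_docs_by_query(docs: Iterable[Dict[str, Any]], query: str,
                         keys: List[str]) -> List[Dict[str, Any]]:
    """Keeps the documents whose value under any of `keys` matches `query`."""
    if query == "":
        return list(docs)  # an empty query disables filtering, as in the diff
    pattern = re.compile(query, re.IGNORECASE)
    kept = []
    for doc in docs:
        # any() stops at the first matching key, so a document is kept at most once
        if any(isinstance(doc.get(key), str) and pattern.search(doc[key])
               for key in keys):
            kept.append(doc)
    return kept


# Made-up sample records (not real OncoKB data)
docs = [
    {"cancer_types": "Melanoma", "drugs": "drug A", "level": "1"},
    {"cancer_types": "Colorectal Cancer", "drugs": "drug B", "level": "1"},
    {"cancer_types": "melanoma", "drugs": "anti-melanoma drug C", "level": "2"},
]
melanoma_docs = filter_docs_by_query(docs, "melanoma", ["cancer_types", "drugs"])
```

Because `query` is compiled as a regular expression, input such as `"("` raises `re.error`. Wrapping the value in `re.escape` is one way to treat it as literal text, if that is the intended behaviour.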
diff --git a/databases/gene_info/get_datasets.R b/databases/gene_info/get_datasets.R index 761779c..460ffc0 100644 --- a/databases/gene_info/get_datasets.R +++ b/databases/gene_info/get_datasets.R @@ -1,11 +1,15 @@ -if (!requireNamespace("BiocManager", quietly = TRUE)) - install.packages("BiocManager",repos = "http://cran.us.r-project.org") +if (!require("BiocManager", quietly = TRUE)) + install.packages("BiocManager") +BiocManager::install(version = "3.18", ask = FALSE) if("biomaRt" %in% rownames(installed.packages()) == FALSE) BiocManager::install("biomaRt", force = TRUE) -if("dplyr" %in% rownames(installed.packages()) == FALSE) - install.packages("dplyr",repos = "http://cran.us.r-project.org") +if("devtools" %in% rownames(installed.packages()) == FALSE) + install.packages("devtools", repos = "http://cran.us.r-project.org") + +if("dbplyr" %in% rownames(installed.packages()) == FALSE) + devtools::install_version("dbplyr", version = "2.3.4", force = TRUE) if("GeneSummary" %in% rownames(installed.packages()) == FALSE) BiocManager::install("GeneSummary", update = FALSE, ask = FALSE, force = FALSE) @@ -14,14 +18,13 @@ library(GeneSummary) library(dplyr) library(biomaRt) library(stringr) +library(BiocManager) civicUrl="https://civicdb.org/downloads/nightly/nightly-GeneSummaries.tsv" -try(civicUrl <- Sys.getenv(CIVIC_URL)) +try(civicUrl <- Sys.getenv(CIVIC_URL), silent = TRUE) -# BUSQUEDA PARA GRCh38 #### -ensembl_grch38 = useEnsembl(biomart="genes", dataset = "hsapiens_gene_ensembl") -# posibles_filtros <- listFilters(ensembl_grch38) -# posibles_atributos <- listAttributes(ensembl_grch38) +# GRCh38 #### +ensembl_grch38 = useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl") atr <- c("ensembl_gene_id", "description", "chromosome_name", "start_position", "end_position", "strand", "band", "percentage_gene_gc_content", "gene_biotype", "hgnc_symbol", "hgnc_id", "entrezgene_id") fil <- c("with_hgnc") res_grch38 <- getBM( @@ -33,10 +36,10 @@ res_grch38 <- getBM( 
res_grch38$description <- data.frame(do.call('rbind', str_split(res_grch38$description, " \\[Source")))$X1 -# BUSQUEDA PARA GRCh37 #### +# GRCh37 #### ensembl_grch37 = useEnsembl(biomart="genes", dataset = "hsapiens_gene_ensembl", GRCh = 37) -# posibles_filtros <- listFilters(ensembl_grch37) -# posibles_atributos <- listAttributes(ensembl_grch37) +# listFilters(ensembl_grch37) +# listAttributes(ensembl_grch37) atr <- c("ensembl_gene_id", "description", "chromosome_name", "start_position", "end_position", "strand", "band", "percentage_gene_gc_content", "gene_biotype", "hgnc_symbol", "hgnc_id", "entrezgene_id") fil <- c("with_hgnc") res_grch37 <- getBM( @@ -49,9 +52,6 @@ res_grch37 <- getBM( res_grch37$description <- data.frame(do.call('rbind', str_split(res_grch37$description, " \\[Source")))$X1 - - - # Summary of genes for Humans from RefSeq RefSeqSummaryGenes = loadGeneSummary(organism = 9606)[, c(4,6)] # col 4: EntrezID. col 6: Summary of RefSeq gene RefSeqSummaryGenesFiltered <- distinct(RefSeqSummaryGenes, Gene_ID, .keep_all = TRUE) # remove duplpicates ids diff --git a/databases/oncokb/oncokb_oncology_therapies_tsv2json.py b/databases/oncokb/oncokb_oncology_therapies_tsv2json.py index 00e63cb..754d67d 100644 --- a/databases/oncokb/oncokb_oncology_therapies_tsv2json.py +++ b/databases/oncokb/oncokb_oncology_therapies_tsv2json.py @@ -93,24 +93,22 @@ def search_gene(biomarker_description: str) -> List[str]: } for special_case in special_cases: if re.search(special_case, biomarker_description): - genes = special_cases[special_case] + genes.extend(special_cases[special_case]) for i in range(0, len(hgnc_genes)): if re.search(valid_genes_compiled_patern[i], biomarker_description): - if hgnc_genes[i] not in genes: - genes.append(hgnc_genes[i]) + genes.append(hgnc_genes[i]) a = 0 for alias in hgnc_aliases: if re.search(aliases_compiled_patern[a], biomarker_description): - if hgnc_aliases[alias] not in genes: - genes.append(hgnc_aliases[alias]) + 
genes.append(hgnc_aliases[alias]) a += 1 p = 0 for prev in hgnc_previous: if re.search(previous_compiled_patern[p], biomarker_description): - if hgnc_previous[prev] not in genes: - genes.append(hgnc_previous[prev]) + genes.append(hgnc_previous[prev]) p += 1 + genes = list(set(genes)) # Remove duplicate genes return genes diff --git a/docker-compose.dev.yml b/docker-compose.dev.yml index e1fa5fd..cb52ff2 100644 --- a/docker-compose.dev.yml +++ b/docker-compose.dev.yml @@ -24,8 +24,8 @@ services: volumes: mongo_data: - external: - name: 'bio_api_mongo_data' + external: true + name: 'bio_api_mongo_data' mongo_config: - external: - name: 'bio_api_mongo_config' + external: true + name: 'bio_api_mongo_config' diff --git a/docker-compose_dist.yml b/docker-compose_dist.yml index 036c017..19664fe 100755 --- a/docker-compose_dist.yml +++ b/docker-compose_dist.yml @@ -21,7 +21,7 @@ services: # BioAPI server bioapi: - image: omicsdatascience/bio-api:1.1.1 + image: omicsdatascience/bio-api:1.2.0 restart: always volumes: - bioapi_logs:/logs
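Note on the `search_gene` change in `databases/oncokb/oncokb_oncology_therapies_tsv2json.py` above: `list(set(genes))` removes duplicates but does not guarantee a stable order for the returned symbols. If the generated JSON needs to be reproducible between runs, an order-preserving variant is one option. This is a minimal sketch; `dedupe_keep_order` is a hypothetical helper, not part of the repository.

```python
from typing import Iterable, List


def dedupe_keep_order(genes: Iterable[str]) -> List[str]:
    """Removes duplicates while keeping the first occurrence of each gene."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(genes))


unique_genes = dedupe_keep_order(["ERBB2", "EGFR", "ERBB2", "KRAS", "EGFR"])
print(unique_genes)  # ['ERBB2', 'EGFR', 'KRAS']
```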