KG2: the second-generation RTX knowledge graph

KG2 is the second-generation knowledge graph for the ARAX biomedical reasoning system. This GitHub repository contains all of the code for building KG2 as well as all of the documentation about how to build, host, access, and use KG2. The KG2 build system produces knowledge graphs in a Biolink model standard-compliant JSON format and in a tab-separated value (TSV) format that can be imported into a Neo4j graph database system. Through additional scripts in the ARAX kg2c subdirectory, the build system can produce a "canonicalized" knowledge graph in which synonymous concepts (nodes) are identified. Through additional scripts in the mediKanren subdirectory, the build system can produce an export of the KG2 knowledge graph that is suitable for importing into the mediKanren biomedical reasoning system.

KG2 team contact information

KG2 Team

Bug reports

Please use the GitHub issues page for this project.

Is RTX-KG2 published?

Yes, please see:

Wood, E.C., Glen, A.K., Kvarfordt, L.G. et al. RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine. BMC Bioinformatics 23, 400 (2022). https://doi.org/10.1186/s12859-022-04932-3

The preprint can be found at: doi:10.1101/2021.10.17.464747.

How to access RTX-KG2

Neo4j read-only endpoint for RTX KG2 as a graph database

(RTX-KG2 team members only: contact the KG2 maintainer for the endpoint, username, and password)

What data sources are used in KG2?

Information from many knowledge databases is combined in building KG2. The table below was compiled from the Snakemake diagram and ont-load-inventory.yaml.

Knowledge Source Type Redistribution license info Home page
ChEMBL data link link
DGIdb data link link
DisGeNET data link link
DrugBank data link link
DrugCentral data link
Ensembl data link link
GO_Annotations data link
Guide to Pharmacology data link
HMDB data link
IntAct data link
JensenLab data link
miRBase data link link
NCBIGene data link
PathWhiz data link
Reactome data link link
RepoDB data link
SemMedDB data link link
SMPDB data link link
Therapeutic Target Database data link
UniChem data link
UniProtKB data link link
Anatomical Therapeutic Chemical Classification System ontology link
Basic Formal Ontology ontology link
Biolink meta-model ontology link
Biological Spatial Ontology ontology link
Cell Ontology ontology link
Chemical Entities of Biological Interest ontology link
CPT in HCPCS ontology link
Current Procedural Terminology ontology link
Dictyostelium discoideum anatomy ontology link
Disease Ontology ontology link
Experimental Factor Ontology ontology link
FOODON (Food Ontology) ontology link
Foundational Model of Anatomy ontology link
Gene Ontology ontology link
Gene Ontology ontology link
Genomic Epidemiology Ontology ontology link
Healthcare Common Procedure Coding System ontology link
HL7 Version 3.0 ontology link
HUGO Gene Nomenclature Committee ontology link
Human developmental anatomy, abstract ontology link
Human Phenotype Ontology ontology link
ICD-10 Procedure Coding System ontology link
ICD-10, American English Equivalents ontology link
Interaction Network Ontology ontology link
International Classification of Diseases and Related Health Problems, ontology link
International Classification of Diseases, Ninth Revision, Clinical Modification ontology link
International Classification of Diseases, Tenth Revision, Clinical Modification ontology link
Logical Observation Identifiers Names and Codes ontology link
MedDRA ontology link
Medical Subject Headings ontology link
Medication Reference Terminology ontology link
MedlinePlus Health Topics ontology link
Metathesaurus Names ontology link
Molecular Interactions Controlled Vocabulary ontology link
MONDO Disease Ontology ontology link
National Drug Data File ontology link
National Drug File ontology link
National Drug File - Reference Terminology ontology link
NCBITaxon ontology link
NCI Thesaurus ontology link
Neuro Behavior Ontology ontology link
Online Mendelian Inheritance in Man ontology link link
ORPHANET Rare Disease Ontology ontology link
Phenotypic Quality Ontology ontology link
Physician Data Query ontology link
Protein Ontology ontology link
Psychological Index Terms ontology link
Relation Ontology ontology link
RXNORM ontology link
SNOMED Clinical Terms US Edition ontology link link
Uber-anatomy Ontology ontology link
UMLS Semantic Types ontology link link

How to build RTX KG2 from its upstream sources

General notes:

The KG2 build system is designed to run only in an Ubuntu 18.04 environment, i.e., either (i) an Ubuntu 18.04 host OS or (ii) Ubuntu 18.04 running in a Docker container. It must be run as a non-root user that has passwordless sudo enabled and that should have bash as the default shell (the build commands in this README assume a bash shell). The build system also needs a locally configured installation of the Amazon Web Services (AWS) command-line interface (CLI) software so that it can retrieve various required files on demand from a storage bucket in the AWS Simple Storage Service (S3); the build system sets up the CLI for itself and prompts the user for access keys at setup time.

Currently, KG2 is built using a set of bash scripts that are designed to run in Amazon's Elastic Compute Cloud (EC2); configurability and coexistence with other installed software pipelines were therefore not design considerations for the build system. The KG2 build system's bash scripts create three subdirectories under the ${HOME} directory of whatever Linux user account you use to run the KG2 build software (if you run on an EC2 Ubuntu instance, this directory would by default be /home/ubuntu):

  1. ~/kg2-build, where various build artifacts are stored
  2. ~/kg2-code, which is a symbolic link to the git checkout directory RTX-KG2/
  3. ~/kg2-venv, which is the virtualenv for the KG2 build system

The various directories used by the KG2 build system are configured in the bash include file master-config.shinc. Most of the KG2 build system code is written in Python 3 and is designed to run in Python 3.7 (it has been tested specifically in Python 3.7.5).

Note about atomicity of file moving: The build software is designed to run with the kg2-build directory being in the same file system as the Python temporary file directory (i.e., the directory name that is returned by the variable tempfile.tempdir in Python). If the KG2 software or installation is modified so that kg2-build is in a different file system from the file system in which the directory tempfile.tempdir (as referenced in the tempfile python module) resides, then the file moving operations that are performed by the KG2 build software will not be atomic and interruption of build-kg2.sh or its subprocesses could then leave a source data file in a half-downloaded (i.e., broken) state.
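As a rough illustration of the condition described above, the following sketch compares the device IDs of the two directories. It is not part of the KG2 codebase: the helper names `nearest_existing` and `same_filesystem` are hypothetical, and it assumes that `~/kg2-build` is the build directory and that matching `st_dev` values are a good enough proxy for "same file system".

```python
import os
import tempfile
from pathlib import Path


def nearest_existing(path: Path) -> Path:
    """Return the path itself, or its closest ancestor that exists."""
    path = path.expanduser().resolve()
    while not path.exists():
        path = path.parent
    return path


def same_filesystem(dir_a, dir_b) -> bool:
    """True if both directories report the same device ID (st_dev)."""
    dev_a = os.stat(nearest_existing(Path(dir_a))).st_dev
    dev_b = os.stat(nearest_existing(Path(dir_b))).st_dev
    return dev_a == dev_b


if __name__ == "__main__":
    build_dir = Path.home() / "kg2-build"  # build directory named in this README
    temp_dir = tempfile.gettempdir()       # Python's temporary file directory
    if same_filesystem(build_dir, temp_dir):
        print("OK: kg2-build and the temp directory share a file system")
    else:
        print("WARNING: file moves into kg2-build may not be atomic")
```

If the warning is printed, an interrupted build could leave a partially downloaded source file behind, as noted above.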

Build Frequency: We are currently aiming to build KG2 approximately once per month, to keep it as current as feasible given the cost to build and validate KG2 from its upstream sources.

Set up your computing environment

The computing environment where you will be running the KG2 build should be running Ubuntu 18.04. Your build environment should have the following minimum hardware specifications:

  • 256 GiB of system memory
  • 1,023 GiB of disk space in the root file system
  • high-speed networking (20 Gb/s networking) and storage
  • if you are on the RTX-KG2 team: ideally your build system should be in the AWS region us-west-2 since that is where the RTX KG2 S3 buckets are located

The KG2 build system assumes there is no MySQL already installed

The target Ubuntu system in which you will run the KG2 build should not have MySQL installed; if MySQL is already installed, you will need to delete it, which you can do using the following bash command, which requires curl: (WARNING! Please don't run this command without first making a backup image of your system, such as an AMI):

source <(curl -s https://raw.githubusercontent.com/RTXteam/RTX-KG2/master/delete-mysql-ubuntu.sh)

The KG2 build system has been tested only under Ubuntu 18.04. If you want to build KG2 but don't have a native installation of Ubuntu 18.04 available, your best bet would be to use Docker (see Build Option 4 below).

AWS buckets

In order to build KG2, you'll need to have at least one AWS S3 bucket set up (or use an existing bucket; we, the KG2 creators, use three S3 buckets, s3://rtx-kg2, s3://rtx-kg2-public, and s3://rtx-kg2-versioned, which are in the us-west-2 AWS region). You will also need an AWS authentication key pair that is configured to be able to read from (and write to) the bucket(s), so that the build script can download a copy of the full Unified Medical Language System (UMLS) distribution. The full UMLS distribution, including SNOMED CT (umls-2022AA-metathesaurus.zip), and the DrugBank distribution (drugbank.xml.gz) will need to be pre-placed in the S3 bucket. (IANAL, but it appears that the UMLS is encumbered by a license preventing redistribution, so I have not hosted it on a public server for download; you can get it for free at the UMLS website if you agree to the UMLS license terms.) The local copy of master-config.shinc will need to be configured so that the variables s3_bucket, s3_bucket_public, and s3_bucket_versioned point to the S3 bucket(s) and so that the shell variable s3_region identifies the AWS region in which the bucket(s) reside(s).
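Before starting a build, it can help to confirm that all four settings named above are present in your local configuration. The sketch below is only an illustration: it assumes that master-config.shinc uses plain `name=value` shell assignments, and the sample text and the helper names `shell_assignments` and `missing_settings` are hypothetical, not copied from the repository.

```python
import re

# Variable names taken from the paragraph above.
REQUIRED = ("s3_bucket", "s3_bucket_public", "s3_bucket_versioned", "s3_region")

# Hypothetical excerpt; the real master-config.shinc may be laid out differently.
SAMPLE_CONFIG = """\
s3_region=us-west-2
s3_bucket=rtx-kg2
s3_bucket_public=rtx-kg2-public
s3_bucket_versioned=rtx-kg2-versioned
"""

# Matches simple shell assignments such as `name=value` or `export name=value`.
ASSIGNMENT = re.compile(r"^\s*(?:export\s+)?([A-Za-z_]\w*)=(.*)$")


def shell_assignments(text):
    """Collect simple shell variable assignments into a dict."""
    values = {}
    for line in text.splitlines():
        match = ASSIGNMENT.match(line)
        if match:
            values[match.group(1)] = match.group(2).strip().strip("'\"")
    return values


def missing_settings(text):
    """Return the required variable names that are absent or empty."""
    values = shell_assignments(text)
    return [name for name in REQUIRED if not values.get(name)]


print(missing_settings(SAMPLE_CONFIG))
```

To check a real file, pass the text of your local master-config.shinc to `missing_settings` instead of the sample string.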

AWS authentication

For the KG2 build system that we (the creators of KG2) have set up for use by Team Expander Agent, the authentication key pair is associated with an IAM account with username kg2-builder. If you are setting up the KG2 build system somewhere else, you will need to obtain your own AWS authentication key pair that connects to an IAM account (or root AWS account, if you want to live dangerously) that has S3 privileges to read from and write to the S3 buckets that are configured in your local copy of master-config.shinc. When you run the KG2 setup script, you will be asked (by the AWS CLI) to provide an authentication key pair.

The build system uploads the final output file kg2-simplified.json.gz to the bucket identified by the shell variable s3_bucket defined in master-config.shinc (for the KG2 creators, that bucket is s3://rtx-kg2). Alternatively, you can set up your own S3 bucket to which to copy the gzipped KG2 JSON file (which you would specify in the configuration file master-config.shinc), or you can comment out the line in finish-snakemake.sh that copies the final gzipped JSON file to the S3 bucket.

You will also need to edit a file RTXConfiguration-config.json (to fill in the correct Neo4j password; a template is in the KG2 source code directory) and place it in the S3 bucket identified by the shell variable s3_bucket in master-config.shinc (for the KG2 creators, that bucket is s3://rtx-kg2/). For a minimal example of the data format for RTXConfiguration-config.json, see the file RTXConfiguration-config-EXAMPLE.json in this repository's code directory (note: that config file can contain authentication information for additional server types in the RTX system; those are not shown in the example file).

Typical EC2 instance type used for building KG2

The KG2 build software has been tested with the following instance type:

  • AMI: Ubuntu Server 18.04 LTS (HVM), SSD Volume Type - ami-005bdb005fb00e791 (64-bit x86)
  • Instance type: r5a.8xlarge (256 GiB of memory)
  • Storage: 1,023 GiB, Elastic Block Storage
  • Security Group: ingress TCP packets on port 22 (ssh) permitted

As of summer 2020, an on-demand r5a.8xlarge instance in the us-west-2 AWS region costs $1.808 per hour, so the cost to build KG2 (estimated to take 54 hours with Snakemake) would be approximately $98 (rough estimate, plus or minus 20%). (Unfortunately, AWS doesn't seem to allow the provisioning of spot instances while specifying minimum memory greater than 240 GiB; but perhaps soon that will happen, and if so, it could save significantly on the cost of updating the RTX KG2.)
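The arithmetic behind that estimate can be reproduced with a few lines of Python. The rate, build time, and uncertainty are the figures quoted above; the function name `build_cost` is only illustrative.

```python
HOURLY_RATE_USD = 1.808  # on-demand r5a.8xlarge in us-west-2, summer 2020
BUILD_HOURS = 54         # estimated build time with Snakemake
UNCERTAINTY = 0.20       # the "plus or minus 20%" quoted above


def build_cost(hourly_rate: float, hours: float) -> float:
    """Cost of keeping one on-demand instance running for the whole build."""
    return hourly_rate * hours


estimate = build_cost(HOURLY_RATE_USD, BUILD_HOURS)
low = estimate * (1 - UNCERTAINTY)
high = estimate * (1 + UNCERTAINTY)
print(f"estimate: ${estimate:.2f} (range ${low:.2f} to ${high:.2f})")
```

Substituting a current hourly rate or a different build time gives an updated figure; storage and data-transfer charges are not included in this sketch.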

Build instructions

Note: to follow the instructions for Option 3 and Option 4 below, in addition to the requirements as described above, you will need to be using the bash shell on your local computer.

Build Option 1: build KG2 in parallel directly on an Ubuntu system:

These instructions assume that you are logged into the target Ubuntu system and that the Ubuntu system has not previously had setup-kg2-build.sh run. If it has, you should first clear out the instance by running clear-instance.sh before proceeding, both to ensure that you get the exact Python packages listed in the latest requirements-kg2-build.txt file in the KG2 codebase and to ensure that your build does not inadvertently reuse artifacts from a previous RTX-KG2 build.

(1) Install the git and screen packages if they are not already installed (though in an Ubuntu 18.04 instance created using the standard AWS AMI, they should already be installed):

sudo apt-get update && sudo apt-get install -y screen git

(2) Change to the home directory for user ubuntu:

cd 

(3) Clone the RTX software from GitHub:

git clone https://github.com/RTXteam/RTX-KG2.git

[An advantage to having the git clone command separated out from the install script is that it provides control over which branch you want to use for the KG2 build code.]

(4) Set up the KG2 build system:

bash -x RTX-KG2/setup-kg2-build.sh

Note that there is no need to redirect stdout or stderr to a log file, when executing setup-kg2-build.sh; this is because the script saves its own stdout and stderr to a log file ~/kg2-build/setup-kg2-build.log. This script takes just a few minutes to complete. At some point, the script will print

fatal error: Unable to locate credentials

This is normal. The script will then prompt you to enter:

  • your AWS Access Key ID
  • your AWS Secret Access Key
    • (both for an AWS account with access to the private S3 bucket that is configured in master-config.shinc)
  • your default AWS region, which in our case is normally us-west-2
    • (you should enter the AWS region that hosts the private S3 bucket that you intend to use with the KG2 build system)
  • When prompted Default output format [None], just hit enter/return.

For KG2 builders on the RTX-KG2 team, just use the keypair for the kg2-builder IAM user.

If all goes well, the setup script should end with the message:

upload: ../setup-kg2-build.log to s3://rtx-kg2-versioned/setup-kg2-build.log

printed to the console. The aforementioned message means that the logfile from running the setup script has been archived in the rtx-kg2-versioned S3 bucket.

(5) Look in the log file ~/kg2-build/setup-kg2-build.log to see if the script completed successfully; it should end with ======= script finished ======. In that case it is safe to proceed.

(6) [THIS STEP IS NORMALLY SKIPPED] If (and only if) you have made code changes to KG2 that will cause a change to the schema for KG2 (or added a major new upstream source database), you will want to increment the "major" release number for KG2. To do that, at this step of the build process, you would run this command:

touch ~/kg2-build/major-release

[MORE COMMON ALTERNATIVE] For regular releases, you want to increment the "minor" release number. This is for situations where changes to the code have been made and the build will likely be deployed. If you want to increment the "minor" release number for KG2, you would run this command:

touch ~/kg2-build/minor-release

If you don't increment the release number at all, you should not be planning to deploy the build. This is useful for cases where you are testing the build system, but not necessarily different code or bug fixes.

(7) Run a "dry-run" build:

bash -x ~/kg2-code/build-kg2-snakemake.sh all -F -n

and inspect the file ~/kg2-build/build-kg2-snakemake-n.log that will be created, to make sure that all of the KG2 build tasks are included. Currently, the file should end with the following count of tasks:

Job counts:
        count   jobs
        1       ChEMBL
        1       ChEMBL_Conversion
        1       DGIdb
        1       DGIdb_Conversion
        1       DisGeNET
        1       DisGeNET_Conversion
        1       DrugBank
        1       DrugBank_Conversion
        1       DrugCentral
        1       DrugCentral_Conversion
        1       Ensembl
        1       Ensembl_Conversion
        1       Finish
        1       GO_Annotations
        1       GO_Annotations_Conversion
        1       HMDB
        1       HMDB_Conversion
        1       IntAct
        1       IntAct_Conversion
        1       JensenLab
        1       Jensenlab_Conversion
        1       KEGG
        1       KEGG_Conversion
        1       Merge
        1       NCBIGene
        1       NCBIGene_Conversion
        1       Ontologies_and_TTL
        1       Reactome
        1       Reactome_Conversion
        1       RepoDB
        1       RepoDB_Conversion
        1       SMPDB
        1       SMPDB_Conversion
        1       SemMedDB
        1       SemMedDB_Conversion
        1       Simplify
        1       Simplify_Stats
        1       Slim
        1       Stats
        1       TSV
        1       UMLS
        1       UniChem
        1       UniChem_Conversion
        1       UniProtKB
        1       UniProtKB_Conversion
        1       ValidationTests
        1       miRBase
        1       miRBase_Conversion
        48
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
+ date
Thu Aug  5 00:00:40 UTC 2021
+ echo '================ script finished ============================'
================ script finished ============================

Assuming the log file looks correct, proceed.
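Checking the job table by eye is error-prone, so a small parser can help. The sketch below assumes the "Job counts:" layout shown in the excerpt above (a header row, one `count  name` row per job, then a bare total); the function name `parse_job_counts` and the shortened sample are illustrative only.

```python
import re

# Shortened sample in the same layout as the dry-run log excerpt above.
SAMPLE_LOG = """\
Job counts:
        count   jobs
        1       ChEMBL
        1       ChEMBL_Conversion
        1       Finish
        3
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
"""


def parse_job_counts(log_text):
    """Return ({job_name: count}, reported_total) from a 'Job counts:' table."""
    jobs, total, in_table = {}, None, False
    for line in log_text.splitlines():
        if line.strip() == "Job counts:":
            in_table = True
            continue
        if not in_table:
            continue
        row = re.fullmatch(r"\s*(\d+)\s+(\S+)\s*", line)
        if row:
            jobs[row.group(2)] = int(row.group(1))
            continue
        total_row = re.fullmatch(r"\s*(\d+)\s*", line)
        if total_row:
            total = int(total_row.group(1))
            break  # the bare number closes the table
    return jobs, total


jobs, total = parse_job_counts(SAMPLE_LOG)
print(f"{len(jobs)} job types, reported total {total}, summed total {sum(jobs.values())}")
```

Run against the text of ~/kg2-build/build-kg2-snakemake-n.log, the summed total should match the reported total (48 in the listing above).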

(8) Initiate a screen session to provide a stable pseudo-tty:

screen

(then hit return to get into the screen session).

(9) THIS STEP COMMENCES THE BUILD. Within the screen session, run:

bash -x ~/kg2-code/build-kg2-snakemake.sh all -F

You may exit out of the screen session using the ctrl-a d key sequence. The all command line argument specifies that you would like to run a full build. This is the best option if you are running on a new instance, or have added upstream sources. Otherwise, consider the following options:

Partial Build of KG2

In some circumstances, if there are no updates to any of the upstream source databases (like UMLS, ChEMBL, SemMedDB, etc.) that are extracted using extract*.sh scripts (as shown in the list of KG2 scripts), you can trigger a "partial" build that just downloads the OBO ontologies and does a build downstream of that. This can be useful in cases where you are testing a change to one of the YAML configuration files for KG2, for example. To do a partial build, in Step (9) above, you would run

bash -x ~/kg2-code/build-kg2-snakemake.sh

(note the absence of the all argument to build-kg2-snakemake.sh). A partial build of KG2 may take about 31 hours. Note, you have to have previously run an all build of KG2, or else the partial build will not work. Note, when doing a partial build, existing KG2 JSON files in the /home/ubuntu/kg2-build directory from previous builds will just get used and will not get updated; if you want any of those files to get updated, you should delete them before running the partial build.

Test Build of KG2

For testing/debugging purposes, it is helpful to have a faster way to exercise the KG2 build code. For this, you may want to execute a "test" build. This build mode builds a smaller graph with a significantly reduced set of nodes and edges. Before you can do a test build, you must have previously done a full non-test build of KG2 (i.e., build-kg2-snakemake.sh all) at least once. To execute a full test build, in Step (9) above, you would run:

bash -x ~/kg2-code/build-kg2-snakemake.sh alltest

In the case of a test build, a couple of log file names are changed:

   ~/kg2-build/build-kg2-snakemake-test.log
   ~/kg2-build/build-kg2-ont-test-stderr.log

and all of the intermediate JSON and TSV files that the build system creates will have -test appended to the filename before the usual filename suffix (.json).
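The naming rule just described (insert -test before the filename suffix) can be written as a one-line transformation. This is only an illustration of the rule as stated; the helper name `add_test_suffix` is hypothetical and the build scripts may construct these names differently.

```python
from pathlib import Path


def add_test_suffix(filename: str) -> str:
    """Insert '-test' before the final filename suffix, e.g. before '.json'."""
    path = Path(filename)
    return str(path.with_name(f"{path.stem}-test{path.suffix}"))


print(add_test_suffix("kg2-simplified.json"))
```

Note that the sketch only looks at the last suffix, so it is meant for names ending in a single extension such as .json.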

Partial Test Build of KG2

To run a partial build of KG2 in "test" mode, the command would be:

bash -x ~/kg2-code/build-kg2-snakemake.sh test

This option is frequently used in testing/development. Note, you have to have previously run an alltest build, or else a test build will not work.

Note that there is no need to redirect stdout or stderr to a log file, when executing build-kg2-snakemake.sh; this is because the script saves its own stdout and stderr to a log file ~/kg2-build/build-kg2-snakemake.log. You can watch the progress of your KG2 build by using this command:

tail -f ~/kg2-build/build-kg2-snakemake.log

That file shows what has finished and what is still happening. If any line says

(exited with non-zero exit code)

the code has failed. However, since the code is running in parallel, to minimize confusion, the stdout and stderr of many of the scripts are written to their own log files, including:

  • build-multi-ont-kg.sh -> ~/kg2-build/build-multi-ont-kg.log
  • dgidb_tsv_to_kg_json.py -> ~/kg2-build/dgidb/dgidb-tsv-to-kg-stderr.log
  • download-repodb-csv.sh -> ~/kg2-build/download-repodb-csv.log
  • drugbank_xml_to_kg_json.py -> ~/kg2-build/drugbank-xml-to-kg-json.log
  • extract-chembl.sh -> ~/kg2-build/extract-chembl.log
  • extract-dgidb.sh -> ~/kg2-build/extract-dgidb.log
  • extract-drugbank.sh -> ~/kg2-build/extract-drugbank.log
  • extract-ensembl.sh -> ~/kg2-build/extract-ensembl.log
  • extract-go-annotations.sh -> ~/kg2-build/extract-go-annotations.log
  • extract-hmdb.sh -> ~/kg2-build/extract-hmdb.log
  • extract-kegg.sh -> ~/kg2-build/extract-kegg.log
  • extract-ncbigene.sh -> ~/kg2-build/extract-ncbigene.log
  • extract-semmeddb.sh -> ~/kg2-build/extract-semmeddb.log
  • extract-smpdb.sh -> ~/kg2-build/extract-smpdb.log
  • extract-umls.sh -> ~/kg2-build/extract-umls.log
  • extract-uniprotkb.sh -> ~/kg2-build/extract-uniprotkb.log
  • extract-unichem.sh -> ~/kg2-build/extract-unichem.log
  • filter_kg_and_remap_predicates.py -> ~/kg2-build/filter_kg_and_remap_predicates.log
  • go_gpa_to_kg_json.py -> ~/kg2-build/go-gpa-to-kg-json.log
  • hmdb_xml_to_kg_json.py -> ~/kg2-build/hmdb-xml-to-kg-json.log
  • run-validation-tests.sh -> ~/kg2-build/run-validation-tests.log
  • semmeddb_tuple_list_json_to_kg_json.py -> ~/kg2-build/semmeddb-tuple-list-json-to-kg-json.log
  • smpdb_csv_to_kg_json.py -> ~/kg2-build/smpdb/smpdb-csv-to-kg-json.log
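Scanning the main log for the failure marker described above can be scripted. The sketch below assumes only that a failed rule produces a line containing the quoted marker text; the sample lines and the function name `find_failures` are hypothetical.

```python
FAILURE_MARKER = "(exited with non-zero exit code)"


def find_failures(lines):
    """Return (line_number, text) pairs for every line containing the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if FAILURE_MARKER in line
    ]


# Hypothetical log excerpt, used only to exercise the function.
sample = [
    "Finished job 12.\n",
    "Error in rule UniChem:\n",
    "    (exited with non-zero exit code)\n",
]
for number, text in find_failures(sample):
    print(f"line {number}: {text.strip()}")
```

Because `find_failures` accepts any iterable of lines, an open file handle for ~/kg2-build/build-kg2-snakemake.log can be passed to it directly; the lines just before each hit usually name the rule, which points to the per-script log listed above.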

If a build using Snakemake fails and the output file for the rule it failed on doesn't exist, you can continue the build such that it only reruns the rule(s) that don't already have an output file and all of the rules after that rule(s). For example, if a build fails on multi_ont_to_json_kg.py, wait for the build to completely fail (build-kg2-snakemake.sh won't be running at all, which you can check using top or htop), then change the following line in build-kg2-snakemake.sh to have it run multi_ont_to_json_kg.py, merge_graphs.py, etc.

Normal Line:

cd ~ && ${VENV_DIR}/bin/snakemake --snakefile ${snakefile} -F -j

New Line:

cd ~ && ${VENV_DIR}/bin/snakemake --snakefile ${snakefile} -R Finish -j

Note that -F, which forces all rules that lead up to Finish (the first rule in the Snakefile) to run regardless of whether their output files exist, has been changed to -R Finish, which only forces the rule that failed and the rules that depend on that rule's output to run. If you're unsure which rules your edited snakemake command will run, you can always add -n: this causes snakemake to do a dry run, which just prints the rules that would be run to the log file without actually running them.

At the end of the build process, you should inspect the log file ~/kg2-build/filter_kg_and_remap_predicates.log to see if there are warnings like relation curie is missing from the YAML config file: CURIEPREFIX:some_predicate, where CURIEPREFIX could be any CURIE prefix in curies-to-urls-map.yaml and some_predicate is a snake-case predicate label (or, in the case of Relation Ontology, a numeric identifier). Any warnings of this format in filter_kg_and_remap_predicates.log probably indicate that an addition needs to be made to the file predicate-remap.yaml, followed by a partial rebuild starting with filter_kg_and_remap_predicates.py (the Simplify rule).
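Collecting the affected CURIEs from that log can be automated. The sketch below assumes the warning text is exactly as quoted above, followed by a single whitespace-free CURIE; the sample log lines and the function name `missing_relation_curies` are hypothetical.

```python
import re

# Warning text as quoted in this README, followed by the relation CURIE.
WARNING_RE = re.compile(
    r"relation curie is missing from the YAML config file: (\S+)"
)


def missing_relation_curies(log_text):
    """Return the distinct relation CURIEs mentioned in warnings, sorted."""
    return sorted(set(WARNING_RE.findall(log_text)))


# Hypothetical log excerpt; the CURIEs are placeholders.
sample_log = """\
relation curie is missing from the YAML config file: EXAMPLE:some_predicate
some unrelated log line
relation curie is missing from the YAML config file: EXAMPLE:other_predicate
relation curie is missing from the YAML config file: EXAMPLE:some_predicate
"""
for curie in missing_relation_curies(sample_log):
    print(curie)
```

Each CURIE in the resulting list is a candidate for a new entry in predicate-remap.yaml.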

What to do if a build fails

  • Let's suppose the build failed on the rule UniChem. In that case, you could fix the bug and then test your bug fix by running /home/ubuntu/kg2-venv/bin/snakemake --snakefile /home/ubuntu/kg2-code/Snakefile -R --until UniChem, which just runs that rule. Note that you should only use the above command after you have run build-kg2-snakemake.sh (as in Step (9) above) at least once; otherwise you will get an error because the required Snakefile ~/kg2-code/Snakefile will not yet exist. Assuming that the above command is successful, you could then proceed.

  • Restart the full build:

bash -x ~/kg2-code/build-kg2-snakemake.sh all

(Note, you only need the all above if the rule is for an "extract-XXX.sh" script; if it is for a rule that is downstream of the extract scripts, you can omit all.)

Note about versioning of KG2

KG2 has semantic versioning with a graph/major/minor release system:

  • The graph release number is always 2.
  • The major release number is incremented when the schema for KG2 is changed (and the minor release is set to zero in that case)
  • The minor release number is incremented for each non-test build for which the schema is not modified.

So an example version of KG2 would be "RTX KG 2.1.3" (graph release 2, major release 1, minor release 3). This build version is recorded in three places:

  • the top-level build slot in the KG2 JSON file
  • in the name field of a node object with id field RTX:KG2 (in both the JSON version of the KG and in the Neo4j version of the KG)
  • the file s3://rtx-kg2-public/kg2-version.txt in the S3 bucket rtx-kg2-public.

By default, the KG2 build process (as outlined above) will automatically increment the minor release number and update the file kg2-version.txt in the S3 bucket. If you are doing a build in which the KG2 schema has changed, you should trigger the incrementing of the major release version by making sure to do step (6) above. The build script (specifically, the script version.sh) will automatically delete the file ~/kg2-build/major-release so that it will not persist for the next build. Note: if the build system happens to terminate unexpectedly while running version.sh, or after the Simplify rule, you should check what state the file s3://rtx-kg2-public/kg2-version.txt was left in.
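The graph/major/minor rules above can be summarized in a short function. This is a sketch of the rules as stated in this section, not the logic of version.sh itself; the function name `next_version` and the bare `graph.major.minor` string format are assumptions made for illustration.

```python
def next_version(current: str, schema_changed: bool) -> str:
    """Apply the release rules to a 'graph.major.minor' version string."""
    graph, major, minor = (int(part) for part in current.split("."))
    if schema_changed:
        # Major release: bump major and reset minor to zero.
        major, minor = major + 1, 0
    else:
        # Regular release: bump minor only.
        minor += 1
    return f"{graph}.{major}.{minor}"


print(next_version("2.1.3", schema_changed=False))
print(next_version("2.1.3", schema_changed=True))
```

The graph release number is carried through unchanged, matching the rule that it is always 2.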

The version history for KG2 can be found here.

Build Option 2: build KG2 serially (about 67 hours) directly on an Ubuntu system (DEPRECATED):

This method is deprecated.

(1)-(7) Follow steps (1)-(7) in Build Option 1.

(8) Within the screen session, run:

bash -x ~/kg2-code/build-kg2-DEPRECATED.sh all

Then exit screen (ctrl-a d). Note that there is no need to redirect stdout or stderr to a log file, when executing build-kg2-DEPRECATED.sh; this is because the script saves its own stdout and stderr to a log file build-kg2.log. You can watch the progress of your KG2 build by using this command:

tail -f ~/kg2-build/build-kg2.log

Note that the build-multi-ont-kg.sh script also saves stderr from running multi_ont_to_json_kg.py to a file ~/kg2-build/build-kg2-ont-stderr.log.

Partial build of KG2

Caution: Be sure to remove any files that should not be in the build. We highly recommend running rm kg2-build/kg2*json.

Like with the parallel build system, you can run a sequential partial build. To do a partial build, in Step (8) above, you would run

bash -x ~/kg2-code/build-kg2-DEPRECATED.sh

(note the absence of the all argument to build-kg2-DEPRECATED.sh). A partial build of KG2 may take about 40 hours. Note, you have to have previously run an all build of KG2, or else the partial build will not work.

Test build of KG2

To execute a sequential test build, in Step (8) above, you would run:

bash -x ~/kg2-code/build-kg2-DEPRECATED.sh alltest

In the case of a test build, the build log file names are changed:

~/kg2-build/build-kg2-test.log
~/kg2-build/build-kg2-ont-test-stderr.log

and all of the intermediate JSON and TSV files that the build system creates will have -test appended to the filename before the usual filename suffix (.json).

Partial test build of KG2

To run a partial sequential build of KG2 in "test" mode, the command would be:

bash -x ~/kg2-code/build-kg2-DEPRECATED.sh test

Build Option 3: set up ssh key exchange so you can build KG2 in a remote EC2 instance

This option requires that you have curl installed on your local computer. In a bash terminal session, set up the remote EC2 instance by running this command (requires ssh installed and in your path):

source <(curl -s https://raw.githubusercontent.com/RTXteam/RTX-KG2/master/ec2-setup-remote-instance.sh)

You will be prompted to enter the path to your AWS PEM file and the hostname of your AWS instance. The script should then initiate a bash session on the remote instance. Within that bash session, continue to follow the instructions for Build Option 1, starting at step (4).

Build Option 4: In an Ubuntu container in Docker

For Build Option 4, you will need a *lot* of disk space (see disk storage requirements above) in the root file system, unless you modify the Docker installation to store containers in some other (non-default) file system location. Here are the instructions:

(1) Install Docker. If you are on Ubuntu 18.04 and you need to install Docker, you can run this command in bash on the host OS:

source <(curl -s https://raw.githubusercontent.com/RTXteam/RTX-KG2/master/install-docker-ubuntu18.sh)

(otherwise, the subsequent commands in this section assume that Docker is installed on whatever host system you are running). For some notes on how to install Docker on macOS via the Homebrew system, see macos-docker-notes.md. NOTE: if your Docker installation (like on macOS Homebrew) does not require sudo, just omit sudo everywhere you see sudo docker in the steps below.

(2) Build a Docker image kg2:latest:

sudo docker image build -t kg2 https://raw.githubusercontent.com/RTXteam/RTX-KG2/master/Dockerfile 

(3) Create a container called kg2 from the kg2:latest image

sudo docker create --name kg2 kg2:latest

(4) Start the kg2 container:

sudo docker start kg2

(5) Open a bash shell as user root inside the container:

sudo docker exec -it kg2 /bin/bash

(6) Become user ubuntu:

su - ubuntu

Now follow the instructions for Build Option 1 above.

Possible failure modes for the KG2 build

Occasionally a build will fail due to a connection error in attempting to cURL a file from one of the upstream sources (e.g., SMPDB, and less frequently, UniChem).

Another failure mode is the versioning of ChEMBL. Once ChEMBL upgrades its dataset, old datasets may become unavailable, which causes the download to fail. To fix this, change the version number in extract-chembl.sh.

The output KG

The build-kg2.sh script (run via one of the three methods shown above) creates a gzipped JSON file kg2-simplified.json.gz and copies it to an S3 bucket rtx-kg2. You can access the gzipped JSON file using the AWS command-line interface (CLI) tool aws with the command

aws s3 cp s3://rtx-kg2/kg2-simplified.json.gz .

The TSV files for the knowledge graph can be retrieved in the same way:

aws s3 cp s3://rtx-kg2/kg2-tsv.tar.gz .

You can access the various artifacts from the KG2 build (config file, log file, etc.) at the AWS static website endpoint for the rtx-kg2-public S3 bucket: http://rtx-kg2-public.s3-website-us-west-2.amazonaws.com/

Each build of KG2 is labeled with a unique build date/timestamp. The build timestamp can be found in the build slot of the kg2-simplified.json file and it can be found in the node with ID RTX:KG2 in the Neo4j KG2 database. Due to the size of KG2, we are not currently archiving old builds of KG2 and that is why kg2-simplified.json and the related large KG2 JSON files are stored in a non-versioned S3 bucket.

Optional KG2 PubMed Build

You can add PubMed ID nodes and PubMed->MeSH edges to your KG2 for every PubMed ID referenced in KG2 (whether in an edge property, publications or publications_info, or in the node property publications). This process isn't currently optimized.

(1) Build KG2 up through the merge step (merge_graphs.py).

(2) Generate a list of PMIDs referenced in KG2 in a screen session:

~/kg2-venv/bin/python3 ~/kg2-code/extract_kg2_pmids.py ~/kg2-build/kg2.json ~/kg2-build/pmids-in-kg2.json
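To illustrate what "PMIDs referenced in KG2" means in step (2), here is a minimal Python sketch. It is not the code of extract_kg2_pmids.py; the function name collect_pmids is hypothetical, and the sketch assumes that the graph has already been loaded into memory as a dictionary with nodes and edges lists that use the property names publications and publications_info.

```python
from typing import Any, Dict, Set


def collect_pmids(graph: Dict[str, Any]) -> Set[str]:
    """Return every CURIE with the PMID prefix referenced by a node or an edge.

    Assumption: `graph` has top-level "nodes" and "edges" lists, and the
    publication properties may be missing or null.
    """
    pmids: Set[str] = set()
    for node in graph.get("nodes", []):
        for curie in node.get("publications") or []:
            if curie.startswith("PMID:"):
                pmids.add(curie)
    for edge in graph.get("edges", []):
        for curie in edge.get("publications") or []:
            if curie.startswith("PMID:"):
                pmids.add(curie)
        # publications_info is a dictionary keyed by publication CURIE
        for curie in edge.get("publications_info") or {}:
            if curie.startswith("PMID:"):
                pmids.add(curie)
    return pmids
```

Given the size of the real kg2.json file, loading it this way needs a large-memory instance; the sketch is only meant to show which properties are scanned.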

(3) Potentially at the same time as step 2 -- this step doesn't take much memory -- download the PubMed XML files.

bash -x ~/kg2-code/extract-pubmed.sh

(4) On an r5a.16xlarge instance (or an instance with comparable memory) that has the PubMed XML files and the list of PMIDs in KG2 as a JSON file, build your KG2 JSON file for PubMed. This JSON file will be approximately 66 GB in size.

~/kg2-venv/bin/python3 ~/kg2-code/pubmed_xml_to_kg_json.py ~/kg2-build/pubmed ~/kg2-build/pmids-in-kg2.json ~/kg2-build/kg2-pubmed.json

(5) The format of kg2-pubmed.json matches kg2.json but not kg2-simplified.json. For this reason, at this time, we have to merge kg2-pubmed.json into kg2.json. Then, a kg2-simplified.json can be made from the output. Eventually, it might be preferable to have kg2-pubmed.json generated to match the format of kg2-simplified.json, especially since its predicates do not have to go through the predicate remap process and loading kg2-pubmed.json takes a lot of memory. UNTESTED.

~/kg2-venv/bin/python3 ~/kg2-code/merge_graphs.py --kgFileOrphanEdges ~/kg2-build/kg2-pubmed-merge-orphan-edges.json --outputFile ~/kg2-build/kg2-with-pubmed.json ~/kg2-build/kg2.json ~/kg2-build/kg2-pubmed.json

(6) Run the filter_kg_and_remap_predicates.py script on this new JSON file (and optionally get_nodes_json_from_kg_json.py and report_stats_on_json_kg.py -- you can't run these in parallel due to memory considerations, so be aware of what is absolutely necessary to generate). UNTESTED

~/kg2-venv/bin/python3 ~/kg2-code/filter_kg_and_remap_predicates.py ~/kg2-code/predicate-remap.yaml ~/kg2-build/kg2-with-pubmed.json ~/kg2-build/kg2-with-pubmed-simplified.json

(7) Generate TSV files for the new, simplified JSON file (and optionally run get_nodes_json_from_kg_json.py and report_stats_on_json_kg.py on the simplified JSON file). UNTESTED

rm -rf ~/kg2-build/PubMedKG2TSV/
mkdir -p ~/kg2-build/PubMedKG2TSV/
~/kg2-venv/bin/python3 ~/kg2-code/kg_json_to_tsv.py ~/kg2-build/kg2-with-pubmed-simplified.json ~/kg2-build/PubMedKG2TSV

Updating the installed KG2 build system software

We generally try to make the KG2 shell scripts idempotent, following best practice for *nix shell scripting. However, changes to setup-kg2-build.sh (or setup-kg2-neo4j.sh) that would bring in a new version of a major software dependency (e.g., Python) of the KG2 build system are not usually tested for whether they can also upgrade an existing installation of the build system; this is especially an issue for software dependencies that are installed using apt-get. In the event that setup-kg2-build.sh undergoes a major change that would trigger such an upgrade (e.g., from Python3.7 to Python3.8), instead of rerunning setup-kg2-build.sh on your existing build system, we recommend that you create a clean Ubuntu 18.04 instance and install using setup-kg2-build.sh.

Hosting KG2 in a Neo4j server on a new AWS instance

We host our production KG2 graph database in Neo4j version 3.5.13 with APOC 3.5.0.4, on an Ubuntu 18.04 EC2 instance with 64 GiB of RAM and 8 vCPUs (r5a.2xlarge) in the us-east-2 AWS region.

Installation: in a newly initialized Ubuntu 18.04 AWS instance, as user ubuntu, run the following commands:

(1) Make sure you are in your home directory:

cd

(2) Clone the RTX software from GitHub:

git clone https://github.com/RTXteam/RTX-KG2.git

(3) Install and configure Neo4j, with APOC:

RTX-KG2/setup-kg2-neo4j.sh

This script takes just a few minutes to complete. At some point, the script will print

fatal error: Unable to locate credentials

This is normal. The script will then prompt you to enter your AWS Access Key ID and AWS Secret Access Key, for an AWS account with access to the private S3 bucket that is configured in master-config.shinc. It will also ask you to enter your default AWS region; you should enter the AWS region that hosts the private S3 bucket that you intend to use with the KG2 build system, which in our case would be us-west-2. When prompted Default output format [None], just hit enter/return. Also, the setup script will print a warning

WARNING: Max 1024 open files allowed, minimum of 40000 recommended. See the Neo4j manual.

but this, too, can be ignored. (The /lib/systemd/service/neo4j.service file that is installed (indirectly) by the setup script actually sets the limit to 60000 for when the Neo4j database system is run via systemd; when running neo4j-admin at the CLI to set the password, Neo4j doesn't know this and reports a limit warning.)

(4) Look in the log file ${HOME}/setup-kg2-neo4j.log to see if the script completed successfully; it should end with ======= script finished ======.

(5) Start up a screen session, and within that screen session, load KG2 into Neo4j:

RTX-KG2/tsv-to-neo4j.sh > ~/kg2-build/tsv-to-neo4j.log 2>&1

This script takes over three hours to complete.

(6) Look in the log file ~/kg2-build/tsv-to-neo4j.log to see if the script completed successfully; it should end with ======= script finished ======.
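The log checks in steps (4) and (6) can be automated. The following is a minimal Python sketch, not part of the KG2 code base; the function name log_finished is hypothetical, and it assumes that the last non-blank line of a successful log contains the text "script finished" (surrounded by "=" characters, as shown above).

```python
from pathlib import Path

# Text expected in the last non-blank line of a successful log (see above).
FINISHED_MARKER = "script finished"


def log_finished(log_path: str) -> bool:
    """Return True if the last non-blank line of the log contains the marker."""
    text = Path(log_path).expanduser().read_text(errors="replace")
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and FINISHED_MARKER in lines[-1]
```

For example, log_finished("~/kg2-build/tsv-to-neo4j.log") would be called after the load script exits.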

Reloading KG2 into an existing Neo4j server

Once you have loaded KG2 into Neo4j as described above, if you want to reload KG2, just run (as user ubuntu):

~/RTX-KG2/tsv-to-neo4j.sh > ~/kg2-build/tsv-to-neo4j.log 2>&1

Co-hosting the KG2 build system and Neo4j server?

In theory, it should be possible to install Neo4j and load KG2 into it on the same Ubuntu instance where KG2 was built; but this workflow is usually not tested since in our setup, we nearly always perform the KG2 build and Neo4j hosting on separate AWS instances. This is because the system requirements to build KG2 are much greater than the system requirements to host KG2 in Neo4j.

Post-setup tasks

  • We typically define a DNS CNAME record for the KG2 Neo4j server hostname, of the form kg2endpoint-kg2-X-Y.rtx.ai, where X is the major version number and Y is the minor version number.

  • Before you release a new build of KG2, please update the version history markdown file with the new build version and the numbers of the GitHub issues that are addressed/implemented in the new KG2 version.

    • After a build has successfully completed, add a tag with the kg2 version number
    • Follow the format "KG2.X.Y", where X is the major version number and Y is the minor version number
     git tag -a KG2.X.Y -m "<name of build host used>"
     git push --tags
    
  • Wherever possible we try to document the name of the build host (EC2 instance) used for the KG2 build in kg2-versions.md and we try to preserve the kg2-build directory and its contents on that host, until a new build has superseded the build. Having the build directory available on the actual build host is very useful for tracking down the source of an unexpected relationship or node property. Any new data sources in the build or major updates (e.g., DrugBank, UMLS, or ChEMBL) should also be noted in the kg2-versions.md file.

  • The JSON report kg2-simplified-report.json is one of the key build artifacts; it should be inspected as a part of the post-build quality assessment process.
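As a small illustration of the naming patterns in the list above, the following Python sketch derives the git tag and the DNS hostname from a major and minor version number. The function name kg2_release_names is hypothetical and is not part of the KG2 code base.

```python
from typing import Dict


def kg2_release_names(major: int, minor: int) -> Dict[str, str]:
    """Build the git tag and Neo4j hostname for KG2 version major.minor."""
    return {
        # Tag format "KG2.X.Y"
        "git_tag": f"KG2.{major}.{minor}",
        # CNAME format kg2endpoint-kg2-X-Y.rtx.ai
        "hostname": f"kg2endpoint-kg2-{major}-{minor}.rtx.ai",
    }
```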

Schema of the JSON KG2

The file kg2.json is an intermediate file that is probably only of use to KG2 developers. The file kg2-simplified.json is a key artifact of the build process that feeds into several downstream artifacts and may be of direct use to application developers. Newlines, carriage returns, linefeed characters, or hard tabs are not allowed in any string property or in any string scalar within a list property in KG2. The kg2-simplified.json JSON data structure is a name-value pair object (i.e., dictionary) with the following keys:

build slot

The top-level build slot contains a dictionary whose keys are:

  • version: a string containing the version identifier for the KG2 build, like RTX KG2.2.3. For a "test" build, the version identifier will have -TEST appended to it.
  • timestamp_utc: a string containing the ISO 8601 date/timestamp (in UTC) for the build, like this: 2020-08-11 21:51.
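A minimal Python sketch of reading the build slot follows. It assumes the timestamp is formatted exactly like the example above (YYYY-MM-DD HH:MM); the function name parse_build_slot is hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def parse_build_slot(build: Dict[str, str]) -> Dict[str, Any]:
    """Extract the version, the test-build flag, and a UTC datetime."""
    version = build["version"]
    # Assumption: the timestamp looks like "2020-08-11 21:51"
    timestamp = datetime.strptime(build["timestamp_utc"], "%Y-%m-%d %H:%M")
    return {
        "version": version,
        "is_test_build": version.endswith("-TEST"),
        "timestamp": timestamp.replace(tzinfo=timezone.utc),
    }
```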

nodes slot

The top-level nodes slot contains a list of node objects. Each node object has the following keys:

  • category: a string containing a CURIE ID for the semantic type of the node, as a category in the Biolink model. Example: biolink:Gene.
  • category_label: a snake_case representation of the category field, without the biolink: CURIE prefix.
  • creation_date: a string identifier of the date on which this node object was first created in the upstream source database; it has (at present) no consistent format, unfortunately (usual value is null).
  • deprecated: a Boolean field indicating whether or not this node has been deprecated by the upstream source database (usual value is false).
  • description: a narrative description field for the node, in prose text
  • full_name: a longer name for the node (often is identical to the name field)
  • id: a CURIE ID for the node; this CURIE ID will be unique across nodes in KG2 (that constraint is enforced in the build process)
  • iri: a URI where the user can get more information about this node (we try to make these resolvable wherever possible)
  • name: a display name for the node
  • knowledge_source: A CURIE ID (which corresponds to an actual node in KG2) for the upstream information resource that is the definitive source for information about this node.
  • provided_by: This slot is deprecated. Refer to knowledge_source.
  • publications: a list of CURIE IDs of publications (e.g., PMID or ISBN or DOI identifiers) that contain information about this node
  • replaced_by: a CURIE ID for the node that replaces this node, for cases when this node has been deprecated (usually it is null).
  • synonym: a list of strings with synonyms for the node; if the node is a gene, the first entry in the list should be the official gene symbol; other types of information can for certain node types be found in this list, such as protein sequence information for UniProt protein nodes. The entries in the node synonym property (which is of type list) are not guaranteed to be id fields of actual nodes in KG2. Also, they are not comprehensive; if node Y is related to node X by a biolink:same_as relation type, there is no guarantee that Y will be in the synonym property list for X (in most cases, it won't be).
  • update_date: a string identifier of the date on which the information for this node object was last updated in the upstream source database; it has (at present) no consistent format, unfortunately; it is usually not null.
  • has_biological_sequence: a string of sequence information for nodes from DrugBank (SMILES), ChEMBL (Canonical SMILES), HMDB (SMILES), miRBase ("sequence" - appears to be amino acids), and UniProtKB ("sequence" - also appears to be amino acids). For nodes from other sources, this property is null.
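Two of the constraints stated in this section (node id values are unique; no newlines, carriage returns, or hard tabs in string properties or in strings within list properties) can be checked with a short script. The following Python sketch is illustrative only and is not the check that the build process performs; the function name check_nodes is hypothetical.

```python
from typing import Any, Dict, Iterable, List

# Characters that are not allowed in KG2 string values
FORBIDDEN = ("\n", "\r", "\t")


def check_nodes(nodes: Iterable[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems found in the node objects."""
    problems: List[str] = []
    seen_ids = set()
    for node in nodes:
        node_id = node.get("id")
        if node_id in seen_ids:
            problems.append(f"duplicate id: {node_id}")
        seen_ids.add(node_id)
        for key, value in node.items():
            # Treat a scalar as a one-element list so both cases share a loop
            items = value if isinstance(value, list) else [value]
            for item in items:
                if isinstance(item, str) and any(ch in item for ch in FORBIDDEN):
                    problems.append(f"forbidden whitespace in {key} of {node_id}")
    return problems
```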

edges slot

  • edges: a list of edge objects. Each edge object has the following keys:
    • relation_label: a snake_case representation of the plain English label for the original predicate for the edge provided by the upstream source database (see the relation field)
    • negated: a Boolean field indicating whether or not the edge relationship is "negated"; usually false, in the normal build process for KG2
    • object: the CURIE ID (id) for the KG2 node that is the object of the edge
    • knowledge_source: A list containing CURIE IDs (each of which corresponds to an actual node in KG2) for the upstream information resources that reported this edge's specific combination of subject/predicate/object (in the case of multiple providers for an edge, the other fields like publications are merged from the information from the multiple sources).
    • publications: a list of CURIE IDs of publications supporting this edge (e.g., PMID or ISBN or DOI identifiers)
    • publications_info: a dictionary whose keys are CURIE IDs from the list in the publications field, and whose values are described in the next subsection ("publications_info slot")
    • predicate_label: a snake_case representation of the plain English label for the simplified predicate (see the predicate field); in most cases this is a predicate type from the Biolink model.
    • predicate: a CURIE ID for the simplified relation
    • subject: the CURIE ID (id) for the KG2 node that is the subject of the edge
    • update_date: a string identifier of the date on which the information for this edge object was last updated in the upstream source database; it has (at present) no consistent format, unfortunately; it is usually not null.
    • id: a concatenated string of other edge attributes that uniquely identifies the edge. It follows the format subject---relation---object---provided_by.
    • source_predicate: a CURIE ID for the relation as reported by the upstream database source.
    • provided_by: deprecated. Refer to knowledge_source.
    • relation: deprecated. See source_predicate.
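The edge id format described above can be sketched in Python as follows. This takes the documented format (subject---relation---object---provided_by) literally and assumes that none of the four components contains the separator "---"; the function names and the placeholder CURIEs in the usage note are hypothetical.

```python
from typing import Dict

SEPARATOR = "---"
FIELDS = ("subject", "relation", "object", "provided_by")


def make_edge_id(subject: str, relation: str, object_: str, provided_by: str) -> str:
    """Concatenate the four components into an edge id."""
    return SEPARATOR.join((subject, relation, object_, provided_by))


def split_edge_id(edge_id: str) -> Dict[str, str]:
    """Split an edge id back into its four named components."""
    parts = edge_id.split(SEPARATOR)
    if len(parts) != len(FIELDS):
        raise ValueError("unexpected edge id format: " + edge_id)
    return dict(zip(FIELDS, parts))
```

For example, make_edge_id("EX:1", "EX:rel", "EX:2", "EX:source") gives "EX:1---EX:rel---EX:2---EX:source", and split_edge_id recovers the four components.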

publications_info slot

If it is not null, the publications_info object's values are objects containing the following name/value pairs:

  • publication date: string representation of the date of the publication, in ISO 8601 format (%Y-%m-%d %H:%i:%S)
  • sentence: a string containing the natural language sentence from which the edge was inferred (this is only not null for SemMedDB edges, at present)
  • subject score: a string containing a confidence score; for SemMedDB edges, this score corresponds to a confidence with which the subject of the triple was correctly identified; for other edges (like ChEMBL drug to target predictions), the score corresponds to a confidence in a computational prediction of the ligand-to-target binding relationship; NOTE: there is at present no unified scale for this field, unfortunately
  • object score: for SemMedDB edges, this score corresponds to a confidence with which the object of the triple was correctly identified; otherwise null
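The relationship between an edge's publications list and its publications_info dictionary can be checked with a short script. The following Python sketch assumes that the value keys are spelled exactly as in the list above (with spaces); the function name check_publications_info is hypothetical.

```python
from typing import Any, Dict, List

# Assumption: key names exactly as listed in this section
KNOWN_KEYS = {"publication date", "sentence", "subject score", "object score"}


def check_publications_info(edge: Dict[str, Any]) -> List[str]:
    """Return problems found in one edge's publications_info dictionary."""
    problems: List[str] = []
    publications = set(edge.get("publications") or [])
    info = edge.get("publications_info") or {}
    for curie in sorted(info):
        if curie not in publications:
            problems.append(f"{curie} is not in publications")
        unknown = sorted(set(info[curie] or {}) - KNOWN_KEYS)
        if unknown:
            problems.append(f"{curie} has unknown keys: {', '.join(unknown)}")
    return problems
```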

Biolink compliance

KG2 aims to comply with the Biolink knowledge graph format.

Files generated by the KG2 build system (UNDER DEVELOPMENT)

  • kg2-simplified.json: This is the main KG2 graph, in JSON format (48 GiB).
  • kg2-slim.json: This is the simplified KG2 graph with a restricted set of node and edge properties included.
  • kg2.json: This is the KG2 graph before Biolink predicates are added; it is only of interest to KG2 developers.
  • kg2-simplified-report.json: A JSON report giving statistics on the kg2-simplified.json knowledge graph.
  • kg2-version.txt: Tracks the version of the last build of KG2.

Frequently asked questions

Where can I download a pre-built copy of KG2?

Dump files of RTX-KG2pre and RTX-KG2c are available for download in the github:ncats/translator-lfs-artifacts project area.

What licenses cover KG2?

It's complicated. The KG2 build software is provided free-of-charge via the MIT license. All documentation for KG2 and any downloadable build artifacts hosted on GitHub or S3 are provided free-of-charge via the CC-BY license (https://creativecommons.org/licenses/by/4.0/). If you are using KG2 in your work, we ask that you attribute credit to the KG2 team as follows: RTX KG2 development team, github.com/RTXteam. Our assertion of the CC-BY license covers only the creative product of our team (documentation, reports, and knowledge graph formatting); the actual content of the KG2 knowledge graph is encumbered by various licenses (e.g., UMLS) that prevent its redistribution.

What criteria do you use to select sources to include in KG2?

We emphasize knowledge sources that

  1. Are available in a flat-file download (e.g., TSV, XML, JSON, DAT, or SQL dump)
  2. Are being maintained and updated periodically
  3. Provide content/knowledge that complements (does not duplicate) what is already in KG2.
  4. Connect concept identifiers that are already in KG2.
  5. Ideally, provide knowledge based on human curation (favored over computational text-mining).

Troubleshooting (UNDER DEVELOPMENT)

Errors in multi_ont_to_json_kg.py

Errors in convert_bpv_predicate_to_curie

  • An error like the following:
File "/home/ubuntu/kg2-code/multi_ont_to_json_kg.py", line 1158, in convert_bpv_predicate_to_curie
     raise ValueError('unable to expand CURIE: ' + bpv_pred)
ValueError: unable to expand CURIE: MONARCH:cliqueLeader     

would indicate that the CURIE prefix (in this case, MONARCH) needs to be added to the use_for_bidirectional_mapping section of the curies-to-urls-map.yaml config file.
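To show why a missing prefix produces this error, here is a minimal Python sketch of CURIE expansion against a prefix map. It is not the code of multi_ont_to_json_kg.py; the function name expand_curie, the flat dictionary layout of the prefix map, and the example.org URL are assumptions made for illustration.

```python
from typing import Dict


def expand_curie(curie: str, prefix_to_url: Dict[str, str]) -> str:
    """Expand "PREFIX:local_id" to a URL using a prefix-to-URL-base map."""
    prefix, separator, local_id = curie.partition(":")
    if not separator or prefix not in prefix_to_url:
        # Same message style as the error shown above
        raise ValueError("unable to expand CURIE: " + curie)
    return prefix_to_url[prefix] + local_id
```

With an empty map, expand_curie("MONARCH:cliqueLeader", {}) raises the ValueError; once a MONARCH entry is present in the map (here a placeholder such as "http://example.org/monarch/"), the same call returns a URL.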

Error building DAG of jobs

  • In the case where Snakemake is forcibly quit due to a loss of power or other reason, it may result in the code directory becoming locked. To resolve, run:
/home/ubuntu/kg2-venv/bin/snakemake --snakefile /home/ubuntu/kg2-code/Snakefile --unlock

Authentication Error in tsv-to-neo4j.sh

Sometimes, when hosting KG2 in a Neo4j server on a new AWS instance, the initial password does not get set correctly, which will lead to an Authentication Error in tsv-to-neo4j.sh. To fix this, do the following:

  1. Start up Neo4j (sudo service neo4j start)
  2. Wait one minute, then confirm Neo4j is running (sudo service neo4j status)
  3. Use a browser to connect to Neo4j via HTTP on port 7474. You should see a username/password authentication form.
  4. Fill in "neo4j" and "neo4j" for username and password, respectively, and submit the form. You should be immediately prompted to set a new password. At that time, type in our "usual" Neo4j password (you'll have to enter it twice).
  5. When you submit the form, Neo4j should be running and it should now have the correct password set.

Errors in Extraction rules

Role exists error

Occasionally, when a database needs to be re-extracted, the error ERROR: role "jjyang" already exists occurs. If the following command is not already in the extraction script, add it on the line above where the role is created.

sudo -u postgres psql -c "DROP ROLE IF EXISTS ${role}"

For Developers

This section has some guidelines for the development team for the KG2 build system.

KG2 coding standards

  • Hard tabs are not permitted in source files such as python or bash (use spaces).

Python coding standards for KG2

  • Only python3 is allowed.
  • Please follow PEP8 formatting standards, except we allow line length to go to 160.
  • Please use type hints wherever possible.

Shell coding standards for KG2

  • Use lower-case for variable names except for environment variables.
  • The flags nounset, pipefail, and errexit should be set.

File naming

  • For config files and shell scripts, use kebab-case.
  • For python modules, use snake_case.
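The file naming rules above could be checked with a small script such as the following Python sketch. It is not part of the KG2 code base; the function name follows_naming_convention is hypothetical, and the choice of which extensions count as shell scripts or config files (.sh, .shinc, .yaml) is an assumption.

```python
import re

KEBAB_CASE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
SNAKE_CASE = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")

# Assumption: these extensions identify shell scripts and config files
KEBAB_EXTENSIONS = {"sh", "shinc", "yaml"}


def follows_naming_convention(filename: str) -> bool:
    """Check a file name against the kebab-case / snake_case rules."""
    stem, dot, extension = filename.rpartition(".")
    if not dot:
        return True  # no extension, so no rule from the list applies
    if extension == "py":
        return bool(SNAKE_CASE.match(stem))
    if extension in KEBAB_EXTENSIONS:
        return bool(KEBAB_CASE.match(stem))
    return True  # other file types are not covered by the rules
```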

Credits

Thank you to the many people who have contributed to the development of RTX KG2:

Code and development work

Stephen Ramsey, E. C. Wood, Amy Glen, Lindsey Kvarfordt, Finn Womack, Liliana Acevedo, Veronica Flores, and Deqing Qu.

Advice and feedback

David Koslicki, Eric Deutsch, Yao Yao, Jared Roach, Chris Mungall, Tom Conlin, Matt Brush, Chunlei Wu, Harold Solbrig, Will Byrd, Michael Patton, Jim Balhoff, Chunyu Ma, Chris Bizon,
Deepak Unni, Richard Bruskiewich, and Jeff Henrikson.

Funding

National Center for Advancing Translational Sciences (award number OT2TR002520).