diff --git a/README.md b/README.md index 981d302..959de6e 100755 --- a/README.md +++ b/README.md @@ -2,25 +2,31 @@ ### Overview -The germline variant annotator (*gvanno*) is a simple, Docker-based software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). +The germline variant annotator (*gvanno*) is a simple, software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with the [Docker](https://www.docker.com) technology, but it can also be installed through the [Singularity](https://sylabs.io/docs/) framework. *gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record. -#### Annotation resources included in _gvanno_ - 1.3.0 +#### Annotation resources included in _gvanno_ - 1.3.2 -* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v100.0 (GENCODE v33/v19 as the gene reference dataset) -* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.0, May 2019) +* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v100.2 (GENCODE v34/v19 as the gene reference dataset) +* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.1, June 2020) * [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP * [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 153) - from VEP * [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP -* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (April 2020) -* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v6.0, January 2019) -* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_04, April 2020) -* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_02, April 2020) -* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v32, Sept 2018) -* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (May 3rd 2020) +* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (June 2020) +* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v7.0, May 2020) +* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_06, June 2020) +* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_03, June 2020) +* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v33.1, May 2020) +* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (June 13th 2020) ### News +* June 30th 2020 - **1.3.2 release** + * Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform, Pfam, dbNSFP) + * Using GENCODE v34 as the correct transcript assembly for grch38 (see [issue](https://github.com/Ensembl/ensembl-vep/issues/749)) + * Three new variant effect predictions from dbNSFP added: [ClinPred](https://doi.org/10.1016/j.ajhg.2018.08.005), [LIST-S2](https://doi.org/10.1093/nar/gkaa288), and [BayesDel](https://doi.org/10.1002/humu.23158) + * Added VEP plugin NearestExonJB + * Annotates relative position (to the exon-intron junction) of variants in introns and exons * May 8th 2020 - **1.3.0 release** * Upgrade of VEP (v100) - GENCODE release 33 (grch38) * Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform) @@ -55,29 +61,28 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha - CPUs: minimum 4 - [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced) -##### STEP 1.1: Installation of Singularity (optional) -0. NB, this has only been tested with Singularity version 2.4.2, your mileage may vary with other versions. +##### 1.1: Installation of Singularity (optional) + +0. **Note: this has only been tested with Singularity version 2.4.2, your mileage may vary with other versions**. 1. [Install Singularity](https://sylabs.io/docs/) 2. Test that singularity works by running `singularity --version` 3. If you are in the gvanno directory, build the singularity image like so: `cd src` - Now you can build the singularity image with: - `sudo ./buildSingularity.sh` #### STEP 2: Download *gvanno* and data bundle -1. Download and unpack the [latest software release (1.3.0)](https://github.com/sigven/gvanno/releases/tag/v1.3.0) +1. Download and unpack the [latest software release (1.3.2)](https://github.com/sigven/gvanno/releases/tag/v1.3.2) 2. Download and unpack the assembly-specific data bundle in the gvanno directory - * [grch37 data bundle](https://drive.google.com/file/d/1TfDVsWw47t-1mG9bSsewOMC3vkGTssEy) (approx 16Gb) - * [grch38 data bundle](https://drive.google.com/file/d/1K93S7cTVKfIspf2_zpL7KSOGuzk6tlfs) (approx 17Gb) + * [grch37 data bundle](https://drive.google.com/file/d/1XJT8sSngl5T3HHQK2CZtZuwXX3rouEYg/) (approx 16Gb) + * [grch38 data bundle](https://drive.google.com/file/d/1M6gioFzvt6XOqRDTx4UXYD5sIVOH55IY) (approx 17Gb) * *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -` A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced -3. Pull the [gvanno Docker image (1.3.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2Gb): - * `docker pull sigven/gvanno:1.3.0` (gvanno annotation engine) +3. Pull the [gvanno Docker image (1.3.2)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 1.9Gb): + * `docker pull sigven/gvanno:1.3.2` (gvanno annotation engine) #### STEP 3: Input preprocessing @@ -105,7 +110,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt positional arguments: query_vcf VCF input file with germline query variants (SNVs/InDels) - gvanno_dir gvanno base directory with accompanying data directory, e.g. ~/gvanno-1.1.0 + gvanno_dir gvanno base directory with accompanying data directory, e.g. ~/gvanno-1.3.2 output_dir Output directory {grch37,grch38} grch37 or grch38 configuration_file gvanno configuration file (TOML format) @@ -121,8 +126,8 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command: -`python ~/gvanno-1.3.0/gvanno.py ~/gvanno-1.3.0/examples/example.grch37.vcf.gz --container docker` -` ~/gvanno-1.3.0 ~/gvanno-1.3.0/examples grch37 ~/gvanno-1.3.0/gvanno.toml example` +`python ~/gvanno-1.3.2/gvanno.py ~/gvanno-1.3.2/examples/example.grch37.vcf.gz --container docker` +` ~/gvanno-1.3.2 ~/gvanno-1.3.2/examples grch37 ~/gvanno-1.3.2/gvanno.toml example` This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder: diff --git a/gvanno.py b/gvanno.py index 5069d2e..2c35fec 100755 --- a/gvanno.py +++ b/gvanno.py @@ -12,10 +12,8 @@ import toml from argparse import RawTextHelpFormatter - - -gvanno_version = '1.3.1' -db_version = 'GVANNO_DB_VERSION = 20200514' +gvanno_version = '1.3.2' +db_version = 'GVANNO_DB_VERSION = 20200629' vep_version = '100' global vep_assembly @@ -377,14 +375,16 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id fasta_assembly = os.path.join(vep_dir, "homo_sapiens", str(vep_version) + "_" + str(vep_assembly), "Homo_sapiens." + str(vep_assembly) + ".dna.primary_assembly.fa.gz") ancestor_assembly = os.path.join(vep_dir, "homo_sapiens", str(vep_version) + "_" + str(vep_assembly), "human_ancestor.fa.gz") loftee_dir = '/opt/vep/src/ensembl-vep/modules' + plugins_in_use = "NearestExonJB" vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --domains --symbol --protein --ccds " + \ - "--uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq" + "--uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq --plugin NearestExonJB,max_range=50000" vep_options = "--vcf --quiet --check_ref --flag_pick_allele --pick_order " + str(config_options['other']['vep_pick_order']) + \ " --force_overwrite --species homo_sapiens --assembly " + str(vep_assembly) + " --offline --fork " + \ str(config_options['other']['n_vep_forks']) + " " + str(vep_flags) + " --dir /usr/local/share/vep/data" if config_options['other']['vep_skip_intergenic'] == 1: vep_options = vep_options + " --no_intergenic" if config_options['other']['lof_prediction'] == 1: + plugins_in_use = plugins_in_use + ", LoF" vep_options += " --plugin LoF,loftee_path:" + loftee_dir + ",human_ancestor_fa:" + str(ancestor_assembly) + ",use_gerp_end_trunc:0 --dir_plugins " + loftee_dir vep_main_command = str(container_command_run1) + "vep --input_file " + str(input_vcf_gvanno_ready) + " --output_file " + str(vep_vcf) + " " + str(vep_options) + " --fasta " + str(fasta_assembly) + docker_command_run_end vep_bgzip_command = container_command_run1 + "bgzip -f -c " + str(vep_vcf) + " > " + str(vep_vcf) + ".gz" + docker_command_run_end @@ -392,10 +392,12 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id logger = getlogger('gvanno-vep') print() - if config_options['other']['lof_prediction'] == 1: - logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ") including loss-of-function prediction") - else: - logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ")") + logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ")") + logger.info("VEP configuration - one primary consequence block pr. alternative allele (--flack_pick_allele)") + logger.info("VEP configuration - transcript pick order: " + str(config_options['other']['vep_pick_order'])) + logger.info("VEP configuration - transcript pick order: See more at https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_options") + logger.info("VEP configuration - skip intergenic: " + str(config_options['other']['vep_skip_intergenic'])) + logger.info("VEP configuration - plugins in use: " + str(plugins_in_use)) check_subprocess(vep_main_command) check_subprocess(vep_bgzip_command) check_subprocess(vep_tabix_command) diff --git a/src/gvanno.tgz b/src/gvanno.tgz index 2686549..6c54935 100644 Binary files a/src/gvanno.tgz and b/src/gvanno.tgz differ diff --git a/src/gvanno/gvanno_vcfanno.py b/src/gvanno/gvanno_vcfanno.py index 44a0ecf..1066537 100755 --- a/src/gvanno/gvanno_vcfanno.py +++ b/src/gvanno/gvanno_vcfanno.py @@ -44,8 +44,8 @@ def run_vcfanno(num_processes, query_vcf, query_info_tags, vcfheader_file, gvann """ Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files """ - clinvar_info_tags = ["CLINVAR_MSID","CLINVAR_PMID","CLINVAR_CLNSIG","CLINVAR_VARIANT_ORIGIN","CLINVAR_CONFLICTED","CLINVAR_MEDGEN_CUI", - "CLINVAR_MEDGEN_CUI_SOMATIC","CLINVAR_CLNSIG_SOMATIC","CLINVAR_PMID_SOMATIC","CLINVAR_ALLELE_ID","CLINVAR_HGVSP", + clinvar_info_tags = ["CLINVAR_MSID","CLINVAR_PMID","CLINVAR_CLNSIG","CLINVAR_VARIANT_ORIGIN","CLINVAR_CONFLICTED","CLINVAR_UMLS_CUI","CLINVAR_HGVSP", + "CLINVAR_UMLS_CUI_SOMATIC","CLINVAR_CLNSIG_SOMATIC","CLINVAR_PMID_SOMATIC","CLINVAR_ALLELE_ID","CLINVAR_MOLECULAR_EFFECT", "CLINVAR_REVIEW_STATUS_STARS"] dbnsfp_info_tags = ["DBNSFP"] uniprot_info_tags = ["UNIPROT_FEATURE"] @@ -89,7 +89,8 @@ def append_to_vcf_header(gvanno_db_directory, datasource, vcfheader_file): def append_to_conf_file(datasource, datasource_info_tags, gvanno_db_directory, conf_fname): """ - Function that appends data to a vcfanno conf file ('conf_fname') according to user-defined ('datasource'). The datasource defines the set of tags that will be appended during annotation + Function that appends data to a vcfanno conf file ('conf_fname') according to user-defined ('datasource'). + The datasource defines the set of tags that will be appended during annotation """ fh = open(conf_fname,'a') if datasource != 'uniprot' and datasource != 'gvanno_xref': diff --git a/src/gvanno/lib/annoutils.py b/src/gvanno/lib/annoutils.py index f6f9e2b..94113dc 100755 --- a/src/gvanno/lib/annoutils.py +++ b/src/gvanno/lib/annoutils.py @@ -45,6 +45,15 @@ def check_subprocess(command): print (e.output.decode()) exit(0) + +def is_integer(n): + try: + float(n) + except ValueError: + return False + else: + return float(n).is_integer() + def error_message(message, logger): logger.error('') logger.error(message) @@ -257,10 +266,6 @@ def read_config_options(configuration_file, base_dir, genome_assembly, logger, w if isinstance(config_options[section][var],str) and not isinstance(user_options[section][var],str): err_msg = 'Configuration value ' + str(user_options[section][var]) + ' for ' + str(var) + ' cannot be parsed properly (expecting string)' error_message(err_msg, logger) - # if section == 'tumor_type' and var == 'type': - # if not str(user_options[section][var]) in valid_tumor_types: - # err_msg('Configuration value for tumor type (' + str(user_options[section][var]) + ') is not a valid type') - # error_message(err_msg, logger) normalization_options = ['default','exome','genome','exome2genome'] populations_tgp = ['afr','amr','eas','sas','eur','global'] populations_gnomad = ['afr','amr','eas','sas','nfe','oth','fin','asj','global'] @@ -343,7 +348,7 @@ def threeToOneAA(aa_change): return aa_change def map_variant_effect_predictors(rec, algorithms): - + dbnsfp_predictions = map_dbnsfp_predictions(str(rec.INFO.get('DBNSFP')), algorithms) if rec.INFO.get('Gene') is None or rec.INFO.get('Consequence') is None: return @@ -390,10 +395,17 @@ def map_variant_effect_predictors(rec, algorithms): rec.INFO['DEOGEN2_DBNSFP'] = str(algo_pred.split(':')[1]) if algo_pred.startswith('primateai:'): rec.INFO['PRIMATEAI_DBNSFP'] = str(algo_pred.split(':')[1]) + if algo_pred.startswith('list_s2:'): + rec.INFO['LIST_S2_DBNSFP'] = str(algo_pred.split(':')[1]) + if algo_pred.startswith('bayesdel_addaf:'): + rec.INFO['BAYESDEL_ADDAF_DBNSFP'] = str(algo_pred.split(':')[1]) + if algo_pred.startswith('clinpred:'): + rec.INFO['CLINPRED_DBNSFP'] = str(algo_pred.split(':')[1]) if algo_pred.startswith('splice_site_rf:'): rec.INFO['SPLICE_SITE_RF_DBNSFP'] = str(algo_pred.split(':')[1]) if algo_pred.startswith('splice_site_ada:'): rec.INFO['SPLICE_SITE_ADA_DBNSFP'] = str(algo_pred.split(':')[1]) + def detect_reserved_info_tag(tag, tag_name, logger): @@ -413,6 +425,8 @@ def assign_cds_exon_intron_annotations(csq_record): csq_record['EXONIC_STATUS'] = 'nonexonic' csq_record['SPLICE_DONOR_RELEVANT'] = False csq_record['NULL_VARIANT'] = False + csq_record['INTRON_POSITION'] = 0 + csq_record['EXON_POSITION'] = 0 coding_csq_pattern = r"^(stop_|start_lost|frameshift_|missense_|splice_donor|splice_acceptor|protein_altering|inframe_)" wes_csq_pattern = r"^(stop_|start_lost|frameshift_|missense_|splice_donor|splice_acceptor|inframe_|protein_altering|synonymous)" @@ -429,6 +443,22 @@ def assign_cds_exon_intron_annotations(csq_record): if re.match(r"^splice_region_variant", str(csq_record['Consequence'])) and re.search(r'(\+3(A|G)>|\+4G>|\+5G>)', str(csq_record['HGVSc'])): csq_record['SPLICE_DONOR_RELEVANT'] = True + if re.match(r"splice_region_variant|intron_variant", str(csq_record['Consequence'])): + match = re.search(r"((-|\+)[0-9]{1,}(dup|del|inv|((ins|del|dup|inv|delins)(A|G|C|T){1,})|(A|C|T|G){1,}>(A|G|C|T){1,}))$", str(csq_record['HGVSc'])) + if match is not None: + pos = re.sub(r"(\+|dup|del|delins|ins|inv|(A|G|C|T){1,}|>)","",match.group(0)) + if is_integer(pos): + csq_record['INTRON_POSITION'] = int(pos) + + if re.match(r"synonymous_|missense_|stop_|inframe_|start_", str(csq_record['Consequence'])) and str(csq_record['NearestExonJB']) != "": + exon_pos_info = csq_record['NearestExonJB'].split("+") + if len(exon_pos_info) == 4: + if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "end": + csq_record['EXON_POSITION'] = -int(exon_pos_info[1]) + if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "start": + csq_record['EXON_POSITION'] = int(exon_pos_info[1]) + + for m in ['HGVSp_short','CDS_CHANGE']: csq_record[m] = '.' if not csq_record['HGVSc'] is None: diff --git a/test_examples_docker.py b/test_examples_docker.py new file mode 100755 index 0000000..106a046 --- /dev/null +++ b/test_examples_docker.py @@ -0,0 +1,13 @@ +#!/usr/bin/env python + +import os + +wd = os.getcwd() + +cmd_grch37 = "python gvanno.py " + os.path.join(wd,"examples","example.grch37.vcf.gz") + " " + wd + " " + wd + " grch37 gvanno.toml example --container docker --no_vcf_validate --force_overwrite" +cmd_grch38 = "python gvanno.py " + os.path.join(wd,"examples","example.grch38.vcf.gz") + " " + wd + " " + wd + " grch38 gvanno.toml example --container docker --no_vcf_validate --force_overwrite" + +os.system(cmd_grch37) +os.system(cmd_grch38) + +