1.3.2 release

sigven · Jun 30, 2020 · 048ed86 · 048ed86
1 parent 93b80f0
commit 048ed86
Show file tree

Hide file tree

Showing 6 changed files with 90 additions and 39 deletions.
diff --git a/README.md b/README.md
@@ -2,25 +2,31 @@
 
 ### Overview
 
-The germline variant annotator (*gvanno*) is a simple, Docker-based software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below).
+The germline variant annotator (*gvanno*) is a simple, software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with the [Docker](https://www.docker.com) technology, but it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.
 
 *gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record.
 
-#### Annotation resources included in _gvanno_ - 1.3.0
+#### Annotation resources included in _gvanno_ - 1.3.2
 
-* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v100.0 (GENCODE v33/v19 as the gene reference dataset)
-* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.0, May 2019)
+* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v100.2 (GENCODE v34/v19 as the gene reference dataset)
+* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.1, June 2020)
 * [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
 * [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 153) - from VEP
 * [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
-* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (April 2020)
-* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v6.0, January 2019)
-* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_04, April 2020)
-* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_02, April 2020)
-* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v32, Sept 2018)
-* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (May 3rd 2020)
+* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (June 2020)
+* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v7.0, May 2020)
+* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_06, June 2020)
+* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_03, June 2020)
+* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v33.1, May 2020)
+* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (June 13th 2020)
 
 ### News
+* June 30th 2020 - **1.3.2 release**
+     * Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform, Pfam, dbNSFP)
+	     * Using GENCODE v34 as the correct transcript assembly for grch38 (see [issue](https://github.com/Ensembl/ensembl-vep/issues/749))
+		* Three new variant effect predictions from dbNSFP added: [ClinPred](https://doi.org/10.1016/j.ajhg.2018.08.005), [LIST-S2](https://doi.org/10.1093/nar/gkaa288), and [BayesDel](https://doi.org/10.1002/humu.23158)
+	* Added VEP plugin NearestExonJB
+	     * Annotates relative position (to the exon-intron junction) of variants in introns and exons
 * May 8th 2020 - **1.3.0 release**
      * Upgrade of VEP (v100) - GENCODE release 33 (grch38)
 	* Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
@@ -55,29 +61,28 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha
    - CPUs: minimum 4
    - [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced)
 
-##### STEP 1.1: Installation of Singularity (optional)
-0. NB, this has only been tested with Singularity version 2.4.2, your mileage may vary with other versions.
+##### 1.1: Installation of Singularity (optional)
+
+0. **Note: this has only been tested with Singularity version 2.4.2, your mileage may vary with other versions**.
 1. [Install Singularity](https://sylabs.io/docs/)
 2. Test that singularity works by running `singularity --version`
 3. If you are in the gvanno directory, build the singularity image like so:
 
 	`cd src`
 
-  	Now you can build the singularity image with:
-
 	`sudo ./buildSingularity.sh`
 
 #### STEP 2: Download *gvanno* and data bundle
 
-1. Download and unpack the [latest software release (1.3.0)](https://github.com/sigven/gvanno/releases/tag/v1.3.0)
+1. Download and unpack the [latest software release (1.3.2)](https://github.com/sigven/gvanno/releases/tag/v1.3.2)
 2. Download and unpack the assembly-specific data bundle in the gvanno directory
-   * [grch37 data bundle](https://drive.google.com/file/d/1TfDVsWw47t-1mG9bSsewOMC3vkGTssEy) (approx 16Gb)
-   * [grch38 data bundle](https://drive.google.com/file/d/1K93S7cTVKfIspf2_zpL7KSOGuzk6tlfs) (approx 17Gb)
+   * [grch37 data bundle](https://drive.google.com/file/d/1XJT8sSngl5T3HHQK2CZtZuwXX3rouEYg/) (approx 16Gb)
+   * [grch38 data bundle](https://drive.google.com/file/d/1M6gioFzvt6XOqRDTx4UXYD5sIVOH55IY) (approx 17Gb)
    * *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
 
     A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
-3. Pull the [gvanno Docker image (1.3.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2Gb):
-   * `docker pull sigven/gvanno:1.3.0` (gvanno annotation engine)
+3. Pull the [gvanno Docker image (1.3.2)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 1.9Gb):
+   * `docker pull sigven/gvanno:1.3.2` (gvanno annotation engine)
 
 #### STEP 3: Input preprocessing
 
@@ -105,7 +110,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
 
 		positional arguments:
 		query_vcf           VCF input file with germline query variants (SNVs/InDels)
-		gvanno_dir          gvanno base directory with accompanying data directory, e.g. ~/gvanno-1.1.0
+		gvanno_dir          gvanno base directory with accompanying data directory, e.g. ~/gvanno-1.3.2
 		output_dir          Output directory
 		{grch37,grch38}     grch37 or grch38
 		configuration_file  gvanno configuration file (TOML format)
@@ -121,8 +126,8 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
 
 The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
 
-`python ~/gvanno-1.3.0/gvanno.py ~/gvanno-1.3.0/examples/example.grch37.vcf.gz --container docker`
-` ~/gvanno-1.3.0 ~/gvanno-1.3.0/examples grch37 ~/gvanno-1.3.0/gvanno.toml example`
+`python ~/gvanno-1.3.2/gvanno.py ~/gvanno-1.3.2/examples/example.grch37.vcf.gz --container docker`
+` ~/gvanno-1.3.2 ~/gvanno-1.3.2/examples grch37 ~/gvanno-1.3.2/gvanno.toml example`
 
 This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder:
 

diff --git a/gvanno.py b/gvanno.py
@@ -12,10 +12,8 @@
 import toml
 from argparse import RawTextHelpFormatter
 
-
-
-gvanno_version = '1.3.1'
-db_version = 'GVANNO_DB_VERSION = 20200514'
+gvanno_version = '1.3.2'
+db_version = 'GVANNO_DB_VERSION = 20200629'
 vep_version = '100'
 global vep_assembly
 
@@ -377,25 +375,29 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
       fasta_assembly = os.path.join(vep_dir, "homo_sapiens", str(vep_version) + "_" + str(vep_assembly), "Homo_sapiens." + str(vep_assembly) + ".dna.primary_assembly.fa.gz")
       ancestor_assembly = os.path.join(vep_dir, "homo_sapiens", str(vep_version) + "_" + str(vep_assembly), "human_ancestor.fa.gz")
       loftee_dir = '/opt/vep/src/ensembl-vep/modules'
+      plugins_in_use = "NearestExonJB"
       vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --domains --symbol --protein --ccds " + \
-         "--uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq"
+         "--uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq --plugin NearestExonJB,max_range=50000"
       vep_options = "--vcf --quiet --check_ref --flag_pick_allele --pick_order " + str(config_options['other']['vep_pick_order']) + \
          " --force_overwrite --species homo_sapiens --assembly " + str(vep_assembly) + " --offline --fork " + \
          str(config_options['other']['n_vep_forks']) + " " + str(vep_flags) + " --dir /usr/local/share/vep/data"
       if config_options['other']['vep_skip_intergenic'] == 1:
          vep_options = vep_options + " --no_intergenic"
       if config_options['other']['lof_prediction'] == 1:
+         plugins_in_use = plugins_in_use + ", LoF"
          vep_options += " --plugin LoF,loftee_path:" + loftee_dir + ",human_ancestor_fa:" + str(ancestor_assembly)  + ",use_gerp_end_trunc:0 --dir_plugins " + loftee_dir
       vep_main_command = str(container_command_run1) + "vep --input_file " + str(input_vcf_gvanno_ready) + " --output_file " + str(vep_vcf) + " " + str(vep_options) + " --fasta " + str(fasta_assembly) + docker_command_run_end
       vep_bgzip_command = container_command_run1 + "bgzip -f -c " + str(vep_vcf) + " > " + str(vep_vcf) + ".gz" + docker_command_run_end
       vep_tabix_command = str(container_command_run1) + "tabix -f -p vcf " + str(vep_vcf) + ".gz" + docker_command_run_end
       logger = getlogger('gvanno-vep')
 
       print()
-      if config_options['other']['lof_prediction'] == 1:
-         logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ") including loss-of-function prediction")
-      else:
-         logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ")")
+      logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ")")
+      logger.info("VEP configuration - one primary consequence block pr. alternative allele (--flack_pick_allele)")
+      logger.info("VEP configuration - transcript pick order: " + str(config_options['other']['vep_pick_order']))
+      logger.info("VEP configuration - transcript pick order: See more at https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_options")
+      logger.info("VEP configuration - skip intergenic: " + str(config_options['other']['vep_skip_intergenic']))
+      logger.info("VEP configuration - plugins in use: " + str(plugins_in_use))
       check_subprocess(vep_main_command)
       check_subprocess(vep_bgzip_command)
       check_subprocess(vep_tabix_command)

diff --git a/src/gvanno.tgz b/src/gvanno.tgz
diff --git a/src/gvanno/gvanno_vcfanno.py b/src/gvanno/gvanno_vcfanno.py
@@ -44,8 +44,8 @@ def run_vcfanno(num_processes, query_vcf, query_info_tags, vcfheader_file, gvann
    """
    Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files
    """
-   clinvar_info_tags = ["CLINVAR_MSID","CLINVAR_PMID","CLINVAR_CLNSIG","CLINVAR_VARIANT_ORIGIN","CLINVAR_CONFLICTED","CLINVAR_MEDGEN_CUI",
-                        "CLINVAR_MEDGEN_CUI_SOMATIC","CLINVAR_CLNSIG_SOMATIC","CLINVAR_PMID_SOMATIC","CLINVAR_ALLELE_ID","CLINVAR_HGVSP",
+   clinvar_info_tags = ["CLINVAR_MSID","CLINVAR_PMID","CLINVAR_CLNSIG","CLINVAR_VARIANT_ORIGIN","CLINVAR_CONFLICTED","CLINVAR_UMLS_CUI","CLINVAR_HGVSP",
+                        "CLINVAR_UMLS_CUI_SOMATIC","CLINVAR_CLNSIG_SOMATIC","CLINVAR_PMID_SOMATIC","CLINVAR_ALLELE_ID","CLINVAR_MOLECULAR_EFFECT",
                         "CLINVAR_REVIEW_STATUS_STARS"]
    dbnsfp_info_tags = ["DBNSFP"]
    uniprot_info_tags = ["UNIPROT_FEATURE"]
@@ -89,7 +89,8 @@ def append_to_vcf_header(gvanno_db_directory, datasource, vcfheader_file):
 
 def append_to_conf_file(datasource, datasource_info_tags, gvanno_db_directory, conf_fname):
    """
-   Function that appends data to a vcfanno conf file ('conf_fname') according to user-defined ('datasource'). The datasource defines the set of tags that will be appended during annotation
+   Function that appends data to a vcfanno conf file ('conf_fname') according to user-defined ('datasource'). 
+   The datasource defines the set of tags that will be appended during annotation
    """
    fh = open(conf_fname,'a')
    if datasource != 'uniprot' and datasource != 'gvanno_xref':

diff --git a/src/gvanno/lib/annoutils.py b/src/gvanno/lib/annoutils.py
@@ -45,6 +45,15 @@ def check_subprocess(command):
       print (e.output.decode())
       exit(0)
 
+
+def is_integer(n):
+    try:
+        float(n)
+    except ValueError:
+        return False
+    else:
+        return float(n).is_integer()
+
 def error_message(message, logger):
    logger.error('')
    logger.error(message)
@@ -257,10 +266,6 @@ def read_config_options(configuration_file, base_dir, genome_assembly, logger, w
             if isinstance(config_options[section][var],str) and not isinstance(user_options[section][var],str):
                err_msg = 'Configuration value ' + str(user_options[section][var]) + ' for ' + str(var) + ' cannot be parsed properly (expecting string)'
                error_message(err_msg, logger)
-            # if section == 'tumor_type' and var == 'type':
-            #    if not str(user_options[section][var]) in valid_tumor_types:
-            #       err_msg('Configuration value for tumor type (' + str(user_options[section][var]) + ') is not a valid type')
-            #       error_message(err_msg, logger)
             normalization_options = ['default','exome','genome','exome2genome']
             populations_tgp = ['afr','amr','eas','sas','eur','global']
             populations_gnomad = ['afr','amr','eas','sas','nfe','oth','fin','asj','global']
@@ -343,7 +348,7 @@ def threeToOneAA(aa_change):
    return aa_change
 
 def map_variant_effect_predictors(rec, algorithms):
-    
+
    dbnsfp_predictions = map_dbnsfp_predictions(str(rec.INFO.get('DBNSFP')), algorithms)
    if rec.INFO.get('Gene') is None or rec.INFO.get('Consequence') is None:
       return
@@ -390,10 +395,17 @@ def map_variant_effect_predictors(rec, algorithms):
                rec.INFO['DEOGEN2_DBNSFP'] = str(algo_pred.split(':')[1])
             if algo_pred.startswith('primateai:'):
                rec.INFO['PRIMATEAI_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('list_s2:'):
+               rec.INFO['LIST_S2_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('bayesdel_addaf:'):
+               rec.INFO['BAYESDEL_ADDAF_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('clinpred:'):
+               rec.INFO['CLINPRED_DBNSFP'] = str(algo_pred.split(':')[1])
             if algo_pred.startswith('splice_site_rf:'):
                rec.INFO['SPLICE_SITE_RF_DBNSFP'] = str(algo_pred.split(':')[1])
             if algo_pred.startswith('splice_site_ada:'):
                rec.INFO['SPLICE_SITE_ADA_DBNSFP'] = str(algo_pred.split(':')[1])
+
 
 
 def detect_reserved_info_tag(tag, tag_name, logger):
@@ -413,6 +425,8 @@ def assign_cds_exon_intron_annotations(csq_record):
    csq_record['EXONIC_STATUS'] = 'nonexonic'
    csq_record['SPLICE_DONOR_RELEVANT'] = False
    csq_record['NULL_VARIANT'] = False
+   csq_record['INTRON_POSITION'] = 0
+   csq_record['EXON_POSITION'] = 0
 
    coding_csq_pattern = r"^(stop_|start_lost|frameshift_|missense_|splice_donor|splice_acceptor|protein_altering|inframe_)"
    wes_csq_pattern = r"^(stop_|start_lost|frameshift_|missense_|splice_donor|splice_acceptor|inframe_|protein_altering|synonymous)"
@@ -429,6 +443,22 @@ def assign_cds_exon_intron_annotations(csq_record):
    if re.match(r"^splice_region_variant", str(csq_record['Consequence'])) and re.search(r'(\+3(A|G)>|\+4G>|\+5G>)', str(csq_record['HGVSc'])):
       csq_record['SPLICE_DONOR_RELEVANT'] = True
 
+   if re.match(r"splice_region_variant|intron_variant", str(csq_record['Consequence'])):
+      match = re.search(r"((-|\+)[0-9]{1,}(dup|del|inv|((ins|del|dup|inv|delins)(A|G|C|T){1,})|(A|C|T|G){1,}>(A|G|C|T){1,}))$", str(csq_record['HGVSc']))
+      if match is not None:
+         pos = re.sub(r"(\+|dup|del|delins|ins|inv|(A|G|C|T){1,}|>)","",match.group(0))
+         if is_integer(pos):
+            csq_record['INTRON_POSITION'] = int(pos)
+
+   if re.match(r"synonymous_|missense_|stop_|inframe_|start_", str(csq_record['Consequence'])) and str(csq_record['NearestExonJB']) != "":
+      exon_pos_info = csq_record['NearestExonJB'].split("+")
+      if len(exon_pos_info) == 4:
+         if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "end":
+            csq_record['EXON_POSITION'] = -int(exon_pos_info[1])
+         if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "start":
+            csq_record['EXON_POSITION'] = int(exon_pos_info[1])
+
+
    for m in ['HGVSp_short','CDS_CHANGE']:
       csq_record[m] = '.'
    if not csq_record['HGVSc'] is None:

diff --git a/test_examples_docker.py b/test_examples_docker.py
@@ -0,0 +1,13 @@
+#!/usr/bin/env python
+
+import os
+
+wd = os.getcwd()
+
+cmd_grch37 = "python gvanno.py " + os.path.join(wd,"examples","example.grch37.vcf.gz") + " " + wd + " " + wd  + " grch37 gvanno.toml example --container docker --no_vcf_validate --force_overwrite"
+cmd_grch38 = "python gvanno.py " + os.path.join(wd,"examples","example.grch38.vcf.gz") + " " + wd + " " + wd  + " grch38 gvanno.toml example --container docker --no_vcf_validate --force_overwrite"
+
+os.system(cmd_grch37)
+os.system(cmd_grch38)
+
+