0.8.0 update

sigven · Mar 21, 2019 · 8b5e122 · 8b5e122
1 parent b5578f5
commit 8b5e122
Show file tree

Hide file tree

Showing 12 changed files with 240 additions and 164 deletions.
diff --git a/README.md b/README.md
@@ -6,20 +6,23 @@ The germline variant annotator (*gvanno*) is a simple, Docker-based software pac
 
 *gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record.
 
-#### Annotation resources included in _gvanno_ - 0.7.0
+#### Annotation resources included in _gvanno_ - 0.8.0
 
 * [VEP v95](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor (GENCODE v29/v19 as the gene reference dataset)
-* [dBNSFP v4.0](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (December 2018)
+* [dBNSFP v4.0](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (February 2019)
 * [gnomAD r2](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (February 2017) - from VEP
 * [dbSNP build 151](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (October 2017) - from VEP
 * [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
-* [ClinVar 20190103](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (January 2019)
+* [ClinVar 20190305](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (March 2019)
 * [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v6.0, January 2019)
-* [UniProt/SwissProt KnowledgeBase 2019_01](http://www.uniprot.org) - Resource on protein sequence and functional information (January 2019)
+* [UniProt/SwissProt KnowledgeBase 2019_02](http://www.uniprot.org) - Resource on protein sequence and functional information (February 2019)
 * [Pfam v32](http://pfam.xfam.org) - Database of protein families and domains (Sept 2018)
-* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (November 19th 2018)
+* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (March 13th 2019)
 
 ### News
+* March 21st 2019 - **0.8.0 release**
+     * Data bundle updates: ClinVar, UniProt, GWAS catalog
+     * Bundle bug: Missing VEP FASTA file for grch38
 * February 4th 2019 - **0.7.0 release**
      * Docker image update - VEP v95
      * Data bundle updates: ClinVar, DisGenet, dbNSFP and UniProt
@@ -70,15 +73,15 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha
 
 #### STEP 2: Download *gvanno* and data bundle
 
-1. Download and unpack the [latest software release (0.7.0)](https://github.com/sigven/gvanno/releases/tag/v0.7.0)
+1. Download and unpack the [latest software release (0.8.0)](https://github.com/sigven/gvanno/releases/tag/v0.8.0)
 2. Download and unpack the assembly-specific data bundle in the gvanno directory
-   * [grch37 data bundle](https://drive.google.com/file/d/1ct7bi-d91tB-QZUKkQhp4oHXJpkBJTc6) (approx 14Gb)
-   * [grch38 data bundle](https://drive.google.com/file/d/1C7fAvnrnv_GMwdPHJKK7V0WJfBrrchFr) (approx 14Gb)
+   * [grch37 data bundle](https://drive.google.com/file/d/1cJRaSD_UgeG34CnE3PHj3vxXSAAMN9Jl) (approx 14Gb)
+   * [grch38 data bundle](https://drive.google.com/file/d/1uZw5iEibKJV_9SmCusHcpzKBVzTu2pcH) (approx 15Gb)
    * *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
 
     A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
-3. Pull the [gvanno Docker image (0.7.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2Gb):
-   * `docker pull sigven/gvanno:0.7.0` (gvanno annotation engine)
+3. Pull the [gvanno Docker image (0.8.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2Gb):
+   * `docker pull sigven/gvanno:0.8.0` (gvanno annotation engine)
 
 #### STEP 3: Input preprocessing
 
@@ -100,26 +103,24 @@ A few elements of the workflow can be figured using the *gvanno* configuration f
 
 Run the workflow with **gvanno.py**, which takes the following arguments and options:
 
-	usage: gvanno.py [-h] [--input_vcf INPUT_VCF] [--force_overwrite] [--version]
-			  gvanno_dir output_dir {grch37,grch38} configuration_file
+	usage: gvanno.py [-h] [--force_overwrite] [--version]
+			  query_vcf gvanno_dir output_dir {grch37,grch38} configuration_file
 			  sample_id
 
 	Germline variant annotation (gvanno) workflow for clinical and functional
 	interpretation of germline nucleotide variants
 
 	positional arguments:
+	query_vcf			VCF input file with germline variants (SNVs/InDels)
 	gvanno_dir            gvanno base directory with accompanying data
-				    directory, e.g. ~/gvanno-0.7.0
+				    directory, e.g. ~/gvanno-0.8.0
 	output_dir            Output directory
 	{grch37,grch38}       grch37 or grch38
 	configuration_file    gvanno configuration file (TOML format)
 	sample_id             Sample identifier - prefix for output files
 
 	optional arguments:
 	-h, --help            show this help message and exit
-	--input_vcf INPUT_VCF
-				    VCF input file with somatic query variants
-				    (SNVs/InDels) (default: None)
 	--force_overwrite     The script will fail with an error if the output file
 				    already exists. Force the overwrite of existing result
 				    files by using this flag (default: False)
@@ -128,8 +129,8 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
 
 The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
 
-`python ~/gvanno-0.7.0/gvanno.py --input_vcf ~/gvanno-0.7.0/examples/example.vcf.gz`
-` ~/gvanno-0.7.0 ~/gvanno-0.7.0/examples grch37 ~/gvanno-0.7.0/gvanno.toml example`
+`python ~/gvanno-0.8.0/gvanno.py  ~/gvanno-0.8.0/examples/example_grch37.vcf.gz`
+` ~/gvanno-0.8.0 ~/gvanno-0.8.0/examples grch37 ~/gvanno-0.8.0/gvanno.toml example`
 
 
 This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder:

diff --git a/examples/example.vcf.gz → examples/example.grch37.vcf.gz b/examples/example.vcf.gz → examples/example.grch37.vcf.gz
diff --git a/examples/example.vcf.gz.tbi → examples/example.grch37.vcf.gz.tbi b/examples/example.vcf.gz.tbi → examples/example.grch37.vcf.gz.tbi
diff --git a/examples/example.grch38.vcf.gz b/examples/example.grch38.vcf.gz
diff --git a/examples/example.grch38.vcf.gz.tbi b/examples/example.grch38.vcf.gz.tbi
diff --git a/gvanno.py b/gvanno.py
@@ -12,25 +12,24 @@
 import toml
 
 
-gvanno_version = '0.7.0'
-db_version = 'GVANNO_DB_VERSION = 20190204'
+gvanno_version = '0.8.0'
+db_version = 'GVANNO_DB_VERSION = 20190320'
 vep_version = '95'
 global vep_assembly
 
 def __main__():
 
    parser = argparse.ArgumentParser(description='Germline variant annotation (gvanno) workflow for clinical and functional interpretation of germline nucleotide variants',formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-   parser.add_argument('--input_vcf', dest = "input_vcf", help='VCF input file with somatic query variants (SNVs/InDels)')
    parser.add_argument('--force_overwrite', action = "store_true", help='The script will fail with an error if the output file already exists. Force the overwrite of existing result files by using this flag')
    parser.add_argument('--version', action='version', version='%(prog)s ' + str(gvanno_version))
-   parser.add_argument('gvanno_dir',help='gvanno base directory with accompanying data directory, e.g. ~/gvanno-0.2.0')
+   parser.add_argument('query_vcf', help='VCF input file with germline query variants (SNVs/InDels)')
+   parser.add_argument('gvanno_dir',help='gvanno base directory with accompanying data directory, e.g. ~/gvanno-0.8.0')
    parser.add_argument('output_dir',help='Output directory')
    parser.add_argument('genome_assembly',choices = ['grch37','grch38'], help='grch37 or grch38')
    parser.add_argument('configuration_file',help='gvanno configuration file (TOML format)')
    parser.add_argument('sample_id',help="Sample identifier - prefix for output files")
 
    docker_image_version = 'sigven/gvanno:' + str(gvanno_version)
-   #docker_image_version = 'sigven/vep95:' + str(gvanno_version)
    args = parser.parse_args()
 
    overwrite = 0
@@ -52,7 +51,7 @@ def __main__():
    else:
       err_msg = "gvanno configuration file " + str(args.configuration_file) + " does not exist - exiting"
       gvanno_error_message(err_msg,logger)
-   host_directories = verify_input_files(args.input_vcf, args.configuration_file, config_options, args.gvanno_dir, args.output_dir, args.sample_id, args.genome_assembly, overwrite, logger)
+   host_directories = verify_input_files(args.query_vcf, args.configuration_file, config_options, args.gvanno_dir, args.output_dir, args.sample_id, args.genome_assembly, overwrite, logger)
 
    run_gvanno(host_directories, docker_image_version, config_options, args.sample_id, args.genome_assembly, gvanno_version)
 
@@ -314,8 +313,8 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
       vep_vcfanno_annotated_pass_vcf = re.sub(r'\.vcfanno','.vcfanno.annotated.pass',vep_vcfanno_vcf) + '.gz'
 
       fasta_assembly = os.path.join(vep_dir, "homo_sapiens", str(vep_version) + "_" + str(vep_assembly), "Homo_sapiens." + str(vep_assembly) + ".dna.primary_assembly.fa.gz")
-      pick_order = "biotype,canonical,appris,tsl,ccds,rank,length"
-      vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --regulatory --domains --symbol --protein --ccds --uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq"
+      pick_order = "canonical,appris,tsl,biotype,ccds,rank,length"
+      vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --domains --symbol --protein --ccds --uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq"
       vep_options = "--vcf --quiet --check_ref --flag_pick_allele --pick_order " + pick_order + " --force_overwrite --species homo_sapiens --assembly " + str(vep_assembly) + " --offline --fork " + str(config_options['other']['n_vep_forks']) + " " + str(vep_flags) + " --dir /usr/local/share/vep/data"
       if config_options['other']['vep_skip_intergenic'] == 1:
          vep_options = vep_options + " --no_intergenic"
@@ -372,7 +371,7 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
       gvanno_vcf2tsv_command_all = str(docker_command_run2) + "vcf2tsv.py " + str(output_vcf) + " --compress --keep_rejected " + str(output_tsv) + docker_command_run_end
       logger.info("Conversion of VCF variant data to records of tab-separated values - PASS variants only")
       check_subprocess(gvanno_vcf2tsv_command_pass)
-      logger.info("Conversion of VCF variant data to records of tab-separated values - PASS and non-PASS variants)")
+      logger.info("Conversion of VCF variant data to records of tab-separated values - PASS and non-PASS variants")
       check_subprocess(gvanno_vcf2tsv_command_all)
       logger.info("Finished")
 

diff --git a/src/Dockerfile b/src/Dockerfile
@@ -1,7 +1,6 @@
 ###################################################
 # Stage 1 - docker container to build ensembl-vep #
 ###################################################
-#FROM ensemblorg/ensembl-vep:latest
 
 FROM ubuntu:16.04 as builder
 

diff --git a/src/gvanno.tgz b/src/gvanno.tgz
diff --git a/src/gvanno/gvanno_summarise.py b/src/gvanno/gvanno_summarise.py
@@ -36,10 +36,8 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
    out_vcf = re.sub(r'\.vcf(\.gz){0,}$','.annotated.vcf',query_vcf)
 
    meta_vep_dbnsfp_info = annoutils.vep_dbnsfp_meta_vcf(query_vcf, vcf_infotags_meta)
-   vep_csq_index2fields = meta_vep_dbnsfp_info['vep_csq_index2fields']
-   vep_csq_fields2index = meta_vep_dbnsfp_info['vep_csq_fields2index']
    dbnsfp_prediction_algorithms = meta_vep_dbnsfp_info['dbnsfp_prediction_algorithms']
-
+   vep_csq_fields_map = meta_vep_dbnsfp_info['vep_csq_fieldmap']
    vcf = VCF(query_vcf)
    for tag in vcf_infotags_meta:
       if lof_prediction == 0:
@@ -52,10 +50,18 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
    w = Writer(out_vcf, vcf)
    current_chrom = None
    num_chromosome_records_processed = 0
-   gvanno_xref_map = {'ENSEMBL_TRANSCRIPT_ID':0, 'ENSEMBL_GENE_ID':1, 'SYMBOL':2, 'ENTREZ_ID':3, 'UNIPROT_ID':4, 'APPRIS':5,'UNIPROT_ACC':6,
-                        'REFSEQ_MRNA':7, 'CORUM_ID':8,'TUMOR_SUPPRESSOR':9,'ONCOGENE':10,'DISGENET_CUI':11,'MIM_PHENOTYPE_ID':12}
+   gvanno_xref_map = {'ENSEMBL_TRANSCRIPT_ID':0, 'ENSEMBL_GENE_ID':1, 'ENSEMBL_PROTEIN_ID':2, 'SYMBOL':3, 'ENTREZ_ID':4, 'UNIPROT_ID':5, 'APPRIS':6,'UNIPROT_ACC':7,
+                        'REFSEQ_MRNA':8, 'CORUM_ID':9,'TUMOR_SUPPRESSOR':10,'ONCOGENE':11,'DISGENET_CUI':12,'MIM_PHENOTYPE_ID':13}
+
+   vcf_info_element_types = {}
+   for e in vcf.header_iter():
+      header_element = e.info()
+      if 'ID' in header_element and 'HeaderType' in header_element and 'Type' in header_element:
+         identifier = str(header_element['ID'])
+         fieldtype = str(header_element['Type'])
+         vcf_info_element_types[identifier] = fieldtype
+
    for rec in vcf:
-      all_transcript_consequences = []
       if current_chrom is None:
          current_chrom = str(rec.CHROM)
          num_chromosome_records_processed = 0
@@ -70,73 +76,29 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
          variant_id = 'g.' + str(rec.CHROM) + ':' + str(pos) + str(rec.REF) + '>' + alt_allele
          logger.warning('Variant record ' + str(variant_id) + ' does not have CSQ tag from Variant Effect Predictor (vep_skip_intergenic in config set to true?)  - variant will be skipped')
          continue
-      gvanno_xref = {}
       num_chromosome_records_processed += 1
-      if not rec.INFO.get('GVANNO_XREF') is None:
-         for transcript_xref in rec.INFO.get('GVANNO_XREF').split(','):
-            xrefs = transcript_xref.split('|')
-            ensembl_transcript_id = str(xrefs[0])
-            gvanno_xref[ensembl_transcript_id] = {}
-            for annotation in gvanno_xref_map.keys():
-               annotation_index = gvanno_xref_map[annotation]
-               if annotation_index > (len(xrefs) - 1):
-                  continue
-               if xrefs[annotation_index] != '':
-                  gvanno_xref[ensembl_transcript_id][annotation] = xrefs[annotation_index]
+      gvanno_xref = annoutils.make_transcript_xref_map(rec, gvanno_xref_map, xref_tag = "GVANNO_XREF")
+
       for identifier in ['CSQ','DBNSFP']:
          if identifier == 'CSQ':
-            num_picks = 0
-            for csq in rec.INFO.get(identifier).split(','):
-               csq_fields =  csq.split('|')
-               if csq_fields[vep_csq_fields2index['PICK']] == "1": ## only consider the primary/picked consequence when expanding with annotation tags
-                  num_picks += 1
-                  j = 0
-                  ## loop over all CSQ elements and set them in the vep_info_tags dictionary (for each alt_allele)
-                  while(j < len(csq_fields)):
-                     if j in vep_csq_index2fields:
-                        if csq_fields[j] != '':
-                           rec.INFO[vep_csq_index2fields[j]] = str(csq_fields[j])
-                           if vep_csq_index2fields[j] == 'Feature':
-                              ensembl_transcript_id = str(csq_fields[j])
-                              if ensembl_transcript_id in gvanno_xref:
-                                 for annotation in gvanno_xref_map.keys():
-                                    if annotation in gvanno_xref[ensembl_transcript_id]:
-                                       if annotation == 'TUMOR_SUPPRESSOR' or annotation == 'ONCOGENE':
-                                          rec.INFO[annotation] = True
-                                       else:
-                                          rec.INFO[annotation] = gvanno_xref[ensembl_transcript_id][annotation]
-                           if vep_csq_index2fields[j] == 'DOMAINS':
-                              domain_identifiers = str(csq_fields[j]).split('&')
-                              for v in domain_identifiers:
-                                 if v.startswith('Pfam_domain'):
-                                    rec.INFO['PFAM_DOMAIN'] = str(re.sub(r'\.[0-9]{1,}$','',re.sub(r'Pfam_domain:','',v)))
-
-                           if vep_csq_index2fields[j] == 'Existing_variation':
-                              var_identifiers = str(csq_fields[j]).split('&')
-                              cosmic_identifiers = []
-                              dbsnp_identifiers = []
-                              for v in var_identifiers:
-                                 if v.startswith('COSM'):
-                                    cosmic_identifiers.append(v)
-                                 if v.startswith('rs'):
-                                    dbsnp_identifiers.append(v)
-                              if len(cosmic_identifiers) > 0:
-                                 rec.INFO['COSMIC_MUTATION_ID'] = '&'.join(cosmic_identifiers)
-                              if len(dbsnp_identifiers) > 0:
-                                 rec.INFO['DBSNPRSID'] = '&'.join(dbsnp_identifiers)
-
-                     j = j + 1
-                  annoutils.set_coding_change(rec)
-               symbol = '.'
-               if csq_fields[vep_csq_fields2index['SYMBOL']] != "":
-                  symbol = str(csq_fields[vep_csq_fields2index['SYMBOL']])
-               consequence_entry = str(csq_fields[vep_csq_fields2index['Consequence']]) + ':' + str(symbol) + ':' + str(csq_fields[vep_csq_fields2index['Feature_type']]) + ':' + str(csq_fields[vep_csq_fields2index['Feature']]) + ':' + str(csq_fields[vep_csq_fields2index['BIOTYPE']])
-               all_transcript_consequences.append(consequence_entry)
-
+            csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')
          if identifier == 'DBNSFP':
             if not rec.INFO.get('DBNSFP') is None:
                annoutils.map_variant_effect_predictors(rec, dbnsfp_prediction_algorithms)
-      rec.INFO['VEP_ALL_CONSEQUENCE'] = ','.join(all_transcript_consequences)
+      if 'vep_all_csq' in csq_record_results:
+         rec.INFO['VEP_ALL_CSQ'] = ','.join(csq_record_results['vep_all_csq'])
+      if 'vep_block' in csq_record_results:
+         vep_csq_records = csq_record_results['vep_block']
+         block_idx = 0
+         record = vep_csq_records[block_idx]
+         for k in record:
+            if k in vcf_info_element_types:
+               if vcf_info_element_types[k] == "Flag" and record[k] == "1":
+                  rec.INFO[k] = True
+               else:
+                  if not record[k] is None:
+                     rec.INFO[k] = record[k]
+
       w.write_record(rec)
    w.close()
    logger.info('Completed summary of functional annotations for ' + str(num_chromosome_records_processed) + ' variants on chromosome ' + str(current_chrom))

diff --git a/src/gvanno/lib/__pycache__/annoutils.cpython-36.pyc b/src/gvanno/lib/__pycache__/annoutils.cpython-36.pyc