Skip to content

Commit

Permalink
1.3.2 release
Browse files Browse the repository at this point in the history
  • Loading branch information
sigven committed Jun 30, 2020
1 parent 93b80f0 commit 048ed86
Show file tree
Hide file tree
Showing 6 changed files with 90 additions and 39 deletions.
49 changes: 27 additions & 22 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,25 +2,31 @@

### Overview

The germline variant annotator (*gvanno*) is a simple, Docker-based software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below).
The germline variant annotator (*gvanno*) is a simple, software package intended for analysis and interpretation of human DNA variants of germline origin. Variants and genes are annotated with disease-related and functional associations from a wide range of sources (see below). Technically, the workflow is built with the [Docker](https://www.docker.com) technology, but it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.

*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record.

#### Annotation resources included in _gvanno_ - 1.3.0
#### Annotation resources included in _gvanno_ - 1.3.2

* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v100.0 (GENCODE v33/v19 as the gene reference dataset)
* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.0, May 2019)
* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v100.2 (GENCODE v34/v19 as the gene reference dataset)
* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.1, June 2020)
* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 153) - from VEP
* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (April 2020)
* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v6.0, January 2019)
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_04, April 2020)
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_02, April 2020)
* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v32, Sept 2018)
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (May 3rd 2020)
* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (June 2020)
* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v7.0, May 2020)
* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2020_06, June 2020)
* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2020_03, June 2020)
* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v33.1, May 2020)
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (June 13th 2020)

### News
* June 30th 2020 - **1.3.2 release**
* Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform, Pfam, dbNSFP)
* Using GENCODE v34 as the correct transcript assembly for grch38 (see [issue](https://github.com/Ensembl/ensembl-vep/issues/749))
* Three new variant effect predictions from dbNSFP added: [ClinPred](https://doi.org/10.1016/j.ajhg.2018.08.005), [LIST-S2](https://doi.org/10.1093/nar/gkaa288), and [BayesDel](https://doi.org/10.1002/humu.23158)
* Added VEP plugin NearestExonJB
* Annotates relative position (to the exon-intron junction) of variants in introns and exons
* May 8th 2020 - **1.3.0 release**
* Upgrade of VEP (v100) - GENCODE release 33 (grch38)
* Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
Expand Down Expand Up @@ -55,29 +61,28 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha
- CPUs: minimum 4
- [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced)

##### STEP 1.1: Installation of Singularity (optional)
0. NB, this has only been tested with Singularity version 2.4.2, your mileage may vary with other versions.
##### 1.1: Installation of Singularity (optional)

0. **Note: this has only been tested with Singularity version 2.4.2, your mileage may vary with other versions**.
1. [Install Singularity](https://sylabs.io/docs/)
2. Test that singularity works by running `singularity --version`
3. If you are in the gvanno directory, build the singularity image like so:

`cd src`

Now you can build the singularity image with:

`sudo ./buildSingularity.sh`

#### STEP 2: Download *gvanno* and data bundle

1. Download and unpack the [latest software release (1.3.0)](https://github.com/sigven/gvanno/releases/tag/v1.3.0)
1. Download and unpack the [latest software release (1.3.2)](https://github.com/sigven/gvanno/releases/tag/v1.3.2)
2. Download and unpack the assembly-specific data bundle in the gvanno directory
* [grch37 data bundle](https://drive.google.com/file/d/1TfDVsWw47t-1mG9bSsewOMC3vkGTssEy) (approx 16Gb)
* [grch38 data bundle](https://drive.google.com/file/d/1K93S7cTVKfIspf2_zpL7KSOGuzk6tlfs) (approx 17Gb)
* [grch37 data bundle](https://drive.google.com/file/d/1XJT8sSngl5T3HHQK2CZtZuwXX3rouEYg/) (approx 16Gb)
* [grch38 data bundle](https://drive.google.com/file/d/1M6gioFzvt6XOqRDTx4UXYD5sIVOH55IY) (approx 17Gb)
* *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`

A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
3. Pull the [gvanno Docker image (1.3.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2Gb):
* `docker pull sigven/gvanno:1.3.0` (gvanno annotation engine)
3. Pull the [gvanno Docker image (1.3.2)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 1.9Gb):
* `docker pull sigven/gvanno:1.3.2` (gvanno annotation engine)

#### STEP 3: Input preprocessing

Expand Down Expand Up @@ -105,7 +110,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt

positional arguments:
query_vcf VCF input file with germline query variants (SNVs/InDels)
gvanno_dir gvanno base directory with accompanying data directory, e.g. ~/gvanno-1.1.0
gvanno_dir gvanno base directory with accompanying data directory, e.g. ~/gvanno-1.3.2
output_dir Output directory
{grch37,grch38} grch37 or grch38
configuration_file gvanno configuration file (TOML format)
Expand All @@ -121,8 +126,8 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt

The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:

`python ~/gvanno-1.3.0/gvanno.py ~/gvanno-1.3.0/examples/example.grch37.vcf.gz --container docker`
` ~/gvanno-1.3.0 ~/gvanno-1.3.0/examples grch37 ~/gvanno-1.3.0/gvanno.toml example`
`python ~/gvanno-1.3.2/gvanno.py ~/gvanno-1.3.2/examples/example.grch37.vcf.gz --container docker`
` ~/gvanno-1.3.2 ~/gvanno-1.3.2/examples grch37 ~/gvanno-1.3.2/gvanno.toml example`

This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder:

Expand Down
20 changes: 11 additions & 9 deletions gvanno.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,10 +12,8 @@
import toml
from argparse import RawTextHelpFormatter



gvanno_version = '1.3.1'
db_version = 'GVANNO_DB_VERSION = 20200514'
gvanno_version = '1.3.2'
db_version = 'GVANNO_DB_VERSION = 20200629'
vep_version = '100'
global vep_assembly

Expand Down Expand Up @@ -377,25 +375,29 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
fasta_assembly = os.path.join(vep_dir, "homo_sapiens", str(vep_version) + "_" + str(vep_assembly), "Homo_sapiens." + str(vep_assembly) + ".dna.primary_assembly.fa.gz")
ancestor_assembly = os.path.join(vep_dir, "homo_sapiens", str(vep_version) + "_" + str(vep_assembly), "human_ancestor.fa.gz")
loftee_dir = '/opt/vep/src/ensembl-vep/modules'
plugins_in_use = "NearestExonJB"
vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --domains --symbol --protein --ccds " + \
"--uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq"
"--uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq --plugin NearestExonJB,max_range=50000"
vep_options = "--vcf --quiet --check_ref --flag_pick_allele --pick_order " + str(config_options['other']['vep_pick_order']) + \
" --force_overwrite --species homo_sapiens --assembly " + str(vep_assembly) + " --offline --fork " + \
str(config_options['other']['n_vep_forks']) + " " + str(vep_flags) + " --dir /usr/local/share/vep/data"
if config_options['other']['vep_skip_intergenic'] == 1:
vep_options = vep_options + " --no_intergenic"
if config_options['other']['lof_prediction'] == 1:
plugins_in_use = plugins_in_use + ", LoF"
vep_options += " --plugin LoF,loftee_path:" + loftee_dir + ",human_ancestor_fa:" + str(ancestor_assembly) + ",use_gerp_end_trunc:0 --dir_plugins " + loftee_dir
vep_main_command = str(container_command_run1) + "vep --input_file " + str(input_vcf_gvanno_ready) + " --output_file " + str(vep_vcf) + " " + str(vep_options) + " --fasta " + str(fasta_assembly) + docker_command_run_end
vep_bgzip_command = container_command_run1 + "bgzip -f -c " + str(vep_vcf) + " > " + str(vep_vcf) + ".gz" + docker_command_run_end
vep_tabix_command = str(container_command_run1) + "tabix -f -p vcf " + str(vep_vcf) + ".gz" + docker_command_run_end
logger = getlogger('gvanno-vep')

print()
if config_options['other']['lof_prediction'] == 1:
logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ") including loss-of-function prediction")
else:
logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ")")
logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ")")
logger.info("VEP configuration - one primary consequence block pr. alternative allele (--flack_pick_allele)")
logger.info("VEP configuration - transcript pick order: " + str(config_options['other']['vep_pick_order']))
logger.info("VEP configuration - transcript pick order: See more at https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_options")
logger.info("VEP configuration - skip intergenic: " + str(config_options['other']['vep_skip_intergenic']))
logger.info("VEP configuration - plugins in use: " + str(plugins_in_use))
check_subprocess(vep_main_command)
check_subprocess(vep_bgzip_command)
check_subprocess(vep_tabix_command)
Expand Down
Binary file modified src/gvanno.tgz
Binary file not shown.
7 changes: 4 additions & 3 deletions src/gvanno/gvanno_vcfanno.py
Original file line number Diff line number Diff line change
Expand Up @@ -44,8 +44,8 @@ def run_vcfanno(num_processes, query_vcf, query_info_tags, vcfheader_file, gvann
"""
Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files
"""
clinvar_info_tags = ["CLINVAR_MSID","CLINVAR_PMID","CLINVAR_CLNSIG","CLINVAR_VARIANT_ORIGIN","CLINVAR_CONFLICTED","CLINVAR_MEDGEN_CUI",
"CLINVAR_MEDGEN_CUI_SOMATIC","CLINVAR_CLNSIG_SOMATIC","CLINVAR_PMID_SOMATIC","CLINVAR_ALLELE_ID","CLINVAR_HGVSP",
clinvar_info_tags = ["CLINVAR_MSID","CLINVAR_PMID","CLINVAR_CLNSIG","CLINVAR_VARIANT_ORIGIN","CLINVAR_CONFLICTED","CLINVAR_UMLS_CUI","CLINVAR_HGVSP",
"CLINVAR_UMLS_CUI_SOMATIC","CLINVAR_CLNSIG_SOMATIC","CLINVAR_PMID_SOMATIC","CLINVAR_ALLELE_ID","CLINVAR_MOLECULAR_EFFECT",
"CLINVAR_REVIEW_STATUS_STARS"]
dbnsfp_info_tags = ["DBNSFP"]
uniprot_info_tags = ["UNIPROT_FEATURE"]
Expand Down Expand Up @@ -89,7 +89,8 @@ def append_to_vcf_header(gvanno_db_directory, datasource, vcfheader_file):

def append_to_conf_file(datasource, datasource_info_tags, gvanno_db_directory, conf_fname):
"""
Function that appends data to a vcfanno conf file ('conf_fname') according to user-defined ('datasource'). The datasource defines the set of tags that will be appended during annotation
Function that appends data to a vcfanno conf file ('conf_fname') according to user-defined ('datasource').
The datasource defines the set of tags that will be appended during annotation
"""
fh = open(conf_fname,'a')
if datasource != 'uniprot' and datasource != 'gvanno_xref':
Expand Down
40 changes: 35 additions & 5 deletions src/gvanno/lib/annoutils.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,15 @@ def check_subprocess(command):
print (e.output.decode())
exit(0)


def is_integer(n):
try:
float(n)
except ValueError:
return False
else:
return float(n).is_integer()

def error_message(message, logger):
logger.error('')
logger.error(message)
Expand Down Expand Up @@ -257,10 +266,6 @@ def read_config_options(configuration_file, base_dir, genome_assembly, logger, w
if isinstance(config_options[section][var],str) and not isinstance(user_options[section][var],str):
err_msg = 'Configuration value ' + str(user_options[section][var]) + ' for ' + str(var) + ' cannot be parsed properly (expecting string)'
error_message(err_msg, logger)
# if section == 'tumor_type' and var == 'type':
# if not str(user_options[section][var]) in valid_tumor_types:
# err_msg('Configuration value for tumor type (' + str(user_options[section][var]) + ') is not a valid type')
# error_message(err_msg, logger)
normalization_options = ['default','exome','genome','exome2genome']
populations_tgp = ['afr','amr','eas','sas','eur','global']
populations_gnomad = ['afr','amr','eas','sas','nfe','oth','fin','asj','global']
Expand Down Expand Up @@ -343,7 +348,7 @@ def threeToOneAA(aa_change):
return aa_change

def map_variant_effect_predictors(rec, algorithms):

dbnsfp_predictions = map_dbnsfp_predictions(str(rec.INFO.get('DBNSFP')), algorithms)
if rec.INFO.get('Gene') is None or rec.INFO.get('Consequence') is None:
return
Expand Down Expand Up @@ -390,10 +395,17 @@ def map_variant_effect_predictors(rec, algorithms):
rec.INFO['DEOGEN2_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('primateai:'):
rec.INFO['PRIMATEAI_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('list_s2:'):
rec.INFO['LIST_S2_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('bayesdel_addaf:'):
rec.INFO['BAYESDEL_ADDAF_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('clinpred:'):
rec.INFO['CLINPRED_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('splice_site_rf:'):
rec.INFO['SPLICE_SITE_RF_DBNSFP'] = str(algo_pred.split(':')[1])
if algo_pred.startswith('splice_site_ada:'):
rec.INFO['SPLICE_SITE_ADA_DBNSFP'] = str(algo_pred.split(':')[1])



def detect_reserved_info_tag(tag, tag_name, logger):
Expand All @@ -413,6 +425,8 @@ def assign_cds_exon_intron_annotations(csq_record):
csq_record['EXONIC_STATUS'] = 'nonexonic'
csq_record['SPLICE_DONOR_RELEVANT'] = False
csq_record['NULL_VARIANT'] = False
csq_record['INTRON_POSITION'] = 0
csq_record['EXON_POSITION'] = 0

coding_csq_pattern = r"^(stop_|start_lost|frameshift_|missense_|splice_donor|splice_acceptor|protein_altering|inframe_)"
wes_csq_pattern = r"^(stop_|start_lost|frameshift_|missense_|splice_donor|splice_acceptor|inframe_|protein_altering|synonymous)"
Expand All @@ -429,6 +443,22 @@ def assign_cds_exon_intron_annotations(csq_record):
if re.match(r"^splice_region_variant", str(csq_record['Consequence'])) and re.search(r'(\+3(A|G)>|\+4G>|\+5G>)', str(csq_record['HGVSc'])):
csq_record['SPLICE_DONOR_RELEVANT'] = True

if re.match(r"splice_region_variant|intron_variant", str(csq_record['Consequence'])):
match = re.search(r"((-|\+)[0-9]{1,}(dup|del|inv|((ins|del|dup|inv|delins)(A|G|C|T){1,})|(A|C|T|G){1,}>(A|G|C|T){1,}))$", str(csq_record['HGVSc']))
if match is not None:
pos = re.sub(r"(\+|dup|del|delins|ins|inv|(A|G|C|T){1,}|>)","",match.group(0))
if is_integer(pos):
csq_record['INTRON_POSITION'] = int(pos)

if re.match(r"synonymous_|missense_|stop_|inframe_|start_", str(csq_record['Consequence'])) and str(csq_record['NearestExonJB']) != "":
exon_pos_info = csq_record['NearestExonJB'].split("+")
if len(exon_pos_info) == 4:
if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "end":
csq_record['EXON_POSITION'] = -int(exon_pos_info[1])
if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "start":
csq_record['EXON_POSITION'] = int(exon_pos_info[1])


for m in ['HGVSp_short','CDS_CHANGE']:
csq_record[m] = '.'
if not csq_record['HGVSc'] is None:
Expand Down
13 changes: 13 additions & 0 deletions test_examples_docker.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
#!/usr/bin/env python

import os

wd = os.getcwd()

cmd_grch37 = "python gvanno.py " + os.path.join(wd,"examples","example.grch37.vcf.gz") + " " + wd + " " + wd + " grch37 gvanno.toml example --container docker --no_vcf_validate --force_overwrite"
cmd_grch38 = "python gvanno.py " + os.path.join(wd,"examples","example.grch38.vcf.gz") + " " + wd + " " + wd + " grch38 gvanno.toml example --container docker --no_vcf_validate --force_overwrite"

os.system(cmd_grch37)
os.system(cmd_grch38)


0 comments on commit 048ed86

Please sign in to comment.