Skip to content

Commit

Permalink
0.8.0 update
Browse files Browse the repository at this point in the history
  • Loading branch information
sigven committed Mar 21, 2019
1 parent b5578f5 commit 8b5e122
Show file tree
Hide file tree
Showing 12 changed files with 240 additions and 164 deletions.
37 changes: 19 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,20 +6,23 @@ The germline variant annotator (*gvanno*) is a simple, Docker-based software pac

*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record.

#### Annotation resources included in _gvanno_ - 0.7.0
#### Annotation resources included in _gvanno_ - 0.8.0

* [VEP v95](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor (GENCODE v29/v19 as the gene reference dataset)
* [dBNSFP v4.0](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (December 2018)
* [dBNSFP v4.0](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (February 2019)
* [gnomAD r2](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (February 2017) - from VEP
* [dbSNP build 151](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (October 2017) - from VEP
* [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
* [ClinVar 20190103](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (January 2019)
* [ClinVar 20190305](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (March 2019)
* [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v6.0, January 2019)
* [UniProt/SwissProt KnowledgeBase 2019_01](http://www.uniprot.org) - Resource on protein sequence and functional information (January 2019)
* [UniProt/SwissProt KnowledgeBase 2019_02](http://www.uniprot.org) - Resource on protein sequence and functional information (February 2019)
* [Pfam v32](http://pfam.xfam.org) - Database of protein families and domains (Sept 2018)
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (November 19th 2018)
* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (March 13th 2019)

### News
* March 21st 2019 - **0.8.0 release**
* Data bundle updates: ClinVar, UniProt, GWAS catalog
* Bundle bug: Missing VEP FASTA file for grch38
* February 4th 2019 - **0.7.0 release**
* Docker image update - VEP v95
* Data bundle updates: ClinVar, DisGenet, dbNSFP and UniProt
Expand Down Expand Up @@ -70,15 +73,15 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha

#### STEP 2: Download *gvanno* and data bundle

1. Download and unpack the [latest software release (0.7.0)](https://github.com/sigven/gvanno/releases/tag/v0.7.0)
1. Download and unpack the [latest software release (0.8.0)](https://github.com/sigven/gvanno/releases/tag/v0.8.0)
2. Download and unpack the assembly-specific data bundle in the gvanno directory
* [grch37 data bundle](https://drive.google.com/file/d/1ct7bi-d91tB-QZUKkQhp4oHXJpkBJTc6) (approx 14Gb)
* [grch38 data bundle](https://drive.google.com/file/d/1C7fAvnrnv_GMwdPHJKK7V0WJfBrrchFr) (approx 14Gb)
* [grch37 data bundle](https://drive.google.com/file/d/1cJRaSD_UgeG34CnE3PHj3vxXSAAMN9Jl) (approx 14Gb)
* [grch38 data bundle](https://drive.google.com/file/d/1uZw5iEibKJV_9SmCusHcpzKBVzTu2pcH) (approx 15Gb)
* *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`

A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
3. Pull the [gvanno Docker image (0.7.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2Gb):
* `docker pull sigven/gvanno:0.7.0` (gvanno annotation engine)
3. Pull the [gvanno Docker image (0.8.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2Gb):
* `docker pull sigven/gvanno:0.8.0` (gvanno annotation engine)

#### STEP 3: Input preprocessing

Expand All @@ -100,26 +103,24 @@ A few elements of the workflow can be figured using the *gvanno* configuration f

Run the workflow with **gvanno.py**, which takes the following arguments and options:

usage: gvanno.py [-h] [--input_vcf INPUT_VCF] [--force_overwrite] [--version]
gvanno_dir output_dir {grch37,grch38} configuration_file
usage: gvanno.py [-h] [--force_overwrite] [--version]
query_vcf gvanno_dir output_dir {grch37,grch38} configuration_file
sample_id

Germline variant annotation (gvanno) workflow for clinical and functional
interpretation of germline nucleotide variants

positional arguments:
query_vcf VCF input file with germline variants (SNVs/InDels)
gvanno_dir gvanno base directory with accompanying data
directory, e.g. ~/gvanno-0.7.0
directory, e.g. ~/gvanno-0.8.0
output_dir Output directory
{grch37,grch38} grch37 or grch38
configuration_file gvanno configuration file (TOML format)
sample_id Sample identifier - prefix for output files

optional arguments:
-h, --help show this help message and exit
--input_vcf INPUT_VCF
VCF input file with somatic query variants
(SNVs/InDels) (default: None)
--force_overwrite The script will fail with an error if the output file
already exists. Force the overwrite of existing result
files by using this flag (default: False)
Expand All @@ -128,8 +129,8 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt

The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:

`python ~/gvanno-0.7.0/gvanno.py --input_vcf ~/gvanno-0.7.0/examples/example.vcf.gz`
` ~/gvanno-0.7.0 ~/gvanno-0.7.0/examples grch37 ~/gvanno-0.7.0/gvanno.toml example`
`python ~/gvanno-0.8.0/gvanno.py ~/gvanno-0.8.0/examples/example_grch37.vcf.gz`
` ~/gvanno-0.8.0 ~/gvanno-0.8.0/examples grch37 ~/gvanno-0.8.0/gvanno.toml example`


This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder:
Expand Down
File renamed without changes.
File renamed without changes.
Binary file added examples/example.grch38.vcf.gz
Binary file not shown.
Binary file added examples/example.grch38.vcf.gz.tbi
Binary file not shown.
17 changes: 8 additions & 9 deletions gvanno.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,25 +12,24 @@
import toml


gvanno_version = '0.7.0'
db_version = 'GVANNO_DB_VERSION = 20190204'
gvanno_version = '0.8.0'
db_version = 'GVANNO_DB_VERSION = 20190320'
vep_version = '95'
global vep_assembly

def __main__():

parser = argparse.ArgumentParser(description='Germline variant annotation (gvanno) workflow for clinical and functional interpretation of germline nucleotide variants',formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--input_vcf', dest = "input_vcf", help='VCF input file with somatic query variants (SNVs/InDels)')
parser.add_argument('--force_overwrite', action = "store_true", help='The script will fail with an error if the output file already exists. Force the overwrite of existing result files by using this flag')
parser.add_argument('--version', action='version', version='%(prog)s ' + str(gvanno_version))
parser.add_argument('gvanno_dir',help='gvanno base directory with accompanying data directory, e.g. ~/gvanno-0.2.0')
parser.add_argument('query_vcf', help='VCF input file with germline query variants (SNVs/InDels)')
parser.add_argument('gvanno_dir',help='gvanno base directory with accompanying data directory, e.g. ~/gvanno-0.8.0')
parser.add_argument('output_dir',help='Output directory')
parser.add_argument('genome_assembly',choices = ['grch37','grch38'], help='grch37 or grch38')
parser.add_argument('configuration_file',help='gvanno configuration file (TOML format)')
parser.add_argument('sample_id',help="Sample identifier - prefix for output files")

docker_image_version = 'sigven/gvanno:' + str(gvanno_version)
#docker_image_version = 'sigven/vep95:' + str(gvanno_version)
args = parser.parse_args()

overwrite = 0
Expand All @@ -52,7 +51,7 @@ def __main__():
else:
err_msg = "gvanno configuration file " + str(args.configuration_file) + " does not exist - exiting"
gvanno_error_message(err_msg,logger)
host_directories = verify_input_files(args.input_vcf, args.configuration_file, config_options, args.gvanno_dir, args.output_dir, args.sample_id, args.genome_assembly, overwrite, logger)
host_directories = verify_input_files(args.query_vcf, args.configuration_file, config_options, args.gvanno_dir, args.output_dir, args.sample_id, args.genome_assembly, overwrite, logger)

run_gvanno(host_directories, docker_image_version, config_options, args.sample_id, args.genome_assembly, gvanno_version)

Expand Down Expand Up @@ -314,8 +313,8 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
vep_vcfanno_annotated_pass_vcf = re.sub(r'\.vcfanno','.vcfanno.annotated.pass',vep_vcfanno_vcf) + '.gz'

fasta_assembly = os.path.join(vep_dir, "homo_sapiens", str(vep_version) + "_" + str(vep_assembly), "Homo_sapiens." + str(vep_assembly) + ".dna.primary_assembly.fa.gz")
pick_order = "biotype,canonical,appris,tsl,ccds,rank,length"
vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --regulatory --domains --symbol --protein --ccds --uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq"
pick_order = "canonical,appris,tsl,biotype,ccds,rank,length"
vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --domains --symbol --protein --ccds --uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq"
vep_options = "--vcf --quiet --check_ref --flag_pick_allele --pick_order " + pick_order + " --force_overwrite --species homo_sapiens --assembly " + str(vep_assembly) + " --offline --fork " + str(config_options['other']['n_vep_forks']) + " " + str(vep_flags) + " --dir /usr/local/share/vep/data"
if config_options['other']['vep_skip_intergenic'] == 1:
vep_options = vep_options + " --no_intergenic"
Expand Down Expand Up @@ -372,7 +371,7 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
gvanno_vcf2tsv_command_all = str(docker_command_run2) + "vcf2tsv.py " + str(output_vcf) + " --compress --keep_rejected " + str(output_tsv) + docker_command_run_end
logger.info("Conversion of VCF variant data to records of tab-separated values - PASS variants only")
check_subprocess(gvanno_vcf2tsv_command_pass)
logger.info("Conversion of VCF variant data to records of tab-separated values - PASS and non-PASS variants)")
logger.info("Conversion of VCF variant data to records of tab-separated values - PASS and non-PASS variants")
check_subprocess(gvanno_vcf2tsv_command_all)
logger.info("Finished")

Expand Down
1 change: 0 additions & 1 deletion src/Dockerfile
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
###################################################
# Stage 1 - docker container to build ensembl-vep #
###################################################
#FROM ensemblorg/ensembl-vep:latest

FROM ubuntu:16.04 as builder

Expand Down
Binary file modified src/gvanno.tgz
Binary file not shown.
96 changes: 29 additions & 67 deletions src/gvanno/gvanno_summarise.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,10 +36,8 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
out_vcf = re.sub(r'\.vcf(\.gz){0,}$','.annotated.vcf',query_vcf)

meta_vep_dbnsfp_info = annoutils.vep_dbnsfp_meta_vcf(query_vcf, vcf_infotags_meta)
vep_csq_index2fields = meta_vep_dbnsfp_info['vep_csq_index2fields']
vep_csq_fields2index = meta_vep_dbnsfp_info['vep_csq_fields2index']
dbnsfp_prediction_algorithms = meta_vep_dbnsfp_info['dbnsfp_prediction_algorithms']

vep_csq_fields_map = meta_vep_dbnsfp_info['vep_csq_fieldmap']
vcf = VCF(query_vcf)
for tag in vcf_infotags_meta:
if lof_prediction == 0:
Expand All @@ -52,10 +50,18 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
w = Writer(out_vcf, vcf)
current_chrom = None
num_chromosome_records_processed = 0
gvanno_xref_map = {'ENSEMBL_TRANSCRIPT_ID':0, 'ENSEMBL_GENE_ID':1, 'SYMBOL':2, 'ENTREZ_ID':3, 'UNIPROT_ID':4, 'APPRIS':5,'UNIPROT_ACC':6,
'REFSEQ_MRNA':7, 'CORUM_ID':8,'TUMOR_SUPPRESSOR':9,'ONCOGENE':10,'DISGENET_CUI':11,'MIM_PHENOTYPE_ID':12}
gvanno_xref_map = {'ENSEMBL_TRANSCRIPT_ID':0, 'ENSEMBL_GENE_ID':1, 'ENSEMBL_PROTEIN_ID':2, 'SYMBOL':3, 'ENTREZ_ID':4, 'UNIPROT_ID':5, 'APPRIS':6,'UNIPROT_ACC':7,
'REFSEQ_MRNA':8, 'CORUM_ID':9,'TUMOR_SUPPRESSOR':10,'ONCOGENE':11,'DISGENET_CUI':12,'MIM_PHENOTYPE_ID':13}

vcf_info_element_types = {}
for e in vcf.header_iter():
header_element = e.info()
if 'ID' in header_element and 'HeaderType' in header_element and 'Type' in header_element:
identifier = str(header_element['ID'])
fieldtype = str(header_element['Type'])
vcf_info_element_types[identifier] = fieldtype

for rec in vcf:
all_transcript_consequences = []
if current_chrom is None:
current_chrom = str(rec.CHROM)
num_chromosome_records_processed = 0
Expand All @@ -70,73 +76,29 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
variant_id = 'g.' + str(rec.CHROM) + ':' + str(pos) + str(rec.REF) + '>' + alt_allele
logger.warning('Variant record ' + str(variant_id) + ' does not have CSQ tag from Variant Effect Predictor (vep_skip_intergenic in config set to true?) - variant will be skipped')
continue
gvanno_xref = {}
num_chromosome_records_processed += 1
if not rec.INFO.get('GVANNO_XREF') is None:
for transcript_xref in rec.INFO.get('GVANNO_XREF').split(','):
xrefs = transcript_xref.split('|')
ensembl_transcript_id = str(xrefs[0])
gvanno_xref[ensembl_transcript_id] = {}
for annotation in gvanno_xref_map.keys():
annotation_index = gvanno_xref_map[annotation]
if annotation_index > (len(xrefs) - 1):
continue
if xrefs[annotation_index] != '':
gvanno_xref[ensembl_transcript_id][annotation] = xrefs[annotation_index]
gvanno_xref = annoutils.make_transcript_xref_map(rec, gvanno_xref_map, xref_tag = "GVANNO_XREF")

for identifier in ['CSQ','DBNSFP']:
if identifier == 'CSQ':
num_picks = 0
for csq in rec.INFO.get(identifier).split(','):
csq_fields = csq.split('|')
if csq_fields[vep_csq_fields2index['PICK']] == "1": ## only consider the primary/picked consequence when expanding with annotation tags
num_picks += 1
j = 0
## loop over all CSQ elements and set them in the vep_info_tags dictionary (for each alt_allele)
while(j < len(csq_fields)):
if j in vep_csq_index2fields:
if csq_fields[j] != '':
rec.INFO[vep_csq_index2fields[j]] = str(csq_fields[j])
if vep_csq_index2fields[j] == 'Feature':
ensembl_transcript_id = str(csq_fields[j])
if ensembl_transcript_id in gvanno_xref:
for annotation in gvanno_xref_map.keys():
if annotation in gvanno_xref[ensembl_transcript_id]:
if annotation == 'TUMOR_SUPPRESSOR' or annotation == 'ONCOGENE':
rec.INFO[annotation] = True
else:
rec.INFO[annotation] = gvanno_xref[ensembl_transcript_id][annotation]
if vep_csq_index2fields[j] == 'DOMAINS':
domain_identifiers = str(csq_fields[j]).split('&')
for v in domain_identifiers:
if v.startswith('Pfam_domain'):
rec.INFO['PFAM_DOMAIN'] = str(re.sub(r'\.[0-9]{1,}$','',re.sub(r'Pfam_domain:','',v)))

if vep_csq_index2fields[j] == 'Existing_variation':
var_identifiers = str(csq_fields[j]).split('&')
cosmic_identifiers = []
dbsnp_identifiers = []
for v in var_identifiers:
if v.startswith('COSM'):
cosmic_identifiers.append(v)
if v.startswith('rs'):
dbsnp_identifiers.append(v)
if len(cosmic_identifiers) > 0:
rec.INFO['COSMIC_MUTATION_ID'] = '&'.join(cosmic_identifiers)
if len(dbsnp_identifiers) > 0:
rec.INFO['DBSNPRSID'] = '&'.join(dbsnp_identifiers)

j = j + 1
annoutils.set_coding_change(rec)
symbol = '.'
if csq_fields[vep_csq_fields2index['SYMBOL']] != "":
symbol = str(csq_fields[vep_csq_fields2index['SYMBOL']])
consequence_entry = str(csq_fields[vep_csq_fields2index['Consequence']]) + ':' + str(symbol) + ':' + str(csq_fields[vep_csq_fields2index['Feature_type']]) + ':' + str(csq_fields[vep_csq_fields2index['Feature']]) + ':' + str(csq_fields[vep_csq_fields2index['BIOTYPE']])
all_transcript_consequences.append(consequence_entry)

csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')
if identifier == 'DBNSFP':
if not rec.INFO.get('DBNSFP') is None:
annoutils.map_variant_effect_predictors(rec, dbnsfp_prediction_algorithms)
rec.INFO['VEP_ALL_CONSEQUENCE'] = ','.join(all_transcript_consequences)
if 'vep_all_csq' in csq_record_results:
rec.INFO['VEP_ALL_CSQ'] = ','.join(csq_record_results['vep_all_csq'])
if 'vep_block' in csq_record_results:
vep_csq_records = csq_record_results['vep_block']
block_idx = 0
record = vep_csq_records[block_idx]
for k in record:
if k in vcf_info_element_types:
if vcf_info_element_types[k] == "Flag" and record[k] == "1":
rec.INFO[k] = True
else:
if not record[k] is None:
rec.INFO[k] = record[k]

w.write_record(rec)
w.close()
logger.info('Completed summary of functional annotations for ' + str(num_chromosome_records_processed) + ' variants on chromosome ' + str(current_chrom))
Expand Down
Binary file modified src/gvanno/lib/__pycache__/annoutils.cpython-36.pyc
Binary file not shown.
Loading

0 comments on commit 8b5e122

Please sign in to comment.