Our submissions to the CAGI6 community assessment.

elaspic/cagi6-submission

CAGI6 community assessment

Introduction

ELASPIC3 (EL3) is a gradient-boosted decision tree model that uses features generated by pretrained deep neural networks to predict the effect of mutations. In contrast to its predecessor ELASPIC2 [1], ELASPIC3 incorporates features extracted from multiple sequence alignments (MSAs), including the embeddings produced by AlphaFold [5]. ELASPIC3 is trained solely to predict whether a mutation is deleterious or benign, using mutations in the UniProt humsavar.txt file and the CAGI6 Sherloc dataset as the training data.

Methods

Feature generation

| Name | Description |
| --- | --- |
| AlphaFold | AlphaFold [5] is a deep neural network that takes as input an MSA and, optionally, structural templates, and predicts the structure of the protein as well as other properties, including the probability distribution for each residue in the MSA. |
| MSA | The multiple sequence alignment (MSA) is processed to extract various statistics, including the number of times the wild-type and mutant residues appear at a given position. |
| ProtBert | ProtBert [3] is a deep neural network trained to reconstruct masked amino acids in millions of protein sequences. |
| ProteinSolver | ProteinSolver [2] is a graph neural network trained to reconstruct masked amino acids given the distogram describing the topology of the protein. |
| Rosetta ΔΔG | Rosetta [4] uses a semi-empirical energy function to evaluate the stability of the wild-type and mutant proteins. We use the cartesian_ddg protocol to obtain energy terms that are averaged over different protein conformations. |
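
As an illustration of the MSA statistics described in the table above, the following is a minimal sketch of counting wild-type and mutant residues in one alignment column. The function name, the returned keys, and the toy alignment are hypothetical and are not taken from the ELASPIC3 notebooks; the sketch assumes the first row of the alignment is the query sequence and that gaps are written as `-` or `.`.

```python
from collections import Counter


def msa_column_counts(msa, position, wild_type, mutant):
    """Count wild-type and mutant residues in one alignment column.

    `msa` is a list of equal-length aligned sequences whose first row is the
    query sequence; `position` is a 0-based column index. Gaps are ignored.
    """
    column = [seq[position] for seq in msa]
    if column[0] != wild_type:
        raise ValueError(
            f"expected {wild_type} at column {position}, found {column[0]}"
        )
    # Tally residues in the column, skipping gap characters.
    counts = Counter(res for res in column if res not in "-.")
    total = sum(counts.values())
    return {
        "wt_count": counts[wild_type],
        "mut_count": counts[mutant],
        "wt_freq": counts[wild_type] / total,
        "mut_freq": counts[mutant] / total,
    }


# Toy alignment (hypothetical data); the mutation of interest is V3I.
msa = ["MKVLA", "MKILA", "MRVLA", "M-VLG", "MKILA"]
features = msa_column_counts(msa, position=2, wild_type="V", mutant="I")
print(features)
```

A mutant residue that is common in the column suggests the substitution is tolerated among homologs, which is why such counts can be informative for a deleteriousness classifier.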

Final model performance

On the CAGI6 Sherloc progress tracker, the final ELASPIC3 model achieved an AUC of 0.946, a recall of 0.866 at a precision of 0.8, and a true negative rate (TNR) of 0.697 at a negative predictive value (NPV) of 0.95. It does not appear to be the best-performing method; nevertheless, it performs better than ELASPIC2.
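
To make the three metrics above concrete, here is a small pure-Python sketch of how they could be computed from labels and scores. The exact definitions used by the CAGI6 progress tracker are not given here, so this assumes that "recall at 0.8 precision" means the highest recall among thresholds whose precision is at least 0.8, and similarly for TNR at 0.95 NPV. All function names and the toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_at(labels, scores, threshold):
    """Return (tp, fp, tn, fn) when scores >= threshold are called deleterious."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def recall_at_precision(labels, scores, min_precision=0.8):
    """Highest recall over thresholds whose precision is at least min_precision."""
    best = 0.0
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion_at(labels, scores, t)
        if tp + fp and tp / (tp + fp) >= min_precision:
            best = max(best, tp / (tp + fn))
    return best


def tnr_at_npv(labels, scores, min_npv=0.95):
    """Highest true negative rate over thresholds whose NPV is at least min_npv."""
    best = 0.0
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion_at(labels, scores, t)
        if tn + fn and tn / (tn + fn) >= min_npv:
            best = max(best, tn / (tn + fp))
    return best


# Toy data (hypothetical): 1 = deleterious, 0 = benign.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
recall = recall_at_precision(labels, scores)
tnr = tnr_at_npv(labels, scores)
print(auc, recall, tnr)
```

The sketch assumes both classes are present in `labels`. The pairwise AUC is quadratic in the number of examples, which is fine for illustration but not for large datasets.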

Ablation experiments

We evaluated the relative contribution of each feature generation method to the final performance of ELASPIC3 by training models that use features generated using all but one of the methods. The most substantial decrease in performance arises when we exclude features generated using AlphaFold.

Supervised performance

We also evaluated the ability of each method independently to predict the effect of mutations by training models that use features generated using only one of the methods. Using only the features generated with AlphaFold, we are able to train a model that shows comparable performance to the ELASPIC3 model trained using the entire feature set.
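
The two experimental designs described above (all methods but one, and one method only) amount to selecting different subsets of feature columns before training. The sketch below shows one way to enumerate those subsets. The group names mirror the feature generation table, but the column names are hypothetical placeholders and the training step itself is omitted.

```python
# Hypothetical feature columns, grouped by the method that generated them.
FEATURE_GROUPS = {
    "alphafold": ["af_logit_wt", "af_logit_mut"],
    "msa": ["msa_wt_count", "msa_mut_count"],
    "protbert": ["protbert_score"],
    "proteinsolver": ["proteinsolver_score"],
    "rosetta": ["rosetta_ddg"],
}


def leave_one_group_out(groups):
    """Yield (excluded_group, remaining_columns) for the ablation experiments."""
    for excluded in groups:
        remaining = [
            col
            for name, cols in groups.items()
            if name != excluded
            for col in cols
        ]
        yield excluded, remaining


def single_group(groups):
    """Yield (group, columns) for the models trained on a single method."""
    for name, cols in groups.items():
        yield name, list(cols)


ablation_sets = dict(leave_one_group_out(FEATURE_GROUPS))
single_sets = dict(single_group(FEATURE_GROUPS))
print(ablation_sets["alphafold"])
print(single_sets["alphafold"])
```

Each resulting column list would then be used to train and evaluate a separate model, so that differences in performance can be attributed to the excluded or retained feature group.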

Unsupervised (one-shot) performance

Finally, we evaluated the ability of the different methods to predict the effect of mutations without supervised fine-tuning. For each method, we select the single feature with the highest AUC for predicting mutation deleteriousness and compare the methods on that feature.
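
The per-method feature selection described above could be sketched as follows. This is an illustration under assumptions rather than the code used for the analysis: the feature names and values are made up, and the sketch treats a feature as equally useful whether it correlates positively or negatively with deleteriousness by scoring it as max(AUC, 1 - AUC).

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_feature_per_method(labels, features_by_method):
    """Return {method: (feature_name, auc)} for the top-scoring feature of each method."""
    best = {}
    for method, features in features_by_method.items():
        scored = {}
        for name, values in features.items():
            auc = roc_auc(labels, values)
            # Assumption: the sign of a raw feature is arbitrary, so flip if needed.
            scored[name] = max(auc, 1.0 - auc)
        top = max(scored, key=scored.get)
        best[method] = (top, scored[top])
    return best


# Toy data (hypothetical): 1 = deleterious, 0 = benign.
labels = [1, 1, 0, 0]
features_by_method = {
    "rosetta": {
        "ddg_total": [3.1, 0.4, 0.9, -0.2],
        "ddg_fa_rep": [0.5, 0.1, 0.4, 0.3],
    },
    "msa": {
        "mut_count": [0, 2, 5, 9],
        "wt_count": [9, 5, 4, 7],
    },
}
best = best_feature_per_method(labels, features_by_method)
print(best)
```

Selecting the best feature on the same data used for evaluation gives an optimistic estimate, so the resulting AUCs are best read as an upper bound on what each method can achieve without training.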

Individual submissions

Calmodulin

Overview: http://genomeinterpretation.org/cagi6-cam.html.

Our submission: https://www.synapse.org/#!Synapse:syn26145788/files/.

Relevant notebooks

| Name | Description |
| --- | --- |
| 🗒 40_cagi6_cam_submission.ipynb | Load the CAM dataset, generate features, and make predictions. |

Submission files

| Filename | Description |
| --- | --- |
| strokach_modelnumber_1.tsv | Predictions made using ELASPIC2 [1]. |
| strokach_modelnumber_2.tsv | Predictions made using ProteinSolver [2]. |
| strokach_modelnumber_3.tsv | Predictions made using ProtBert [3]. |
| strokach_modelnumber_4.tsv | Predictions made using Rosetta's cartesian_ddg protocol [4]. |
| strokach_modelnumber_5.tsv | Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein. |
| strokach_modelnumber_6.tsv | Predictions made using ELASPIC3 [900500fe] with AlphaFold [5] features for wildtype and mutant proteins. |

MAPK1

Overview: http://genomeinterpretation.org/cagi6-mapk1.html.

Our submission: https://www.synapse.org/#!Synapse:syn26145783/files/.

Relevant notebooks

| Name | Description |
| --- | --- |
| 🗒 40_cagi6_mapk1_submission.ipynb | Load the MAPK1 dataset, generate features, and make predictions. |

Submission files

| Filename | Description |
| --- | --- |
| strokach_modelnumber_1.tsv | Predictions made using ELASPIC2 [1]. |
| strokach_modelnumber_2.tsv | Predictions made using ProteinSolver [2]. |
| strokach_modelnumber_3.tsv | Predictions made using ProtBert [3]. |
| strokach_modelnumber_4.tsv | Predictions made using Rosetta's cartesian_ddg protocol [4]. |
| strokach_modelnumber_5.tsv | Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein. |
| strokach_modelnumber_6.tsv | Predictions made using ELASPIC3 [900500fe] with AlphaFold [5] features for wildtype and mutant proteins. |

MAPK3

Overview: http://genomeinterpretation.org/cagi6-mapk3.html.

Our submission: https://www.synapse.org/#!Synapse:syn26145778/files/.

Relevant notebooks

| Name | Description |
| --- | --- |
| 🗒 40_cagi6_mapk3_submission.ipynb | Load the MAPK3 dataset, generate features, and make predictions. |

Submission files

| Filename | Description |
| --- | --- |
| strokach_modelnumber_1.tsv | Predictions made using ELASPIC2 [1]. |
| strokach_modelnumber_2.tsv | Predictions made using ProteinSolver [2]. |
| strokach_modelnumber_3.tsv | Predictions made using ProtBert [3]. |
| strokach_modelnumber_4.tsv | Predictions made using Rosetta's cartesian_ddg protocol [4]. |
| strokach_modelnumber_5.tsv | Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein. |
| strokach_modelnumber_6.tsv | Predictions made using ELASPIC3 [900500fe] with AlphaFold [5] features for wildtype and mutant proteins. |

MTHFR

Overview: http://genomeinterpretation.org/cagi6-mthfr.html.

Our submission: https://www.synapse.org/#!Synapse:syn25891794/files/.

Relevant notebooks

| Name | Description |
| --- | --- |
| 🗒 35_cagi_mthfr_predictions.ipynb | Load the MTHFR dataset, generate features, and make predictions. |
| 🗒 40_cagi6_mthfr_submission.ipynb | Prepare submission for the CAGI6 challenge. |

Submission files

| Filenames | Description |
| --- | --- |
| ostrokach_cataWT_model_1.tsv<br>ostrokach_cataAV_model_1.tsv<br>ostrokach_reguWT_model_1.tsv<br>ostrokach_reguAV_model_1.tsv | Predictions were made using ELASPIC2 [1] and were adjusted to match the target distribution. |
| ostrokach_cataWT_model_2.tsv<br>ostrokach_cataAV_model_2.tsv<br>ostrokach_reguWT_model_2.tsv<br>ostrokach_reguAV_model_2.tsv | Predictions were made using ProteinSolver [2] and were adjusted to match the target distribution. |
| ostrokach_cataWT_model_3.tsv<br>ostrokach_cataAV_model_3.tsv<br>ostrokach_reguWT_model_3.tsv<br>ostrokach_reguAV_model_3.tsv | Predictions were made using ProtBert [3] and were adjusted to match the target distribution. |
| ostrokach_cataWT_model_4.tsv<br>ostrokach_cataAV_model_4.tsv<br>ostrokach_reguWT_model_4.tsv<br>ostrokach_reguAV_model_4.tsv | Predictions were made using ELASPIC2 [1] without any subsequent adjustment. |
| ostrokach_cataWT_model_5.tsv<br>ostrokach_cataAV_model_5.tsv<br>ostrokach_reguWT_model_5.tsv<br>ostrokach_reguAV_model_5.tsv | Predictions were made using ProteinSolver [2] without any subsequent adjustment. |
| ostrokach_cataWT_model_6.tsv<br>ostrokach_cataAV_model_6.tsv<br>ostrokach_reguWT_model_6.tsv<br>ostrokach_reguAV_model_6.tsv | Predictions were made using ProtBert [3] without any subsequent adjustment. |
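
The table above does not say how predictions were "adjusted to match the target distribution". One common way to do this is rank-based quantile matching, sketched below purely as an illustration; it is an assumption, not a description of what the submission notebooks do. The function name and the toy values are hypothetical.

```python
def match_distribution(predictions, target_values):
    """Map predictions onto the empirical quantiles of target_values.

    The rank order of the predictions is preserved; only their values change.
    """
    target_sorted = sorted(target_values)
    # Indices of the predictions, ordered from lowest to highest score.
    order = sorted(range(len(predictions)), key=lambda i: predictions[i])
    n, m = len(predictions), len(target_sorted)
    adjusted = [0.0] * n
    for rank, idx in enumerate(order):
        quantile = (rank + 0.5) / n  # mid-point quantile of this rank
        adjusted[idx] = target_sorted[min(int(quantile * m), m - 1)]
    return adjusted


# Toy values (hypothetical): raw model scores and example target measurements.
raw = [-1.2, 0.3, 2.5, 0.9]
target = [0.0, 0.5, 1.0, 1.0]
adjusted = match_distribution(raw, target)
print(adjusted)
```

Because the mapping is monotonic, rank-based metrics such as AUC or Spearman correlation are unchanged by this kind of adjustment; it only affects metrics that depend on the scale of the predicted values.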

HMBS

Overview: http://genomeinterpretation.org/cagi6-hmbs.html.

Our submission: https://www.synapse.org/#!Synapse:syn26159218/files/.

Relevant notebooks

| Name | Description |
| --- | --- |
| 🗒 30_cagi6_hmbs.ipynb | Load the HMBS dataset. |
| 🗒 35_cagi6_hmbs_alphafold.ipynb | Generate AlphaFold2 features. |
| 🗒 35_cagi6_hmbs_el2.ipynb | Generate ELASPIC2 scores and features. |
| 🗒 35_cagi6_hmbs_rosetta.ipynb | Generate Rosetta scores and features. |
| 🗒 40_cagi6_hmbs_submission.ipynb | Prepare submission for the CAGI6 challenge. |

Submission files

| Filename | Description |
| --- | --- |
| strokach_modelnumber_1.tsv | Predictions made using ELASPIC2 [1]. |
| strokach_modelnumber_2.tsv | Predictions made using ProteinSolver [2]. |
| strokach_modelnumber_3.tsv | Predictions made using ProtBert [3]. |
| strokach_modelnumber_4.tsv | Predictions made using Rosetta's cartesian_ddg protocol [4]. |
| strokach_modelnumber_5.tsv | Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein. |

Sherloc clinical classification

For the Sherloc clinical classification challenge, we trained new models using both the provided training data and the mutations listed in the UniProt humsavar.txt file.

Overview: http://genomeinterpretation.org/cagi6-invitae.html.

Our submission: https://www.synapse.org/#!Synapse:syn26272013/files/.

Relevant notebooks

| Name | Description |
| --- | --- |
| 🗒 30_cagi6_sherloc.ipynb | Load the Sherloc dataset. |
| 🗒 30_humsavar.ipynb | Load the humsavar dataset. |
| 🗒 31_run_alphafold_wt.ipynb | Generate AlphaFold embeddings. |
| 🗒 31_run_msa_analysis.ipynb | Generate basic MSA features. |
| 🗒 31_run_protbert.ipynb | Generate ProtBert features. |
| 🗒 31_run_proteinsolver.ipynb | Generate ProteinSolver features. |
| 🗒 31_run_rosetta_ddg.ipynb | Generate Rosetta features. |
| 🗒 32_process_alphafold.ipynb | Process AlphaFold embeddings into features. |
| 🗒 37_cagi6_sherloc_combine_results.ipynb | Combine features generated using all methods for the Sherloc dataset. |
| 🗒 37_humsavar_combine_results.ipynb | Combine features generated using all methods for the humsavar dataset. |
| 🗒 38_cagi6_sherloc_train_model.ipynb | Train a machine learning model using Sherloc + humsavar data. |
| 🗒 39_cagi6_sherloc_finetune_model.ipynb | Fine-tune the trained machine learning model and perform feature elimination. |
| 🗒 40_cagi6_sherloc_submission.ipynb | Make predictions for the test dataset and prepare submission for the CAGI6 challenge. |

Submission files

| Filename | Description |
| --- | --- |
| strokach_modelnumber_1.tsv | Predictions made using ELASPIC2 with AlphaFold [5] features for wildtype protein (trained using both Sherloc and humsavar data). All available AlphaFold embeddings were used by the model. |
| strokach_modelnumber_2.tsv | Predictions made using ELASPIC2 with AlphaFold [5] features for wildtype protein (trained using both Sherloc and humsavar data). |
| strokach_modelnumber_3.tsv | Predictions made using ELASPIC2 with AlphaFold [5] features for wildtype protein (trained only using Sherloc data). |
| strokach_modelnumber_4.tsv | Predictions made using ELASPIC2 with AlphaFold [5] features for wildtype and mutant proteins (trained only using Sherloc data). |
| strokach_modelnumber_5.tsv | Predictions made using ELASPIC2 [1]. |
| strokach_modelnumber_6.tsv | Predictions made using AlphaFold [5]. |

References
