Skip to content

Latest commit

 

History

History
293 lines (211 loc) · 21.3 KB

README.md

File metadata and controls

293 lines (211 loc) · 21.3 KB

CAGI6 community assessment

Introduction

ELASPIC3 (EL3) is a gradient-boosted decision tree model which uses features generated using pretrained deep neural networks to predict the effect of mutations. In contrast to its predecessor ELASPIC2 [1], ELASPIC3 incorporates features extracted from multiple sequence alignments (MSAs), including the embeddings produced by AlphaFold [5]. ELASPIC3 is trained solely to predict whether a mutation is deleterious or benign, using mutations in the UniParc humsavar.txt file and the CAGI6 Sherloc dataset as the training data.

Methods

Feature generation

Name Description
AlphaFold AlphaFold [5] is a deep neural network which takes as input an MSA and, optionally, structural templates, and predicts the structure of the protein as well as other properties, including the probability distribution for each residue in the MSA.
MSA The multiple sequence alignment (MSA) is processed to extract various statistics, including the number of times the wild type and mutant residues appear in a given position.
ProtBERT ProtBERT [3] is a deep neural network trained to reconstruct masked amino acids in millions of protein sequences.
ProteinSolver ProteinSolver [2] is a graph neural network trained to reconstruct masked amino acids given the distagram describing the topology of the protein.
Rosetta ΔΔG Rosetta [2] uses a semi-empirical energy function to evaluate the stability of the wild type and the mutant proteins. We use the cartesian_ddg protocol to obtain energy terms that are averaged over different protein conformations.

Final model performance

On the CAGI6 Sherloc progress tracker, the final ELASPIC3 model achieved an AUC of 0.946, a recall at 0.8 precision of 0.866, and a TNR at 0.95 NPV of 0.697. It does not appear to be the best-performing method... Nevertheless, it still performs better than ELASPIC2.

Ablation experiments

We evaluated the relative contribution of each feature generation method to the final performance of ELASPIC3 by training models that use features generated using all but one of the methods. The most substantial decrease in performance arises when we exclude features generated using AlphaFold.

Supervised performance

We also evaluated the ability of each method independently to predict the effect of mutations by training models that use features generated using only one of the methods. Using only the features generated with AlphaFold, we are able to train a model that shows comparable performance to the ELASPIC3 model trained using the entire feature set.

Unsupervised (one-shot) performance

Finally, we evaluated the ability of the different models to predict the effect of mutations without supervised fine-tuning. We compare methods based on a single feature from each method that shows the highest AUC in predicting mutation deleteriousness.

Individual submissions

Calmodulin

Overview: http://genomeinterpretation.org/cagi6-cam.html.

Our submission: https://www.synapse.org/#!Synapse:syn26145788/files/.

Relevant notebooks
Name Description
🗒 40_cagi6_cam_submission.ipynb Load the CAM dataset, generate features, and make predictions.
Submission files
Filename Description
strokach_modelnumber_1.tsv Predictions made using ELASPIC2 [1].
strokach_modelnumber_2.tsv Predictions made using ProteinSolver [2].
strokach_modelnumber_3.tsv Predictions made using ProtBert [3].
strokach_modelnumber_4.tsv Predictions made using Rosetta's cartesian_ddg protocol [4].
strokach_modelnumber_5.tsv Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein .
strokach_modelnumber_6.tsv Predictions made using ELASPIC3 [900500fe] with AlphaFold [5] features for wildtype and mutant proteins .

MAPK1

Overview: http://genomeinterpretation.org/cagi6-mapk1.html.

Our submission: https://www.synapse.org/#!Synapse:syn26145783/files/.

Relevant notebooks
Name Description
🗒 40_cagi6_mapk1_submission.ipynb Load the MAPk1 dataset, generate features, and make predictions.
Submission files
Filename Description
strokach_modelnumber_1.tsv Predictions made using ELASPIC2 [1].
strokach_modelnumber_2.tsv Predictions made using ProteinSolver [2].
strokach_modelnumber_3.tsv Predictions made using ProtBert [3].
strokach_modelnumber_4.tsv Predictions made using Rosetta's cartesian_ddg protocol [4].
strokach_modelnumber_5.tsv Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein .
strokach_modelnumber_6.tsv Predictions made using ELASPIC3 [900500fe] with AlphaFold [5] features for wildtype and mutant proteins .

MAPK3

Overview: http://genomeinterpretation.org/cagi6-mapk3.html.

Our submission: https://www.synapse.org/#!Synapse:syn26145778/files/.

Relevant notebooks
Name Description
🗒 40_cagi6_mapk3_submission.ipynb Load the MAPk3 dataset, generate features, and make predictions.
Submission files
Filename Description
strokach_modelnumber_1.tsv Predictions made using ELASPIC2 [1].
strokach_modelnumber_2.tsv Predictions made using ProteinSolver [2].
strokach_modelnumber_3.tsv Predictions made using ProtBert [3].
strokach_modelnumber_4.tsv Predictions made using Rosetta's cartesian_ddg protocol [4].
strokach_modelnumber_5.tsv Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein .
strokach_modelnumber_6.tsv Predictions made using ELASPIC3 [900500fe] with AlphaFold [5] features for wildtype and mutant proteins .

MTHFR

Overview: http://genomeinterpretation.org/cagi6-mthfr.html.

Our submission: https://www.synapse.org/#!Synapse:syn25891794/files/.

Relevant notebooks
Name Description
🗒 35_cagi_mthfr_predictions.ipynb Load the MTHFR dataset, generate features, and make predictions.
🗒 40_cagi6_mthfr_submission.ipynb Prepare submission for the CAGI6 challenge.
Submission files
Filenames Description
ostrokach_cataWT_model_1.tsv
ostrokach_cataAV_model_1.tsv
ostrokach_reguWT_model_1.tsv
ostrokach_reguAV_model_1.tsv
Predictions were made using ELASPIC2 [1] and were adjusted to match the target distribution.
ostrokach_cataWT_model_2.tsv
ostrokach_cataAV_model_2.tsv
ostrokach_reguWT_model_2.tsv
ostrokach_reguAV_model_2.tsv
Predictions were made using ProteinSolver [2] and were adjusted to match the target distribution.
ostrokach_cataWT_model_3.tsv
ostrokach_cataAV_model_3.tsv
ostrokach_reguWT_model_3.tsv
ostrokach_reguAV_model_3.tsv
Predictions were made using ProtBert [3] and were adjusted to match the target distribution.
ostrokach_cataWT_model_4.tsv
ostrokach_cataAV_model_4.tsv
ostrokach_reguWT_model_4.tsv
ostrokach_reguAV_model_4.tsv
Predictions were made using ELASPIC2 [1] without any subsequent adjustment.
ostrokach_cataWT_model_5.tsv
ostrokach_cataAV_model_5.tsv
ostrokach_reguWT_model_5.tsv
ostrokach_reguAV_model_5.tsv
Predictions were made using ProteinSolver [2] without any subsequent adjustment.
ostrokach_cataWT_model_6.tsv
ostrokach_cataAV_model_6.tsv
ostrokach_reguWT_model_6.tsv
ostrokach_reguAV_model_6.tsv
Predictions were made using ProtBert [3] without any subsequent adjustment.

HMBS

Overview: http://genomeinterpretation.org/cagi6-hmbs.html.

Our submission: https://www.synapse.org/#!Synapse:syn26159218/files/.

Relevant notebooks
Name Description
🗒 30_cagi6_hmbs.ipynb Load the HMBS dataset.
🗒 35_cagi6_hmbs_alphafold.ipynb Generate AlphaFold2 features.
🗒 35_cagi6_hmbs_el2.ipynb Generate ELASPIC2 scores and features.
🗒 35_cagi6_hmbs_rosetta.ipynb Generate Rosetta scores and features.
🗒 40_cagi6_hmbs_submission.ipynb Prepare submission for the CAGI6 challenge.
Submission files
Filename Description
strokach_modelnumber_1.tsv Predictions made using ELASPIC2 [1].
strokach_modelnumber_2.tsv Predictions made using ProteinSolver [2].
strokach_modelnumber_3.tsv Predictions made using ProtBert [3].
strokach_modelnumber_4.tsv Predictions made using Rosetta's cartesian_ddg protocol [4].
strokach_modelnumber_5.tsv Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein .

Sherloc clinical classification

For the Sherloc clinical classification challenge, we trained new models using both the provided training data and the mutations listed in the UniProt humsavar.txt file.

Overview: http://genomeinterpretation.org/cagi6-invitae.html.

Our submission: https://www.synapse.org/#!Synapse:syn26272013/files/.

Relevant notebooks
Name Description
🗒 30_cagi6_sherloc.ipynb Load the Sherloc dataset.
🗒 30_humsavar.ipynb Load the humsavar dataset.
🗒 31_run_alphafold_wt.ipynb Generate AlphaFold embeddings.
🗒 31_run_msa_analysis.ipynb Generate basic MSA features.
🗒 31_run_protbert.ipynb Generate ProtBert features.
🗒 31_run_proteinsolver.ipynb Generate ProteinSolver features.
🗒 31_run_rosetta_ddg.ipynb Generate Rosetta features.
🗒 32_process_alphafold.ipynb Process AlphaFold embeddings into features.
🗒 37_cagi6_sherloc_combine_results.ipynb Combine features generated using all methods for the Sherloc dataset.
🗒 37_humsavar_combine_results.ipynb Combine features generated using all methods for the humsavar dataset.
🗒 38_cagi6_sherloc_train_model.ipynb Train a machine learning model using Sherloc + humsavar data.
🗒 39_cagi6_sherloc_finetune_model.ipynb Finetune the trained machine learning model and perform feature elimination.
🗒 40_cagi6_sherloc_submission.ipynb Make predictions for the test dataset and prepare submission for the CAGI6 challenge.
Submission files
Filename Description
strokach_modelnumber_1.tsv Predictions made using ELASPIC2 with AlphaFold [4] features for wildtype protein (trained using both Sherloc and humsavar data). All available AlphaFold embeddings were used by the model.
strokach_modelnumber_2.tsv Predictions made using ELASPIC2 with AlphaFold [4] features for wildtype protein (trained using both Sherloc and humsavar data).
strokach_modelnumber_3.tsv Predictions made using ELASPIC2 with AlphaFold [4] features for wildtype protein (trained only using Sherloc data).
strokach_modelnumber_4.tsv Predictions made using ELASPIC2 with AlphaFold [4] features for wildtype and mutant proteins (trained only using Sherloc data).
strokach_modelnumber_5.tsv Predictions made using ELASPIC2 [1].
strokach_modelnumber_6.tsv Predictions made using AlphaFold [4].

References