ELASPIC3 (EL3) is a gradient-boosted decision tree model that uses features generated by pretrained deep neural networks to predict the effect of mutations. In contrast to its predecessor ELASPIC2 [1], ELASPIC3 incorporates features extracted from multiple sequence alignments (MSAs), including the embeddings produced by AlphaFold [5]. ELASPIC3 is trained solely to predict whether a mutation is deleterious or benign, using mutations in the UniProt humsavar.txt file and the CAGI6 Sherloc dataset as the training data.
Name | Description |
---|---|
AlphaFold | AlphaFold [5] is a deep neural network which takes as input an MSA and, optionally, structural templates, and predicts the structure of the protein as well as other properties, including the probability distribution for each residue in the MSA. |
MSA | The multiple sequence alignment (MSA) is processed to extract various statistics, including the number of times the wild type and mutant residues appear in a given position. |
ProtBert | ProtBert [3] is a deep neural network trained to reconstruct masked amino acids in millions of protein sequences. |
ProteinSolver | ProteinSolver [2] is a graph neural network trained to reconstruct masked amino acids given the distogram describing the topology of the protein. |
Rosetta ΔΔG | Rosetta [4] uses a semi-empirical energy function to evaluate the stability of the wild type and the mutant proteins. We use the cartesian_ddg protocol to obtain energy terms that are averaged over different protein conformations. |
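
To make the MSA row of the table concrete, the following is a minimal sketch of counting how often the wild type and mutant residues appear in one alignment column. The function name, the returned keys, and the choice to ignore gap characters are assumptions made for illustration; the features computed in the notebooks may be defined differently.

```python
from collections import Counter


def msa_position_counts(msa, position, wild_type, mutant):
    """Count wild type and mutant residues in one MSA column.

    `msa` is a list of equal-length aligned sequences and `position` is a
    0-based column index. Gap characters ("-") are ignored (an assumption).
    """
    column = [seq[position] for seq in msa]
    counts = Counter(res for res in column if res != "-")
    total = sum(counts.values())
    return {
        "wt_count": counts[wild_type],
        "mut_count": counts[mutant],
        "wt_freq": counts[wild_type] / total if total else 0.0,
        "mut_freq": counts[mutant] / total if total else 0.0,
    }


# Toy alignment, for illustration only.
msa = ["MKVLA", "MKILA", "MRVLA", "M-VLG"]
stats = msa_position_counts(msa, position=2, wild_type="V", mutant="I")
```

Counts and frequencies like these can be used directly as numeric features for a mutation.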
On the CAGI6 Sherloc progress tracker, the final ELASPIC3 model achieved an AUC of 0.946, a recall at 0.8 precision of 0.866, and a true negative rate (TNR) at 0.95 negative predictive value (NPV) of 0.697. It does not appear to be the best-performing method... Nevertheless, it still performs better than ELASPIC2.
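
The three metrics can be illustrated with a small, dependency-free sketch. It assumes that deleterious mutations are labeled 1, that higher scores mean more deleterious, and that "recall at 0.8 precision" and "TNR at 0.95 NPV" mean the best value reachable at any threshold that satisfies the constraint. The exact definitions used by the progress tracker are not stated here, so treat these as one plausible reading.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def _confusion(y_true, y_score, threshold):
    """Confusion counts when scores >= threshold are called deleterious."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    return tp, fp, fn, tn


def recall_at_precision(y_true, y_score, min_precision=0.8):
    """Highest recall over thresholds whose precision is at least min_precision."""
    best = 0.0
    for t in sorted(set(y_score)):
        tp, fp, fn, tn = _confusion(y_true, y_score, t)
        if tp + fp and tp / (tp + fp) >= min_precision:
            best = max(best, tp / (tp + fn))
    return best


def tnr_at_npv(y_true, y_score, min_npv=0.95):
    """Highest true negative rate over thresholds whose NPV is at least min_npv."""
    best = 0.0
    for t in sorted(set(y_score)):
        tp, fp, fn, tn = _confusion(y_true, y_score, t)
        if tn + fn and tn / (tn + fn) >= min_npv:
            best = max(best, tn / (tn + fp))
    return best


# Toy labels and scores, for illustration only.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
auc = roc_auc(y_true, y_score)
recall = recall_at_precision(y_true, y_score, min_precision=0.8)
tnr = tnr_at_npv(y_true, y_score, min_npv=0.95)
```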
We evaluated the relative contribution of each feature generation method to the final performance of ELASPIC3 by training models that use features generated using all but one of the methods. The most substantial decrease in performance arises when we exclude features generated using AlphaFold.
We also evaluated the ability of each method independently to predict the effect of mutations by training models that use features generated using only one of the methods. Using only the features generated with AlphaFold, we are able to train a model that shows comparable performance to the ELASPIC3 model trained using the entire feature set.
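
Both experiments amount to selecting feature columns by the method that produced them. A minimal sketch, assuming a mapping from each method to its feature columns (the column names below are hypothetical and do not come from the notebooks):

```python
def feature_subsets(columns_by_method):
    """Return column lists for the 'all but one' and 'only one' experiments."""
    all_but_one, only_one = {}, {}
    for method in columns_by_method:
        # Features from this method alone.
        only_one[method] = list(columns_by_method[method])
        # Features from every other method.
        all_but_one[method] = [
            col
            for other, cols in columns_by_method.items()
            if other != method
            for col in cols
        ]
    return all_but_one, only_one


# Hypothetical column names, for illustration only.
columns_by_method = {
    "alphafold": ["alphafold_0", "alphafold_1"],
    "msa": ["msa_wt_count", "msa_mut_count"],
    "protbert": ["protbert_score"],
    "proteinsolver": ["proteinsolver_score"],
    "rosetta_ddg": ["rosetta_dg_change"],
}
all_but_one, only_one = feature_subsets(columns_by_method)
```

Each column list would then be passed to the same training and evaluation routine, so that differences in performance can be attributed to the feature set alone.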
Finally, we evaluated the ability of the different models to predict the effect of mutations without supervised fine-tuning. For each method, we selected the single feature with the highest AUC in predicting mutation deleteriousness and compared the methods on that feature.
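
The per-method feature selection can be sketched as follows. The sketch assumes that a feature's direction is not known in advance, so it scores each feature by `max(AUC, 1 - AUC)`; whether the original analysis flips the sign this way is an assumption, and the feature names are hypothetical.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_feature_per_method(features_by_method, y_true):
    """For each method, return (AUC, feature name) of its most predictive feature."""
    best = {}
    for method, features in features_by_method.items():
        scored = []
        for name, values in features.items():
            auc = roc_auc(y_true, values)
            # Assumption: a feature may be anti-correlated with deleteriousness.
            scored.append((max(auc, 1 - auc), name))
        best[method] = max(scored)
    return best


# Toy data with hypothetical feature names; 1 = deleterious, 0 = benign.
y_true = [0, 0, 1, 1]
features_by_method = {
    "rosetta_ddg": {
        "dg_change": [0.1, 0.5, 0.4, 0.9],
        "fa_rep": [0.3, 0.2, 0.1, 0.4],
    },
    "msa": {"wt_count": [9, 8, 2, 1]},
}
best = best_feature_per_method(features_by_method, y_true)
```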
Overview: http://genomeinterpretation.org/cagi6-cam.html.
Our submission: https://www.synapse.org/#!Synapse:syn26145788/files/.
Relevant notebooks
Name | Description |
---|---|
🗒 40_cagi6_cam_submission.ipynb | Load the CAM dataset, generate features, and make predictions. |
Submission files
Filename | Description |
---|---|
strokach_modelnumber_1.tsv | Predictions made using ELASPIC2 [1]. |
strokach_modelnumber_2.tsv | Predictions made using ProteinSolver [2]. |
strokach_modelnumber_3.tsv | Predictions made using ProtBert [3]. |
strokach_modelnumber_4.tsv | Predictions made using Rosetta's cartesian_ddg protocol [4]. |
strokach_modelnumber_5.tsv | Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein. |
strokach_modelnumber_6.tsv | Predictions made using ELASPIC3 [900500fe] with AlphaFold [5] features for wildtype and mutant proteins. |
Overview: http://genomeinterpretation.org/cagi6-mapk1.html.
Our submission: https://www.synapse.org/#!Synapse:syn26145783/files/.
Relevant notebooks
Name | Description |
---|---|
🗒 40_cagi6_mapk1_submission.ipynb | Load the MAPK1 dataset, generate features, and make predictions. |
Submission files
Filename | Description |
---|---|
strokach_modelnumber_1.tsv | Predictions made using ELASPIC2 [1]. |
strokach_modelnumber_2.tsv | Predictions made using ProteinSolver [2]. |
strokach_modelnumber_3.tsv | Predictions made using ProtBert [3]. |
strokach_modelnumber_4.tsv | Predictions made using Rosetta's cartesian_ddg protocol [4]. |
strokach_modelnumber_5.tsv | Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein. |
strokach_modelnumber_6.tsv | Predictions made using ELASPIC3 [900500fe] with AlphaFold [5] features for wildtype and mutant proteins. |
Overview: http://genomeinterpretation.org/cagi6-mapk3.html.
Our submission: https://www.synapse.org/#!Synapse:syn26145778/files/.
Relevant notebooks
Name | Description |
---|---|
🗒 40_cagi6_mapk3_submission.ipynb | Load the MAPK3 dataset, generate features, and make predictions. |
Submission files
Filename | Description |
---|---|
strokach_modelnumber_1.tsv | Predictions made using ELASPIC2 [1]. |
strokach_modelnumber_2.tsv | Predictions made using ProteinSolver [2]. |
strokach_modelnumber_3.tsv | Predictions made using ProtBert [3]. |
strokach_modelnumber_4.tsv | Predictions made using Rosetta's cartesian_ddg protocol [4]. |
strokach_modelnumber_5.tsv | Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein. |
strokach_modelnumber_6.tsv | Predictions made using ELASPIC3 [900500fe] with AlphaFold [5] features for wildtype and mutant proteins. |
Overview: http://genomeinterpretation.org/cagi6-mthfr.html.
Our submission: https://www.synapse.org/#!Synapse:syn25891794/files/.
Relevant notebooks
Name | Description |
---|---|
🗒 35_cagi_mthfr_predictions.ipynb | Load the MTHFR dataset, generate features, and make predictions. |
🗒 40_cagi6_mthfr_submission.ipynb | Prepare submission for the CAGI6 challenge. |
Submission files
Filenames | Description |
---|---|
ostrokach_cataWT_model_1.tsv, ostrokach_cataAV_model_1.tsv, ostrokach_reguWT_model_1.tsv, ostrokach_reguAV_model_1.tsv | Predictions were made using ELASPIC2 [1] and were adjusted to match the target distribution. |
ostrokach_cataWT_model_2.tsv, ostrokach_cataAV_model_2.tsv, ostrokach_reguWT_model_2.tsv, ostrokach_reguAV_model_2.tsv | Predictions were made using ProteinSolver [2] and were adjusted to match the target distribution. |
ostrokach_cataWT_model_3.tsv, ostrokach_cataAV_model_3.tsv, ostrokach_reguWT_model_3.tsv, ostrokach_reguAV_model_3.tsv | Predictions were made using ProtBert [3] and were adjusted to match the target distribution. |
ostrokach_cataWT_model_4.tsv, ostrokach_cataAV_model_4.tsv, ostrokach_reguWT_model_4.tsv, ostrokach_reguAV_model_4.tsv | Predictions were made using ELASPIC2 [1] without any subsequent adjustment. |
ostrokach_cataWT_model_5.tsv, ostrokach_cataAV_model_5.tsv, ostrokach_reguWT_model_5.tsv, ostrokach_reguAV_model_5.tsv | Predictions were made using ProteinSolver [2] without any subsequent adjustment. |
ostrokach_cataWT_model_6.tsv, ostrokach_cataAV_model_6.tsv, ostrokach_reguWT_model_6.tsv, ostrokach_reguAV_model_6.tsv | Predictions were made using ProtBert [3] without any subsequent adjustment. |
Overview: http://genomeinterpretation.org/cagi6-hmbs.html.
Our submission: https://www.synapse.org/#!Synapse:syn26159218/files/.
Relevant notebooks
Name | Description |
---|---|
🗒 30_cagi6_hmbs.ipynb | Load the HMBS dataset. |
🗒 35_cagi6_hmbs_alphafold.ipynb | Generate AlphaFold2 features. |
🗒 35_cagi6_hmbs_el2.ipynb | Generate ELASPIC2 scores and features. |
🗒 35_cagi6_hmbs_rosetta.ipynb | Generate Rosetta scores and features. |
🗒 40_cagi6_hmbs_submission.ipynb | Prepare submission for the CAGI6 challenge. |
Submission files
Filename | Description |
---|---|
strokach_modelnumber_1.tsv | Predictions made using ELASPIC2 [1]. |
strokach_modelnumber_2.tsv | Predictions made using ProteinSolver [2]. |
strokach_modelnumber_3.tsv | Predictions made using ProtBert [3]. |
strokach_modelnumber_4.tsv | Predictions made using Rosetta's cartesian_ddg protocol [4]. |
strokach_modelnumber_5.tsv | Predictions made using ELASPIC3 [7f9826be] with AlphaFold [5] features for wildtype protein. |
For the Sherloc clinical classification challenge, we trained new models using both the provided training data and the mutations listed in the UniProt humsavar.txt file.
Overview: http://genomeinterpretation.org/cagi6-invitae.html.
Our submission: https://www.synapse.org/#!Synapse:syn26272013/files/.
Relevant notebooks
Name | Description |
---|---|
🗒 30_cagi6_sherloc.ipynb | Load the Sherloc dataset. |
🗒 30_humsavar.ipynb | Load the humsavar dataset. |
🗒 31_run_alphafold_wt.ipynb | Generate AlphaFold embeddings. |
🗒 31_run_msa_analysis.ipynb | Generate basic MSA features. |
🗒 31_run_protbert.ipynb | Generate ProtBert features. |
🗒 31_run_proteinsolver.ipynb | Generate ProteinSolver features. |
🗒 31_run_rosetta_ddg.ipynb | Generate Rosetta features. |
🗒 32_process_alphafold.ipynb | Process AlphaFold embeddings into features. |
🗒 37_cagi6_sherloc_combine_results.ipynb | Combine features generated using all methods for the Sherloc dataset. |
🗒 37_humsavar_combine_results.ipynb | Combine features generated using all methods for the humsavar dataset. |
🗒 38_cagi6_sherloc_train_model.ipynb | Train a machine learning model using Sherloc + humsavar data. |
🗒 39_cagi6_sherloc_finetune_model.ipynb | Finetune the trained machine learning model and perform feature elimination. |
🗒 40_cagi6_sherloc_submission.ipynb | Make predictions for the test dataset and prepare submission for the CAGI6 challenge. |
Submission files
Filename | Description |
---|---|
strokach_modelnumber_1.tsv | Predictions made using ELASPIC2 with AlphaFold [5] features for wildtype protein (trained using both Sherloc and humsavar data). All available AlphaFold embeddings were used by the model. |
strokach_modelnumber_2.tsv | Predictions made using ELASPIC2 with AlphaFold [5] features for wildtype protein (trained using both Sherloc and humsavar data). |
strokach_modelnumber_3.tsv | Predictions made using ELASPIC2 with AlphaFold [5] features for wildtype protein (trained only using Sherloc data). |
strokach_modelnumber_4.tsv | Predictions made using ELASPIC2 with AlphaFold [5] features for wildtype and mutant proteins (trained only using Sherloc data). |
strokach_modelnumber_5.tsv | Predictions made using ELASPIC2 [1]. |
strokach_modelnumber_6.tsv | Predictions made using AlphaFold [5]. |
- [1] Strokach et al. (2021). ELASPIC2 (EL2): Combining Contextualized Language Models and Graph Neural Networks to Predict Effects of Mutations. https://doi.org/10.1016/j.jmb.2021.166810
- [2] Strokach et al. (2020). Fast and Flexible Protein Design Using Deep Graph Neural Networks. https://doi.org/10.1016/j.cels.2020.08.016
- [3] Elnaggar et al. (2020). ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing. https://doi.org/10.1101/2020.07.12.199554
- [4] Park et al. (2016). Simultaneous Optimization of Biomolecular Energy Functions on Features from Small Molecules and Macromolecules. https://doi.org/10.1021/acs.jctc.6b00819
- [5] Jumper et al. (2021). Highly accurate protein structure prediction with AlphaFold. https://doi.org/10.1038/s41586-021-03819-2