Protein Function Prediction method based on predicted contact information.
Install the required software first.
- software
- python3
- tested on python 3.6.7
- trRosetta
- HHBlits
- GR-align
- seqkit
- GNU parallel
- sequence databases
- swissprot
- Uniclust30
- required to generate MSA with HHblits
In these steps, all proteins in swissprot are converted into contact graphs.
Simply run DB/01-preprocessing.sh after editing the swissprot paths.
The script performs the following steps:
- take out unique sequences from swissprot
- filter sequences by length (L = 20 to 2000)
- split into single FASTA files
- by default, the files are split across several directories, each holding up to 1000 files.
- the duplicates are saved to a file so that we can "extend" to them afterwards.
You will find about 470K FASTA files in the FASTA/\d{3} directories.
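The preprocessing steps above can be sketched in Python as follows. This is only an illustration of the logic, not the contents of DB/01-preprocessing.sh: the function name `preprocess`, the zero-based directory numbering, and the `<id>.fasta` file naming are assumptions made for the example.

```python
from collections import OrderedDict


def preprocess(records, min_len=20, max_len=2000, per_dir=1000):
    """Deduplicate, length-filter, and assign output paths.

    records: iterable of (seq_id, sequence) pairs.
    Returns (placement, duplicates), where placement maps each kept ID to a
    hypothetical output path and duplicates maps a representative ID to the
    IDs that share its sequence.
    """
    first_seen = OrderedDict()  # sequence -> ID of its first occurrence
    duplicates = {}             # representative ID -> list of duplicate IDs
    for seq_id, seq in records:
        if seq in first_seen:
            duplicates.setdefault(first_seen[seq], []).append(seq_id)
        else:
            first_seen[seq] = seq_id

    # Keep only unique sequences whose length is within [min_len, max_len].
    kept = [sid for seq, sid in first_seen.items()
            if min_len <= len(seq) <= max_len]

    # Assumed layout: FASTA/000, FASTA/001, ... with up to per_dir files each.
    placement = {}
    for index, sid in enumerate(kept):
        placement[sid] = "FASTA/{:03d}/{}.fasta".format(index // per_dir, sid)
    return placement, duplicates
```

With about 470K sequences and 1000 files per directory, this layout needs roughly 470 directories, which fits the three-digit FASTA/\d{3} pattern.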
Generate an MSA file (.a3m) for each protein using HHblits.
Running DB/10-msa.sh generates all of the files, but it takes a very long time.
You will likely need a compute cluster and about 2 TB of disk space to store all the .a3m files.
Using these MSAs, trRosetta calculates the distance predictions.
Running DB/20-tr.sh runs the prediction for all of the proteins.
This step is much faster on GPUs and generates ~38 TB of .npz files in total.
Since the output of trRosetta is a probability for each distance bin, DB/30-convert.sh converts it into a binary contact graph using a distance cutoff.
The distance cutoff is set to 12 Angstrom, but you can change it to another value.
After this step, you will have a database that contains 2 files (. and .) for each protein.
These files and dup.txt are used in the prediction step.
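The conversion from distance-bin probabilities to a binary contact graph can be sketched as below. This is a minimal illustration, not the logic of DB/30-convert.sh. It assumes the bin layout described for trRosetta (index 0 is a "no contact" bin, followed by 36 bins of 0.5 Angstrom covering 2 to 20 Angstrom), and it assumes a contact is called when the summed probability of the bins at or below the cutoff reaches `prob_threshold`. The function names and the 0.5 threshold are hypothetical.

```python
def bin_upper_edges(n_bins=36, start=2.0, width=0.5):
    """Upper edge (in Angstrom) of each distance bin, under the assumed layout."""
    return [start + width * (i + 1) for i in range(n_bins)]


def to_contact_map(dist_probs, cutoff=12.0, prob_threshold=0.5):
    """Turn per-pair bin probabilities into a binary contact matrix.

    dist_probs[i][j] is a list of probabilities; index 0 is assumed to be the
    'no contact' bin and the rest are the distance bins in increasing order.
    """
    n_distance_bins = len(dist_probs[0][0]) - 1
    edges = bin_upper_edges(n_distance_bins)
    # Number of distance bins that lie entirely at or below the cutoff.
    n_close = sum(1 for edge in edges if edge <= cutoff)

    size = len(dist_probs)
    contacts = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue  # no self-contacts
            # Probability that the pair is closer than the cutoff.
            p_close = sum(dist_probs[i][j][1:1 + n_close])
            contacts[i][j] = 1 if p_close >= prob_threshold else 0
    return contacts
```

Under these assumptions, a 12 Angstrom cutoff sums the first 20 distance bins (2 to 12 Angstrom). Changing `cutoff` changes how many bins contribute.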
Now you are all set!
If you have a single FASTA query named query.fasta, you can run the command below to make a prediction.
$ prediction/predict.sh query.fasta
The script runs the following steps:
- HHblits
- trRosetta
- conversion into a graph
- ranking by GR-align
- post-processing
When it finishes, you will find the result in output/[query]/[query].prediction.
It is a 3-column table showing the GO ID, GO category, and confidence score.
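A result file of this shape can be read with a few lines of Python. The sketch below assumes the three columns are separated by whitespace and that the third column parses as a number. Neither detail is stated above, and `parse_predictions` is a hypothetical helper, not part of the package.

```python
from collections import namedtuple

Prediction = namedtuple("Prediction", ["go_id", "category", "score"])


def parse_predictions(lines):
    """Parse 3-column lines (GO ID, GO category, confidence score).

    Lines that do not have exactly three fields, or whose third field is not
    numeric (for example a header), are skipped. The result is sorted by
    descending score.
    """
    results = []
    for line in lines:
        fields = line.split()  # assumed whitespace- or tab-separated
        if len(fields) != 3:
            continue
        try:
            score = float(fields[2])
        except ValueError:
            continue
        results.append(Prediction(fields[0], fields[1], score))
    return sorted(results, key=lambda p: p.score, reverse=True)
```

For example, passing an open file handle for the .prediction file to `parse_predictions` returns the predictions with the highest-confidence GO term first.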
Yuki Kagaya, et al., "ContactPFP: Protein function prediction using predicted contact information." (in preparation)