By Maria Clara Martins
Find out whether a protein is located in the membrane or the cytoplasm!
Available at: https://mcm-plpred.herokuapp.com/
Web application developed with Flask.
environment.yml: Environment configuration file.
requirements.txt: Libs needed for the project.
Makefile: Create "rules" to centralize and execute main commands.
plpred: Main package directory, with application functions.
data/: Data directory. Raw data are saved in data/raw, preprocessed data in data/processed, and trained models are saved in data/models (models are serialized using pickle).
plpred/models: Provides predictive models based on Random Forest, Gradient Boosting, Neural Networks (MLP) and SVM.
tests/: Set of unit tests for Plpred components.
Clone the repository and run:
$ conda install make (Windows only, "make" comes by default in macOS and Linux)
$ make setup
$ make server
You can view the application at: http://localhost:8000/
usage: plpred-preprocess [-h] -m MEMBRANE_PROTEINS -c CYTOPLASM_PROTEINS -o OUTPUT
plpred-preprocess: data preprocessing tool
optional arguments:
-h, --help show this help message and exit
-m MEMBRANE_PROTEINS, --membrane_proteins MEMBRANE_PROTEINS
path to the file containing membrane proteins (.fasta)
-c CYTOPLASM_PROTEINS, --cytoplasm_proteins CYTOPLASM_PROTEINS
path to the file containing cytoplasm proteins (.fasta)
-o OUTPUT, --output OUTPUT
path to the output file (.csv)
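As an illustration of what a preprocessing step of this kind might do, the sketch below parses FASTA text with the standard library, computes the fraction of each of the 20 standard amino acids per sequence, and writes labeled rows to CSV. The feature set, the column names, and the helper names (`read_fasta`, `composition`, `build_rows`, `write_csv`) are assumptions made for this example; they are not taken from the plpred source.

```python
import csv
import io

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    length = len(sequence) or 1  # avoid division by zero on empty records
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]


def build_rows(handle, label):
    """Turn every FASTA record into a feature row ending with its label."""
    return [composition(seq) + [label] for _, seq in read_fasta(handle)]


def write_csv(rows, handle):
    """Write the header line and the feature rows as CSV."""
    writer = csv.writer(handle)
    writer.writerow(list(AMINO_ACIDS) + ["membrane"])  # assumed column names
    writer.writerows(rows)


# Tiny in-memory inputs standing in for the two .fasta files.
membrane = io.StringIO(">m1\nAAAL\n")
cytoplasm = io.StringIO(">c1\nKK\nEE\n")
rows = build_rows(membrane, 1) + build_rows(cytoplasm, 0)

output = io.StringIO()
write_csv(rows, output)
```

With real data, the two `io.StringIO` objects would be replaced by files opened from the paths given to `-m` and `-c`, and `output` by the file given to `-o`.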
usage: plpred-train [-h] -p PROCESSED_DATASET -o OUTPUT [-r]
[-a {random_forest,neural_network,gradient_boosting,svm}]
plpred-train: model training tool
optional arguments:
-h, --help show this help message and exit
-p PROCESSED_DATASET, --processed_dataset PROCESSED_DATASET
processed dataset generated by plpred-preprocess (.csv)
-o OUTPUT, --output OUTPUT
path to the output trained model (.pickle)
-r, --report show classification report
-a {random_forest,neural_network,gradient_boosting,svm}, --algorithm {random_forest,neural_network,gradient_boosting,svm}
machine learning algorithm
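The interface shown in this help text could be declared with `argparse` roughly as follows. This is a reconstruction from the usage message only, not the project's actual code; in particular, the `random_forest` default for `--algorithm` is an assumption, and the paths in the demonstration are hypothetical.

```python
import argparse

ALGORITHMS = ["random_forest", "neural_network", "gradient_boosting", "svm"]


def build_parser():
    """Build a parser mirroring the plpred-train usage message."""
    parser = argparse.ArgumentParser(
        prog="plpred-train",
        description="plpred-train: model training tool",
    )
    parser.add_argument(
        "-p", "--processed_dataset", required=True,
        help="processed dataset generated by plpred-preprocess (.csv)",
    )
    parser.add_argument(
        "-o", "--output", required=True,
        help="path to the output trained model (.pickle)",
    )
    parser.add_argument(
        "-r", "--report", action="store_true",
        help="show classification report",
    )
    parser.add_argument(
        "-a", "--algorithm", choices=ALGORITHMS,
        default="random_forest",  # assumed default, not confirmed by the help text
        help="machine learning algorithm",
    )
    return parser


# Hypothetical paths, used only to demonstrate parsing.
args = build_parser().parse_args(
    ["-p", "data/processed/processed.csv", "-o", "data/models/model.pickle", "-r"]
)
```

Passing a value outside the listed choices to `-a` makes `argparse` exit with an error, which matches the braces shown in the usage line.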
usage: plpred-predict [-h] -i INPUT -o OUTPUT -m MODEL
plpred-predict: subcellular location prediction tool
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
input file (.fasta)
-o OUTPUT, --output OUTPUT
output file (.csv)
-m MODEL, --model MODEL
trained model (.pickle)
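Taken together, the three tools form a pipeline: preprocess, train, predict. The sketch below only assembles the corresponding command lines from Python using the flags documented above; the file names are hypothetical, and actually running the commands (for example with `subprocess.run`) requires the tools to be installed.

```python
import shlex


def pipeline_commands(membrane, cytoplasm, dataset, model, query, predictions):
    """Return the three command lines, in the order they would be run."""
    return [
        ["plpred-preprocess", "-m", membrane, "-c", cytoplasm, "-o", dataset],
        ["plpred-train", "-p", dataset, "-o", model, "-a", "random_forest", "-r"],
        ["plpred-predict", "-i", query, "-o", predictions, "-m", model],
    ]


# Hypothetical file names placed in the directories described earlier.
commands = pipeline_commands(
    membrane="data/raw/membrane.fasta",
    cytoplasm="data/raw/cytoplasm.fasta",
    dataset="data/processed/processed.csv",
    model="data/models/model.pickle",
    query="query.fasta",
    predictions="predictions.csv",
)

for command in commands:
    print(shlex.join(command))  # pass `command` to subprocess.run to execute it
```

Each step consumes the file produced by the previous one: the CSV written by plpred-preprocess is the input of plpred-train, and the pickled model written by plpred-train is the input of plpred-predict.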
usage: plpred-server [-h] -H HOST -p PORT -m MODEL
plpred-server: subcellular location prediction server
optional arguments:
-h, --help show this help message and exit
-H HOST, --host HOST host address
-p PORT, --port PORT host port
-m MODEL, --model MODEL
trained model to be deployed
RandomForestClassifier (default): A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
GradientBoostingClassifier: GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.
MLPClassifier: Multi-layer Perceptron classifier. This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
SVC: C-Support Vector Classification. The implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples.
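One way to connect the `--algorithm` choices to these four estimators is a small lookup table. The sketch below assumes scikit-learn is installed; the `ALGORITHMS` mapping, the `train` helper, and the hyperparameters are illustrative assumptions rather than plpred's actual code, and synthetic data stands in for the processed dataset.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hypothetical mapping from CLI choice to an estimator factory.
ALGORITHMS = {
    "random_forest": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": lambda: GradientBoostingClassifier(random_state=0),
    "neural_network": lambda: MLPClassifier(max_iter=500, random_state=0),
    "svm": lambda: SVC(),
}


def train(algorithm, X, y):
    """Fit the estimator registered under `algorithm` and return it."""
    model = ALGORITHMS[algorithm]()
    model.fit(X, y)
    return model


# Synthetic two-class data with 20 features, standing in for real features.
X, y = make_classification(n_samples=80, n_features=20, random_state=0)

models = {name: train(name, X, y) for name in ALGORITHMS}

# Round-trip one model through pickle, the format used for data/models.
restored = pickle.loads(pickle.dumps(models["random_forest"]))
```

Keeping the estimators behind factories means each call to `train` starts from a fresh, unfitted model, so repeated runs do not share state.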