mariacmartins/plpred

Plpred

By Maria Clara Martins

About the project:

A protein subcellular location prediction program (based on Machine Learning models). 🧬

Find out whether a protein is located in the membrane or in the cytoplasm!

Available at: https://mcm-plpred.herokuapp.com/

Web application developed with Flask.

📁 Project Structure:

  • environment.yml: Conda environment configuration file.
  • requirements.txt: Python libraries required by the project.
  • Makefile: Defines rules that centralize and run the main commands.
  • plpred: Main package directory, containing the application functions.
  • data/: Data directory. Raw data are saved in data/raw, preprocessed data in data/processed, and trained models in data/models (models are serialized using pickle).
  • plpred/models: Provides predictive models based on Random Forest, Gradient Boosting, Neural Networks (MLP) and SVM.
  • tests/: Set of unit tests for the Plpred components.
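Since trained models are stored with pickle, the round trip can be sketched as follows. This is only an illustration: `MajorityClassModel`, `save_model`, `load_model` and the file name `model.pickle` are hypothetical names, not part of Plpred, and the stand-in class simply takes the place of a fitted scikit-learn estimator, which can be pickled the same way.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


class MajorityClassModel:
    """Hypothetical stand-in for a trained estimator (not part of Plpred)."""

    def fit(self, X, y):
        # Remember the most frequent label seen during training.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Predict the remembered label for every input row.
        return [self.label_ for _ in X]


def save_model(model, path):
    """Serialize a fitted model to disk with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a pickled model. Only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


model = MajorityClassModel().fit(
    [[0.1], [0.2], [0.9]], ["membrane", "membrane", "cytoplasm"]
)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pickle"  # hypothetical file name
    save_model(model, model_path)
    restored = load_model(model_path)

predictions = restored.predict([[0.5], [0.7]])
print(predictions)
```

Pickle files can execute arbitrary code when loaded, so a model file should only be loaded if it comes from a source you trust.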

Running locally (Setup):

Clone the repository and run:
$ conda install make    # Windows only; "make" is normally already available on macOS and Linux
$ make setup
$ make server
You can then view the application at: http://localhost:8000/

👨‍💻 Command line interface (CLI):

plpred-preprocess:

usage: plpred-preprocess [-h] -m MEMBRANE_PROTEINS -c CYTOPLASM_PROTEINS -o OUTPUT

plpred-preprocess: data preprocessing tool

optional arguments:
  -h, --help            show this help message and exit
  -m MEMBRANE_PROTEINS, --membrane_proteins MEMBRANE_PROTEINS
                        path to the file containing membrane proteins (.fasta)
  -c CYTOPLASM_PROTEINS, --cytoplasm_proteins CYTOPLASM_PROTEINS
                        path to the file containing cytoplasm proteins (.fasta)
  -o OUTPUT, --output OUTPUT
                        path to the output file (.csv)
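To give an idea of what a preprocessing step of this shape could look like, here is a minimal sketch that reads two FASTA streams and writes one labeled CSV row per protein. It is not the actual Plpred implementation: the choice of amino acid composition as features, the column names, and the helpers `read_fasta`, `composition` and `preprocess` are all assumptions made for illustration.

```python
import csv
import io

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    sequence = sequence.upper()
    length = len(sequence) or 1  # avoid dividing by zero on empty records
    return [sequence.count(aa) / length for aa in AMINO_ACIDS]


def preprocess(membrane_handle, cytoplasm_handle, output_handle):
    """Write a header row plus one labeled feature row per protein."""
    writer = csv.writer(output_handle)
    writer.writerow(["id", *AMINO_ACIDS, "label"])  # assumed column layout
    sources = (("membrane", membrane_handle), ("cytoplasm", cytoplasm_handle))
    for label, handle in sources:
        for header, sequence in read_fasta(handle):
            identifier = header.split()[0] if header.strip() else "unknown"
            writer.writerow([identifier, *composition(sequence), label])


# Tiny in-memory inputs standing in for the two .fasta files.
membrane = io.StringIO(">P1 example\nAAAL\nLLLL\n")
cytoplasm = io.StringIO(">P2 example\nMKKK\n")
output = io.StringIO()
preprocess(membrane, cytoplasm, output)
rows = list(csv.reader(io.StringIO(output.getvalue())))
```

In-memory streams are used here so the sketch runs without any files; with real data the same functions would be given file objects opened on the paths passed to `-m`, `-c` and `-o`.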

plpred-train:

usage: plpred-train [-h] -p PROCESSED_DATASET -o OUTPUT [-r]
                    [-a {random_forest,neural_network,gradient_boosting,svm}]

plpred-train: model training tool

optional arguments:
  -h, --help            show this help message and exit
  -p PROCESSED_DATASET, --processed_dataset PROCESSED_DATASET
                        processed dataset generated by plpred-preprocess (.csv)
  -o OUTPUT, --output OUTPUT
                        path to the output trained model (.pickle)
  -r, --report          show classification report
  -a {random_forest,neural_network,gradient_boosting,svm}, --algorithm {random_forest,neural_network,gradient_boosting,svm}
                        machine learning algorithm
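For readers who want to see how an interface like this is typically declared, the following `argparse` sketch reproduces the options listed in the help text above. It is a reconstruction from that help text, not the project's source code; in particular, the default of `random_forest` for `-a` is an assumption, and the paths in the example call are hypothetical.

```python
import argparse

ALGORITHMS = ["random_forest", "neural_network", "gradient_boosting", "svm"]


def build_parser():
    """Build a parser mirroring the plpred-train options shown above."""
    parser = argparse.ArgumentParser(
        prog="plpred-train",
        description="plpred-train: model training tool",
    )
    parser.add_argument(
        "-p", "--processed_dataset", required=True,
        help="processed dataset generated by plpred-preprocess (.csv)",
    )
    parser.add_argument(
        "-o", "--output", required=True,
        help="path to the output trained model (.pickle)",
    )
    parser.add_argument(
        "-r", "--report", action="store_true",
        help="show classification report",
    )
    parser.add_argument(
        "-a", "--algorithm", choices=ALGORITHMS,
        default="random_forest",  # assumed default, not confirmed by the help text
        help="machine learning algorithm",
    )
    return parser


# Hypothetical paths, parsed from a list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["-p", "data/processed/processed.csv", "-o", "data/models/model.pickle",
     "-r", "-a", "svm"]
)
print(args.algorithm, args.report)
```

Restricting `-a` with `choices` lets `argparse` reject unknown algorithm names before any training starts.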

plpred-predict:

usage: plpred-predict [-h] -i INPUT -o OUTPUT -m MODEL

plpred-predict: subcellular location prediction tool

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        input file (.fasta)
  -o OUTPUT, --output OUTPUT
                        output file (.csv)
  -m MODEL, --model MODEL
                        trained model (.pickle)

plpred-server:

usage: plpred-server [-h] -H HOST -p PORT -m MODEL

plpred-server: subcellular location prediction server

optional arguments:
  -h, --help            show this help message and exit
  -H HOST, --host HOST  host adress
  -p PORT, --port PORT  host port
  -m MODEL, --model MODEL
                        trained model to be deployed

Machine Learning - Models description:

RandomForestClassifier (default): A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.

GradientBoostingClassifier: Gradient boosting builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.

MLPClassifier: Multi-layer Perceptron classifier. This model optimizes the log-loss function using LBFGS or stochastic gradient descent.

SVC (C-Support Vector Classification): The implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples.
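The four model families described above correspond to scikit-learn classes, and a side-by-side comparison can be sketched as below. This is an illustration under stated assumptions: the mapping from the `-a` option names to classes is inferred from this README, the data is synthetic (generated with `make_classification`, not real protein features), and all hyperparameters are left at scikit-learn defaults apart from a fixed `random_state`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Assumed mapping from algorithm names to scikit-learn estimator classes.
ESTIMATORS = {
    "random_forest": RandomForestClassifier,
    "gradient_boosting": GradientBoostingClassifier,
    "neural_network": MLPClassifier,
    "svm": SVC,
}

# Synthetic two-class data standing in for a preprocessed dataset.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

reports = {}
for name, estimator_class in ESTIMATORS.items():
    # Fit each model on the training split and score it on the held-out split.
    model = estimator_class(random_state=0).fit(X_train, y_train)
    reports[name] = classification_report(y_test, model.predict(X_test))

print(reports["random_forest"])
```

On a dataset this small the MLP may emit a convergence warning; that does not stop the run, and raising `max_iter` is the usual remedy.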
