scGmix: A Pipeline for Single-Cell Gaussian Mixture Models

scGmix is a Python tool for discovering cell states in scRNA-seq datasets. The pipeline integrates three stages: data preprocessing with quality control and normalization; dimensionality reduction with principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), or uniform manifold approximation and projection (UMAP); and cell clustering with Gaussian Mixture Models (GMMs). The GMM component means can be initialized in two ways: from a pre-clustering step run with the tools scGmix provides, or automatically through a search with the integrated Optuna optimization library. Some components of the pipeline still require further tuning, but scGmix balances automation, user preferences, and interpretability, and we believe it is a valuable tool for users who want to identify cell states according to their own requirements.

A full technical report of the pipeline tested on synthetic single-cell datasets is available in Technical_Report.pdf.

Library Dependencies:

numpy
scanpy
anndata
matplotlib
seaborn
scikit-learn
kneed
pickle (part of the Python standard library; no installation required)
optuna

File Dependencies:

./utils/plotting.py
./utils/optimization.py

Please make sure you have these dependencies installed before running the pipeline.

Pipeline overview

Usage

To use the scGmix pipeline, follow the steps below:

Import the necessary libraries:

import pickle
import warnings

import numpy as np
import scanpy as sc
import anndata as ad  # conventional alias; leaves the name "adata" free for the AnnData object
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import SpectralClustering
from sklearn.mixture import GaussianMixture
from kneed import KneeLocator

warnings.filterwarnings("ignore")  # silences all warnings; remove this line while debugging

Import the required file dependencies:

from utils.optimization import optimizeGMM, optimizeSpectral
from utils.plotting import plot_bic, make_ellipses_joint, posterior_heatmap, plot_state_cellsums, plot_pca

Instantiate an scgmix object and provide the required inputs (here adata is the AnnData object holding your dataset):

pipeline = scgmix(adata, method="PCA", rand_seed=42)

Preprocess the data:

pipeline.preprocess(mads_away=5, feature_selection=False, min_mean=0.0125, max_mean=3, min_disp=0.5)
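
The mads_away=5 argument suggests that quality control flags cells whose QC metrics lie more than five median absolute deviations (MADs) from the median. How scGmix applies this internally is not shown here, so the following is only a minimal sketch of the general technique; the function name mad_outliers and the toy counts are hypothetical and not part of scGmix.

```python
import numpy as np

def mad_outliers(values, mads_away=5.0):
    """Return a boolean mask marking values more than `mads_away` MADs from the median.

    Hypothetical helper for illustration; not the scGmix implementation.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))  # median absolute deviation
    return np.abs(values - median) > mads_away * mad

# Made-up per-cell total counts; the last cell is an obvious outlier.
total_counts = np.array([980, 1010, 1000, 995, 1020, 990, 1005, 5000])
mask = mad_outliers(total_counts, mads_away=5)
kept = total_counts[~mask]  # cells that pass the filter
```

Note that when the MAD is zero (more than half of the values are identical), every value that differs from the median is flagged, so a real pipeline may want a guard for that case.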

Perform dimensionality reduction:

pipeline.dimreduction(n_pcs=100, pc_selection_method="screeplot", n_neighbors=15, min_dist=0.1,
                      use_highly_variable=False, variance_threshold=90, verbose=True, plot_result=False)
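
The variance_threshold=90 argument presumably keeps the smallest number of principal components whose cumulative explained variance reaches 90%. As a hedged illustration of that idea (not the scGmix code), the sketch below computes the count with a plain NumPy SVD; the function name n_pcs_for_variance and the toy matrix are assumptions made for this example.

```python
import numpy as np

def n_pcs_for_variance(X, variance_threshold=90.0):
    """Smallest number of PCs whose cumulative explained variance (in percent)
    reaches `variance_threshold`. Hypothetical helper for illustration."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)                       # centre each feature
    singular_values = np.linalg.svd(centred, compute_uv=False)
    explained = singular_values ** 2                   # proportional to per-PC variance
    cumulative = 100.0 * np.cumsum(explained) / explained.sum()
    # Index of the first PC at which the threshold is reached, converted to a count.
    n_pcs = int(np.searchsorted(cumulative, variance_threshold)) + 1
    return min(n_pcs, len(cumulative))                 # guard against rounding at 100%

# Toy matrix with orthogonal, zero-mean columns of decreasing variance.
toy = np.array([[ 2.0,  1.0,  0.1],
                [-2.0,  1.0, -0.1],
                [ 2.0, -1.0, -0.1],
                [-2.0, -1.0,  0.1]])
n_selected = n_pcs_for_variance(toy, variance_threshold=90)
```

In this toy matrix the first component explains roughly 80% of the variance and the first two together more than 99%, so two components are selected at a 90% threshold.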

Perform clustering:

pipeline.mix(preclustering_method="spectral", enable_preclustering=False, leiden_resolution=1.0,
             criterion="BIC", n_trials=100, verbose=True, max_iter=1000, max_num_components=5,
             user_means=None, show_progress_bar=True)
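
To make the roles of criterion="BIC", max_num_components and user_means more concrete, here is a minimal sketch of BIC-based model selection with scikit-learn's GaussianMixture. It is an assumption about the general approach, not the scGmix implementation: scGmix searches with Optuna, whereas this sketch uses a plain loop over component counts, and the function name fit_gmm_by_bic and the toy data are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_by_bic(X, max_num_components=5, user_means=None, max_iter=1000, rand_seed=42):
    """Return a fitted GaussianMixture.

    With `user_means`, fit one model initialised at those means.
    Otherwise try 1..max_num_components components and keep the lowest BIC.
    Hypothetical helper for illustration; a plain loop stands in for Optuna.
    """
    if user_means is not None:
        user_means = np.asarray(user_means, dtype=float)
        return GaussianMixture(n_components=len(user_means), means_init=user_means,
                               max_iter=max_iter, random_state=rand_seed).fit(X)
    best_model, best_bic = None, np.inf
    for k in range(1, max_num_components + 1):
        model = GaussianMixture(n_components=k, max_iter=max_iter,
                                random_state=rand_seed).fit(X)
        bic = model.bic(X)  # lower BIC means a better fit/complexity trade-off
        if bic < best_bic:
            best_model, best_bic = model, bic
    return best_model

# Toy 2-D embedding with two well-separated groups of 100 "cells" each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=(-5, 0), scale=0.5, size=(100, 2)),
               rng.normal(loc=(5, 0), scale=0.5, size=(100, 2))])

model = fit_gmm_by_bic(X, max_num_components=5)
posteriors = model.predict_proba(X)  # per-cell membership probabilities for each state

# Alternative: supply the component means directly.
seeded = fit_gmm_by_bic(X, user_means=[[-5, 0], [5, 0]])
seeded_labels = seeded.predict(X)
```

The posterior matrix is what makes a GMM attractive for cell states: each cell receives a probability for every state rather than a single hard label.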

Additional Methods

The scgmix class also provides additional utility methods:

pipeline.savefile(filenamepath)   # Save the processed data to a file.
pipeline.savemodel(filenamepath)  # Save the trained GMM model to a file.
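
Since pickle is listed among the dependencies, a saved model can presumably be written and read back with it. The exact file format used by savemodel is not documented here, so the sketch below only shows the generic pickle round trip; save_object, load_object and the stand-in dictionary are hypothetical names for illustration.

```python
import pickle
import tempfile
from pathlib import Path

def save_object(obj, filenamepath):
    """Serialise any picklable object (e.g. a fitted model) to disk."""
    with open(filenamepath, "wb") as handle:
        pickle.dump(obj, handle)

def load_object(filenamepath):
    """Load an object previously written with save_object."""
    with open(filenamepath, "rb") as handle:
        return pickle.load(handle)

# A plain dict stands in for a trained model in this example.
model_stub = {"n_components": 3, "criterion": "BIC"}

with tempfile.TemporaryDirectory() as tmpdir:
    path = Path(tmpdir) / "gmm_model.pkl"
    save_object(model_stub, path)
    restored = load_object(path)
```

Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.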

Repository: KyriakosPsa/single-cell-ML-pipeline
