Skip to content

NCBI-Codeathons/Diversifying-the-pipeline-for-identifying-bulk-RNA-seq-derived-biomarkers-of-cancer-risk-within-sing

Repository files navigation

Diversifying the pipeline for identifying bulk RNA-seq derived biomarkers of cancer within single cell populations

Hackathon team: Sara Grimm, Jason Wang, Miko Liu, Matt Bernstein

Background and Objective

Previous work by Matt Bernstein (and others) explored scRNA-seq data of 8 high-grade glioma tumor samples from "Single-cell transcriptome analysis of lineage diversity in high-grade glioma" by Yuan et al (PMID: 30041684) paper. See their repository for their results.

Our primary objective is to extend their analysis by developing a method to stratify cells in a given scRNA-seq dataset according to malignancy status. For this stratification we rely primarily on a CNV (copy number variation) metric.

We also hoped to (a) identify gene markers or gene sets correlated with malignancy status, and (b) associate specific cell types from clustered scRNA-seq data with malignancy status. Unfortunately, despite several attempted methods, we were not successful in these endeavors.

Workflow

image

Workflow Steps and Code Bits

Step 1: Seurat

For each tumor sample, run Seurat to extract the following: normalized read counts (SCT) per gene per cell, assigned cluster per cell, UMAP coordinates, average expression per gene per cluster, and marker genes per cluster. The R workflow using Seurat can be found at working_data/seurat_simple/{sampleID}/protocol-{sampleID}.Rtxt

Step 2: Process inferCNV data.

(Thanks to Matt B for providing inferCNV outputs for each of the 8 tumor samples using InferCNV.) From the ~.observations.txt and ~.references.txt files output from inferCNV, the summed and average |locusScore-medianScore| were calculated over all available loci to provide aggregate CNV metrics per cell.

  id = samples[S];
  txtfile=paste(id, ".aggr_medianDelta_per_cell.txt", sep="");
  out=paste("cellID", "cellGroup", "aggregate", "locusCt", "median", sep="\t");
  write.table(out, file=txtfile, sep="\t", quote=F, append=FALSE, row.names=F, col.names=F);
  infileR=paste(id, ".infercnv.references.txt", sep="");
  dataR=read.delim(infileR, header=TRUE, row.names=1);
  nC=ncol(dataR);  # cells
  nL=nrow(dataR);  # loci
  medianR=median(unlist(dataR));
  for (N in 1:nC) {
    dd=dataR[,N];
    delta=abs(dd-medianR);
    deltaSum=sum(delta);
    thiscell=colnames(dataR)[N];
    out=paste(thiscell, "reference", deltaSum, nL, medianR, sep="\t");
    write.table(out, file=txtfile, sep="\t", quote=F, append=TRUE, row.names=F, col.names=F);
  }
  infileT=paste(id, ".infercnv.observations.txt", sep="");
  dataT=read.delim(infileT, header=TRUE, row.names=1);
  nC=ncol(dataT);  # cells
  nL=nrow(dataT);  # loci
  medianT=median(unlist(dataT));
  for (N in 1:nC) {
    dd=dataT[,N];
    delta=abs(dd-medianT);
    deltaSum=sum(delta);
    thiscell=colnames(dataT)[N];
    out=paste(thiscell, "sample", deltaSum, nL, medianT, sep="\t");
    write.table(out, file=txtfile, sep="\t", quote=F, append=TRUE, row.names=F, col.names=F);
  }

Then the distribution of the average |locusScore-medianScore| results were plotted to see if the reference and tumor samples were different. (They are!) After repeating over the 8 tumors, we noted that the distribution of these scores in the reference cells are consistently below 0.02, so we are using this as the threshold for what we are confident(???) are non-malignant cells. These plots can be viewed in working_data/inferCNV_distr.

  id = samples[S];
  infile=paste(id, ".aggr_medianDelta_per_cell.txt", sep="");
  data=read.delim(infile, header=TRUE);
  ddR=subset(data, (data$cellGroup == "reference"));
  ddS=subset(data, (data$cellGroup == "sample"));
  loci=ddR[1,4];
  xmin=0; xmax=0.08;
  pngfile=paste(id, ".avg_medianDelta_per_cell.distr.png", sep="");
  png(pngfile, h=500, w=500, res=120);
  maintitle=paste(id, ": average |score-median|\n(", loci, " inferCNV loci)", sep="");
  plot(density(ddR$aggregate/loci), lwd=2, col="black", main=maintitle, xlab="", xlim=c(xmin,xmax), ylim=c(0,yy[S]));
  lines(density(ddS$aggregate/loci), lwd=2, col="limegreen");
  legend("topright", c("reference","sample"), col=c("black","limegreen"), pch=20);
  dev.off();

Step 3: Combine Seurat cell cluster and inferCNV data.

We then integrated the cell cluster assingments from Seurat with the aggregate inferCNV score we calculated, then visualized via UMAP to look at relative location of low/high CNV cells compared to cells with high expression of proliferation marker genes.

Screenshot Examples and Output

yellow: non-malignant
blue: malignant

PJ016 UMAP-byRankICNVUMAP_proliferation_PJ016

PJ017 UMAP-byRankICNVUMAP_proliferation_PJ017

PJ018 UMAP-byRankICNVUMAP_proliferation_PJ018

PJ025 UMAP-byRankICNVUMAP_proliferation_PJ025

PJ030 UMAP-byRankICNVUMAP_proliferation_PJ030

PJ032 UMAP-byRankICNVUMAP_proliferation_PJ032

PJ035 UMAP-byRankICNVUMAP_proliferation_PJ035

PJ048 UMAP-byRankICNVUMAP_proliferation_PJ048

Additional Visualizations

Annotation of cell clusters using PlangaoDB

image

Cells with aggregate inferCNV score in range of reference (assumed non-malignant (red)) cells.

image

Expression of selected markers

biomarker_expression-SOX2 biomarker_expression-OLIG2

Future Directions

  • Moving forward, we would like to add gene set enrichment analysis to use enriched gene sets consistent with the caner phenotype, including proliferation and developmental pathways. This would provide a more rigorous statistical score for each cluster as opposed to the proliferation score currently used.
  • Furthermore, testing other tools for annotating gene clusters with their respective cell types would aid in identifying non-malignant clusters.
  • Examine specific genes for abnormal CNV from the inferred CNV data (TP53, MDM2, RTK, RAS, PI3K, RB, CDK4)
  • Use identified cancer cells on the scRNA-Seq dataset integrating all 8 donors to identify an improved boundary (such as using supporter vector machines) rather than the principal components 1 & 2 defined in the paper.

Dependencies

  • Seurat

Input data

Find the scRNA-seq data and inferred copy number variant data in input/data. Within working_data, several visualizations have been generated from the scRNA-Seq UMAPs mapping inferred CNV onto the sRNA-Seq UMAPs (umap_color_by_inferCNV_rank, umap_color_by_inferCNV_assumedNonMalignant), proliferation score, and biomarker expression.

About

No description or website provided.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •