Diversifying the pipeline for identifying bulk RNA-seq derived biomarkers of cancer within single cell populations

Hackathon team: Sara Grimm, Jason Wang, Miko Liu, Matt Bernstein

Background and Objective

Previous work by Matt Bernstein (and others) explored scRNA-seq data of 8 high-grade glioma tumor samples from "Single-cell transcriptome analysis of lineage diversity in high-grade glioma" by Yuan et al (PMID: 30041684) paper. See their repository for their results.

Our primary objective is to extend their analysis by developing a method to stratify cells in a given scRNA-seq dataset according to malignancy status. For this stratification we rely primarily on a CNV (copy number variation) metric.

We also hoped to (a) identify gene markers or gene sets correlated with malignancy status, and (b) associate specific cell types from clustered scRNA-seq data with malignancy status. Unfortunately, despite several attempted methods, we were not successful in these endeavors.

Workflow

Workflow Steps and Code Bits

Step 1: Seurat

For each tumor sample, run Seurat to extract the following: normalized read counts (SCT) per gene per cell, assigned cluster per cell, UMAP coordinates, average expression per gene per cluster, and marker genes per cluster. The R workflow using Seurat can be found at working_data/seurat_simple/{sampleID}/protocol-{sampleID}.Rtxt

Step 2: Process inferCNV data.

(Thanks to Matt B for providing inferCNV outputs for each of the 8 tumor samples using InferCNV.) From the ~.observations.txt and ~.references.txt files output from inferCNV, the summed and average |locusScore-medianScore| were calculated over all available loci to provide aggregate CNV metrics per cell.

  id = samples[S];
  txtfile=paste(id, ".aggr_medianDelta_per_cell.txt", sep="");
  out=paste("cellID", "cellGroup", "aggregate", "locusCt", "median", sep="\t");
  write.table(out, file=txtfile, sep="\t", quote=F, append=FALSE, row.names=F, col.names=F);
  infileR=paste(id, ".infercnv.references.txt", sep="");
  dataR=read.delim(infileR, header=TRUE, row.names=1);
  nC=ncol(dataR);  # cells
  nL=nrow(dataR);  # loci
  medianR=median(unlist(dataR));
  for (N in 1:nC) {
    dd=dataR[,N];
    delta=abs(dd-medianR);
    deltaSum=sum(delta);
    thiscell=colnames(dataR)[N];
    out=paste(thiscell, "reference", deltaSum, nL, medianR, sep="\t");
    write.table(out, file=txtfile, sep="\t", quote=F, append=TRUE, row.names=F, col.names=F);
  }
  infileT=paste(id, ".infercnv.observations.txt", sep="");
  dataT=read.delim(infileT, header=TRUE, row.names=1);
  nC=ncol(dataT);  # cells
  nL=nrow(dataT);  # loci
  medianT=median(unlist(dataT));
  for (N in 1:nC) {
    dd=dataT[,N];
    delta=abs(dd-medianT);
    deltaSum=sum(delta);
    thiscell=colnames(dataT)[N];
    out=paste(thiscell, "sample", deltaSum, nL, medianT, sep="\t");
    write.table(out, file=txtfile, sep="\t", quote=F, append=TRUE, row.names=F, col.names=F);
  }

Then the distribution of the average |locusScore-medianScore| results were plotted to see if the reference and tumor samples were different. (They are!) After repeating over the 8 tumors, we noted that the distribution of these scores in the reference cells are consistently below 0.02, so we are using this as the threshold for what we are confident(???) are non-malignant cells. These plots can be viewed in working_data/inferCNV_distr.

  id = samples[S];
  infile=paste(id, ".aggr_medianDelta_per_cell.txt", sep="");
  data=read.delim(infile, header=TRUE);
  ddR=subset(data, (data$cellGroup == "reference"));
  ddS=subset(data, (data$cellGroup == "sample"));
  loci=ddR[1,4];
  xmin=0; xmax=0.08;
  pngfile=paste(id, ".avg_medianDelta_per_cell.distr.png", sep="");
  png(pngfile, h=500, w=500, res=120);
  maintitle=paste(id, ": average |score-median|\n(", loci, " inferCNV loci)", sep="");
  plot(density(ddR$aggregate/loci), lwd=2, col="black", main=maintitle, xlab="", xlim=c(xmin,xmax), ylim=c(0,yy[S]));
  lines(density(ddS$aggregate/loci), lwd=2, col="limegreen");
  legend("topright", c("reference","sample"), col=c("black","limegreen"), pch=20);
  dev.off();

Step 3: Combine Seurat cell cluster and inferCNV data.

We then integrated the cell cluster assingments from Seurat with the aggregate inferCNV score we calculated, then visualized via UMAP to look at relative location of low/high CNV cells compared to cells with high expression of proliferation marker genes.

Screenshot Examples and Output

yellow: non-malignant
blue: malignant

Additional Visualizations

Annotation of cell clusters using PlangaoDB

Cells with aggregate inferCNV score in range of reference (assumed non-malignant (red)) cells.

Expression of selected markers

Future Directions

Moving forward, we would like to add gene set enrichment analysis to use enriched gene sets consistent with the caner phenotype, including proliferation and developmental pathways. This would provide a more rigorous statistical score for each cluster as opposed to the proliferation score currently used.
Furthermore, testing other tools for annotating gene clusters with their respective cell types would aid in identifying non-malignant clusters.
Examine specific genes for abnormal CNV from the inferred CNV data (TP53, MDM2, RTK, RAS, PI3K, RB, CDK4)
Use identified cancer cells on the scRNA-Seq dataset integrating all 8 donors to identify an improved boundary (such as using supporter vector machines) rather than the principal components 1 & 2 defined in the paper.

Dependencies

Seurat

Input data

Find the scRNA-seq data and inferred copy number variant data in input/data. Within working_data, several visualizations have been generated from the scRNA-Seq UMAPs mapping inferred CNV onto the sRNA-Seq UMAPs (umap_color_by_inferCNV_rank, umap_color_by_inferCNV_assumedNonMalignant), proliferation score, and biomarker expression.

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
batch_normalization		batch_normalization
cluster_annotation		cluster_annotation
input_data		input_data
working_data		working_data
Diversifying-the-pipeline-for-identifying-bulk-RNA-seq-derived-biomarkers-of-cancer-risk-within-sing.Rproj		Diversifying-the-pipeline-for-identifying-bulk-RNA-seq-derived-biomarkers-of-cancer-risk-within-sing.Rproj
LICENSE		LICENSE
README.md		README.md
kuang_seurat.rmd		kuang_seurat.rmd
quant.sf		quant.sf
seurat-all_markers.rds		seurat-all_markers.rds
seurat-all_markers.txt		seurat-all_markers.txt
seurat-integrated-downsampled_100.rds		seurat-integrated-downsampled_100.rds

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Diversifying the pipeline for identifying bulk RNA-seq derived biomarkers of cancer within single cell populations

Background and Objective

Workflow

Workflow Steps and Code Bits

Screenshot Examples and Output

Additional Visualizations

Annotation of cell clusters using PlangaoDB

Cells with aggregate inferCNV score in range of reference (assumed non-malignant (red)) cells.

Expression of selected markers

Future Directions

Dependencies

Input data

About

Releases

Packages

Contributors 3

License

NCBI-Codeathons/Diversifying-the-pipeline-for-identifying-bulk-RNA-seq-derived-biomarkers-of-cancer-risk-within-sing

Folders and files

Latest commit

History

Repository files navigation

Diversifying the pipeline for identifying bulk RNA-seq derived biomarkers of cancer within single cell populations

Background and Objective

Workflow

Workflow Steps and Code Bits

Screenshot Examples and Output

Additional Visualizations

Annotation of cell clusters using PlangaoDB

Cells with aggregate inferCNV score in range of reference (assumed non-malignant (red)) cells.

Expression of selected markers

Future Directions

Dependencies

Input data

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Packages