title | author | date |
---|---|---|
Single-cell RNA-seq: Marker identification |
Mary Piper, Lorena Pantano, Meeta Mistry, Radhika Khetani |
Thursday, June 6,2019 |
Approximate time: 45 minutes
- Understand how to determine markers of individual clusters
- Understand the iterative processes of clustering and marker identification
Now that we have identified our desired clusters, we can move on to marker identification, which will allow us to verify the identity of certain clusters and help surmise the identity of any unknown clusters.
Goals:
- To determine the gene markers for each of the clusters
- To identify cell types of each cluster using markers
- To determine whether need to re-cluster based on cell type markers, perhaps clusters need to be merged or split
Challenges:
- Over-interpretation of the results
- Combining different types of marker identification
Recommendations:
- Think of the results as hypotheses that need verification. Inflated p-values can lead to over-interpretation of results (essentially each cell is used as a replicate). Top markers are most trustworthy. Identify all markers conserved between conditions for each cluster
- Identify markers that are differentially expressed between specific clusters
Remember that we had the following questions from the clustering analysis:
- What is the cell type identity of clusters 17 and 20?
- Do the clusters corresponding to the same cell types have biologically meaningful differences? Are there subpopulations of these cell types?
- Can we acquire higher confidence in these cell type identities by identifying other marker genes for these clusters?
Identifying conserved markers allows for identifying only those genes the are significantly differentially expressed relative to the other clusters for all conditions. This function performs differential gene expression testing for a single cluster against all other clusters within each group and then combines the p-values using meta-analysis methods from the MetaDE R package.
Before we start our marker identification we need to explicitly set our default assay. Because of the nature of how FindConservedMarkers()
works (i.e finding DE within each group and then looking for conservation), we need to use the original counts and not the integrated data.
DefaultAssay(combined) <- "RNA"
This function is used on multiple samples in lieu of FindAllMarkers()
. You could run it on all clusters if you wanted to, but it takes a while to run, so we are just going to run it on the unknown clusters 17 and 20.
The function FindConservedMarkers()
, has the following structure:
FindConservedMarkers()
syntax:
FindConservedMarkers(seurat_obj,
ident.1 = cluster,
grouping.var = "group",
only.pos = TRUE)
The function accepts a single cluster at a time, so if we want to have the function run on all clusters, then we can use the map
family of functions to iterate across clusters.
Since these functions will remove our row names (gene names), we need to transfer the row names to columns before mapping across clusters. We also need a column specifying to which cluster the significant genes correspond.
To do that we will create our own function to:
- Run the
FindConservedMarkers()
function - Transfer row names to a column using
rownames_to_column()
function - Create the column of cluster IDs using the
cbind()
function
# Create function to get conserved markers for any given cluster
get_conserved <- function(cluster){
FindConservedMarkers(combined,
ident.1 = cluster,
grouping.var = "sample",
only.pos = TRUE) %>%
rownames_to_column(var = "gene") %>%
cbind(cluster_id = cluster, .)
}
Since we want the output of the map
family of functions to be a dataframe with each cluster output bound together by rows, we will use the map_dfr()
function.
Remember the map family of functions uses the following syntax:
map
family syntax:
map_dfr(inputs_to_function, name_of_function)
Now, let's find the conserved markers for clusters 17 and 20.
# Iterate function across desired clusters
conserved_markers <- map_dfr(c(17,20), get_conserved)
NOTE: If you wanted to run this on all clusters, you could input
0:20
instead ofc(17,20)
; however, it would take quite a while to run.
To better analyze the output, we can include the gene descriptions as well.
# Extract the gene descriptions for each gene
gene_descriptions <- unique(annotations[, c("gene_name", "description")])
# Merge gene descriptions with markers
ann_conserved_markers <- left_join(x = conserved_markers,
y = gene_descriptions,
by = c("gene" = "gene_name"))
For clusters 17 and 20, we see many of the conserved enriched genes encode inhibitory receptors, such as TIGIT and LAG3, which can be indicative of exhausted T cells.
The last set of questions we had regarding the analysis involved whether the clusters corresponding to the same cell types have biologically meaningful differences. Sometimes the list of markers returned don't sufficiently separate some of the clusters.
Again, we performed this analysis with the single samples, so we are going to perform it by completing the exercises below.
Exercises
-
Determine differentiating markers for CD8+ T cells - clusters 6 versus 10 - using the
FindMarkers()
function. -
Annotate the markers with gene descriptions.
-
Reorder the columns to be in the order shown below.
-
Arrange rows by
avg_logFC
values -
Save our rearranged marker analysis results to a file called
cluster6vs10_markers.csv
in theresults
folder. -
Based on these marker results, determine whether we need to separate clusters 6 and 10 as their own clusters.
-
Extra credit: Repeat above steps for the clusters assigned to
Naive CD4+ T cells
, in addition to repeating forNaive B cells
.
Now taking in all of this information, we can surmise the cell types of the different clusters and plot the cells with cell type labels.
Cell Type | Clusters |
---|---|
CD14+ Monocytes | 0 |
FCGR3A+ Monocytes | 7 |
Conventional dendritic cells | 13 |
Plasmacytoid dendritic cells | 18 |
Naive B cells | 4, 15 |
Activated B cells | 14 |
Naive CD4+ T cells | 1, 2, 9, 16 |
Activated CD4+ T cells | 3 |
Naive CD8+ T cells | 11 |
Activated (cytotoxic) CD8+ T cells | 6, 10 |
NK cells | 5 |
Megakaryocytes | 12 |
Erythrocytes | 19 |
Stressed/dying cells | 8 |
Exhausted T cells | 17 |
Proliferating unknown | 20 |
We can then reassign the identity of the clusters to these cell types:
combined <- RenameIdents(object = combined,
"0" = "CD14+ monocytes",
"1" = "Naive CD4+ T cells",
"2" = "Naive CD4+ T cells",
"3" = "Activated CD4+ T cells",
"4" = "Naive B cells",
"5" = "NK cells",
"6" = "Activated (cytotoxic) CD8+ T cells",
"7" = "FCGR3A+ Monocytes",
"8" = "Stressed/dying cells",
"9" = "Naive CD4+ T cells",
"10" = "Activated (cytotoxic) CD8+ T cells",
"11" = "Naive CD8+ T cells",
"12" = "Megakaryocytes",
"13" = "Conventional dendritic cells",
"14" = "Activated B cells",
"15" = "Naive B cells",
"16" = "Naive CD4+ T cells",
"17" = "Exhausted T cells",
"18" = "Plasmacytoid dendritic cells",
"19" = "Erythrocytes",
"20" = "Proliferating unknown")
DimPlot(object = combined,
reduction = "umap",
label = TRUE,
pt.size = 0.5,
repel = T,
label.size = 6)
Exercises
- Remove the stressed cells using the
subset()
function. - Visualize the clusters using
DimPlot()
. - Use the
write_rds()
function to save the final labelledcombined
object to theresults
folder, calledcombined_labelled_res0.8.rds
.
Now that we have our clusters defined and the markers for each of our clusters, we have a few different options:
- Experimentally validate intriguing markers for our identified cell types.
- Perform differential expression analysis between conditions
ctrl
andstim
- Biological replicates are necessary to proceed with this analysis.
- Trajectory analysis, or lineage tracing, could be performed if trying to determine the progression between cell types or cell states. For example, we could explore any of the following using this type of analysis:
- Differentiation processes
- Expression changes over time
- Cell state changes in expression
Explore the answer key for the exercises in this lesson
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
- A portion of these materials and hands-on activities were adapted from the Satija Lab's Seurat - Guided Clustering Tutorial