-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathMOFA_scNMT_tutorial_no_solutions.Rmd
448 lines (343 loc) · 18 KB
/
MOFA_scNMT_tutorial_no_solutions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
---
title: "MOFA applied to the gastrulation scNMT-seq data set"
author:
- name: "Ricard Argelaguet"
affiliation: "European Bioinformatics Institute, Cambridge, UK"
email: "ricard@ebi.ac.uk"
- name: "Britta Velten"
affiliation: "German Cancer Research Center, Heidelberg, Germany"
email: "b.velten@dkfz-heidelberg.de"
date: "`r Sys.Date()`"
output:
BiocStyle::html_document:
toc: true
toc_depth: 1
---
# Description
This tutorial demonstrates how to use MOFA for the integration of single-cell multi-omics data. We consider a dataset where scNMT-seq was used to simultaneously profile RNA expression, DNA methylation and chromatin accessibility in 1,828 cells at multiple stages of mouse development. MOFA provides a method for delineating coordinated variation between the transcriptome and the epigenome.
The data set we use here is a simplified version of the original data set (only E7.5 cells) published in [Nature](https://www.nature.com/articles/s41586-019-1825-8). The full data set can be downloaded from [this FTP](http://ftp.ebi.ac.uk/pub/databases/scnmt_gastrulation).
In this vignette we skip all the data processing details as well as the model training part. We focus on the downstream characterisation of a pretrained MOFA model. If you are interested in the code to train the model from the data linked above, have a look at the scripts in the [train_MOFA](https://github.com/PMBio/SingleCellCourse/tree/master/day2/MOFA/train_model) folder.
```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE, fig.align="center")
```
# Load libraries
Load dependencies. Make sure that MOFA2 is imported last, to avoid collisions with functions from other packages
```{r, message=FALSE}
library(data.table) # fast manipulation of data.frames
library(purrr) # pipes to make the code more readable
library(ggplot2)
library(MOFA2)
```
Define cell type colors for the visualisations
```{r}
colors <- c(
"Mesoderm" = "#CD3278",
"Endoderm" = "#43CD80",
"Ectoderm" = "steelblue"
)
```
# Load pre-computed MOFA model
As input to the model we quantified DNA methylation and chromatin accessibility values over two different sets of regulatory elements: gene promoters and enhancer elements. RNA expression was quantified over protein-coding genes. After data processing, separate views were defined for the RNA expression and for each combination of genomic context and epigenetic readout.
Note that for this tutorial we selected the MOFA Factors that explained at least 1% of variation in the RNA expression. The trained model can be found here [ftp://ftp.dkfz-heidelberg.de/outgoing/SCCourse2021/day_2_data/MOFAmodel.rds](ftp://ftp.dkfz-heidelberg.de/outgoing/SCCourse2021/day_2_data/MOFAmodel.rds).
```{r}
MOFAobject <- readRDS("gastrulation_data/MOFAmodel.rds")
# MOFAobject <- readRDS(url("ftp://ftp.dkfz-heidelberg.de/outgoing/SCCourse2021/day_2_data/MOFAmodel.rds"))
MOFAobject
```
Explore the cell metadata:
- **sample**: cell ID
- **stage**: developmental stage.
- **lineage**: cell type annotation (derived from mapping the cells to the [10x reference atlas](https://www.nature.com/articles/s41586-019-0933-9)).
- **pass_rnaQC**: did the cell pass QC for RNA expression?.
- **pass_metQC**: did the cell pass QC for DNA methylation? `NA` if the cell was only profiled for RNA.
- **pass_accQC**: did the cell pass QC for chromatin accessibility? `NA` if the cell was only profiled for RNA.
- **group**: ignore this column
```{r}
head(samples_metadata(MOFAobject))
```
**(Q) How many cells from each lineages are in the object?**
Notice that there a lot of cells that only have RNA expression measurements. One of the major advantages of MOFA is that it handles missing values, so we don't have to remove these cells prior to model training
**(Q) How many cells do not have DNA methylation measurements?**
# Overview of training data
The function `plot_data_overview` can be used to obtain an overview of the input data.
It shows how many views (rows) and how many cells (columns) exist, what are their corresponding dimensionalities are. It also shows which views each cell is missing.
```{r fig.align="center"}
view.colors <- c(
"RNA" = "#3CB54E",
"Enhancers accessibility" = "#00BFC4",
"Promoters accessibility" = "#00BFC4",
"Enhancers methylation" = "#F37A71",
"Promoters methylation" = "#F37A71"
)
view.colors = view.colors[views_names(MOFAobject)]
plot_data_overview(MOFAobject, colors = view.colors)
```
## Visualise RNA expression data.
Notice that RNA expression has already been normalised (using `scran`), but the distribution looks zero-inflated. This is a statistical challenge for *some* single-cell RNA-seq assays. To model this data we probably want to use a zero-inflated Gaussian distribution, but this not implemented in the MOFA framework. Instead, we used a Gaussian distribution, which should provide a decent approximation but it will likely underestimate the mean expression values.
```{r}
rna <- get_data(MOFAobject, views="RNA")[[1]][[1]]
hist(rna)
```
## Visualise DNA methylation and chromatin accessibility data.
As seen in the previous session, we use M-values instead of Beta-values.
However, when looking at the distribution of M-values we can notice that the epigenetic modalities are not well modeled by a Gaussian likelihood. Ideally this data modality should be modeled with a binomial distribution: for every cell $i$ and region $j$ the total number of CpGs correspond to the total number of trials and the number of methylated CpGs to the number of successes. Unfortunately, the binomial likelihood is not implemented in the MOFA framework and thus we need to calculate M-values that can be approximated with a Gaussian distribution.
```{r}
met <- get_data(MOFAobject, views="Promoters methylation")[[1]][[1]]
hist(met)
```
```{r}
acc <- get_data(MOFAobject, views="Promoters accessibility")[[1]][[1]]
hist(acc)
```
# Overview of the MOFA model
## Correlation between factors
A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit.
```{r}
plot_factor_cor(MOFAobject)
```
## Variance decomposition analysis
The most important insight that MOFA generates is the variance decomposition analysis using `plot_variance_explained`. This plot shows the percentage of variance explained by each factor in each data modality. It summarises the (latent) signal from a complex heterogeneous data set in a single figure.
```{r}
plot_variance_explained(MOFAobject) +
theme(
axis.text.x = element_text(angle=25, hjust=1, vjust=1.05)
)
```
**(Q) What insights from the data can we learn just from inspecting this plot?**
**(Q) What do you notice about promoters?**
# Characterisation of Factors
There are a few systematic strategies to characterise the molecular signal that underlies each MOFA factor:
- **Association analysis between the sample metadata and the Factor values**: function `correlate_factors_with_covariates`
- **Inspection of factor values**: functions `plot_factor` (one factor at a time) and `plot_factors` (combinations of factors)
- **Inspection of the feature weights**: functions `plot_weights` (all weights), `plot_top_weights` (only the top weights)
- **Gene set enrichment analysis on the mRNA weights**: functions `run_enrichment`, followed by `plot_enrichment`.
## Characterisation of Factor 1
### Visualisation of Factor values
Plotting Factor 1 values and colouring cells by lineage assignment cleartly shows that this factor captures the variation that is associated with the separation between Mesoderm (positive Factor values) and non-Mesoderm cells (negative Factor values).
```{r}
plot_factor(MOFAobject,
factor = 1,
color_by = "lineage",
add_violin = TRUE,
dodge = TRUE
) + scale_fill_manual(values=colors)
```
**How do we interpret the factor values?**
Each factor captures a different source of variability in the data. Mathematically, each Factor is defined by a linear combination of the input features. Each Factor ordinates cells along a one-dimensional axis that is centered at zero. Samples with different signs manifest opposite phenotypes along the inferred axis of variation, with higher absolute value indicating a stronger effect.
Note that the interpretation of MOFA factors is analogous to the interpretation of the principal components in PCA.
### Visualisation of RNA weights
The weights provide a score for each gene on each factor. Genes with no association with the factor are expected to have values close to zero, whereas genes with strong association with the factor are expected to have large absolute values. The sign of the weight indicates the direction of the effect: a positive weight indicates that the feature is more active in the cells with positive factor values, and viceversa.
Let's plot the distribution of weights for Factor 1.
```{r, warnings=FALSE, message=FALSE}
plot_weights(MOFAobject,
view = "RNA",
factor = 1,
nfeatures = 10, # Top number of features to highlight
scale = T # Scale weights from -1 to 1
)
```
If you are not interested in the full distribution, but just on the top weights, you can instead do:
```{r}
plot_top_weights(MOFAobject,
view = "RNA",
factor = 1,
nfeatures = 10,
scale = T,
abs = T
)
```
We expect that genes with large positive weights For Factor 1 to be highlighy expressed in the Mesoderm cells. If we plot Factor 1 colouring cells by gene expresion of the top genes **with positive weight**:
```{r}
genes <- c("Phlda2","Mesp1")
for (i in genes) {
plot_factor(MOFAobject,
factor = 1,
dot_size = 2.5,
group_by = "lineage",
color_by = i
) %>% print
}
```
Similarly, we expect that genes with large negative weights For Factor 1 to be lowly expressed in the Mesoderm cells. If we plot Factor 1 colouring cells by gene expresion of the top genes **with negative weight**:
```{r}
genes <- c("Cldn6","Pim2")
for (i in genes) {
plot_factor(MOFAobject,
factor = 1,
dot_size = 2.5,
group_by = "lineage",
color_by = i
) %>% print
}
```
## Visualisation of RNA expression patterns in the high-dimensional space
The weights are useful to identify which genes are driving each factors. After inspecting the weights it is good practice to go back to the high-dimensional space and check if the variability that MOFA captures is real.
For example, one could generate a heatmap plot of the RNA expression for the top genes, where samples are sorted by the corresponding factor values. This is the aim of the `plot_data_heatmap` function:
```{r}
plot_data_heatmap(MOFAobject,
view = "RNA",
factor = 1,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`,
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE
)
```
An interesting option of `plot_data_heatmap` is to plot "denoised" observations. This is obtained by reconstructing the data using the matrix factorisation equation from MOFA:
$$\hat{\mathbf{Y}}^m = \mathbf{W}^m\mathbf{Z}$$
where $\mathbf{W}^m$ is the weight matrix for the $m$-th view, and $\mathbf{Z}$ is the (shared) factor matrix.
This data reconstruction step essentially removes all the variation that is not captured by the model:
```{r}
plot_data_heatmap(MOFAobject,
view = "RNA",
factor = 1,
denoise = TRUE,
features = 25,
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F
)
```
## Visualisation of DNA methylation weights
As we have done with RNA, we can also visualise the distribution of weights for the epigenetic modalities. The problem about this is that the large majority of enhancers are not well annotated and we only have the genomic coordinates for them...
```{r}
plot_weights(MOFAobject,
view = c("Enhancers methylation"),
factor = 1,
nfeatures = 5,
scale = F
)
```
## Visualisation of DNA methylation patterns in the high-dimensional space
As done with the RNA above, let's visualise in the high-dimensional space the DNA methylation variation that MOFA captures using the `plot_data_heatmap` function. Notice how noisy and sparse DNA methylation data is.
```{r, out.width="130%"}
plot_data_heatmap(MOFAobject,
view = "Enhancers methylation",
factor = 1,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
```
We will use MOFA to impute the missing values. This is based on the data reconstruction equation shown above.
```{r}
MOFAobject <- impute(MOFAobject)
```
Plot heatmap with `impute=TRUE` argument. Despite the gaussian likelihood model not being optimal for DNA methylation data, this heatmap looks much better!
```{r, out.width="130%"}
plot_data_heatmap(MOFAobject,
view = "Enhancers methylation",
factor = 1,
impute = TRUE,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
```
As we guessed from the variance decomposition analysis, the promoters do not display interesting signal during germ layer commitment
```{r, out.width="130%"}
plot_data_heatmap(MOFAobject,
view = "Promoters methylation",
factor = 1,
impute = TRUE,
features = 25,
annotation_samples = "lineage",
# extra arguments passed to `pheatmap`
show_colnames = F, cluster_cols = F,
annotation_colors = list("lineage"=colors),
annotation_legend = FALSE,
fontsize = 6
)
```
## Characterisation of Factor 2
**(Q) Your task is to provide a characterisation for Factor 2**.
Try a similar pipeline as for Factor 1 and answer the following questions:
- Which germ layer underlies Factor 2?
- Can you identify mRNA markers? Validate them using the [scRNA-seq reference atlas](https://marionilab.cruk.cam.ac.uk/MouseGastrulation2018)
- Generate a heatmap that displays the DNA methylation variation (try with and without imputation of missing values, see the function `impute`)
## Plot combinations of Factors
This scatter plot captures in two dimensions all the variation that is required to separate the three germ layers! It corresponds to [Figure 2b in the paper](https://www.nature.com/articles/s41586-019-1825-8).
```{r}
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "lineage",
dot_size = 2,
legend = TRUE
) + scale_fill_manual(values=colors)
```
We can colour this plot by the molecular measurements, for example gene expression:
```{r}
# Ectoderm marker
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "Pim2",
dot_size = 2
)
# Mesoderm marker
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "Mesp1",
dot_size = 2
)
# Endoderm marker
plot_factors(MOFAobject,
factors = c(1,2),
color_by = "Foxa2",
dot_size = 2
)
```
# Zero-inflation in scRNA-seq data
**(Q) Extract (1) the RNA expression data from the mofa model using the `get_data` function, and (2) the MOFA RNA expression predictions using the `predict` function. Then, plot a histogram of the two matrices. What do you notice?**
# Are the changes in DNA methylation and chromatin accessibility correlated at the loci level?
Using MOFA we have identified coordinated axis of variation between the three omics (Factor1 and Factor2). However, this does not necessarily imply that the *same* features are driving the signal in all of the omics. It could be that DNA methylation changes associated with endoderm commitment (Factor 2) are driven by enhancers A,B,C whereas chromatin accessibility changes are driven by enhancers D,E,F.
To explore this, we will correlate the weights for both epigenetc modalities. This cannot be done internally within MOFA so we have to extract the weights manually do our own downstream analysis This is done with the function `get_weights`:
```{r}
# Fetch weights
w.met <- get_weights(MOFAobject,
factors = c(1,2),
views = "Enhancers methylation",
scale = TRUE,
as.data.frame = T
) %>% as.data.table
w.acc <- get_weights(MOFAobject,
factors = c(1,2),
views = "Enhancers accessibility",
scale = TRUE,
as.data.frame = T
) %>% as.data.table
# Remove the met_ and acc_ prefix from the feature names
w.met[,feature:=substr(feature,5,nchar(as.character(feature)))]
w.acc[,feature:=substr(feature,5,nchar(as.character(feature)))]
# Merge weights
w.dt <- merge(
w.met[,c("feature","factor","value")],
w.acc[,c("feature","factor","value")],
by=c("feature","factor")
)
```
Notice how the DNA methylation and chromatin accessibility weighs are highly correlated, suggesting that the variability in both epigenetic layers is highly coupled not just globally but also at the individual loci level.
```{r}
ggplot(w.dt, aes(x=value.x, y=value.y)) +
geom_point() +
stat_smooth(method="lm") +
theme_classic() +
facet_wrap(~factor, scales="free") +
labs(x="DNA methylation weight", y="Chromatin accessibility weight")
```
# Optional: Using pseudo-time annotations of cells with MEFISTO
In the lectures you have also learnt about extensions of MOFA to account for temporal or spatial information on the samples. You will hear more about spatial data tomorrow. Time course data at a good temporal resolution is still rare for single cell studies. In the full scNMT-seq gastrulation data we have 4 different time points, which does not provide enough temporal resolution for MEFISTO but instead we can consider using a measure of *pseudotime* and use this in MEFISTO to distinguish patterns of variation that vary smoothly along pseudotime from other non-smooth sources of variation, e.g. cell cycle.
For this we can train a similar model on the full data from all 4 stages and simply additionally pass the pseudo-time values for each cell to MOFA. If you are interested in how this works and what other types of down-stream analysis are possible have a look at
[this tutorial](https://raw.githack.com/bioFAM/MEFISTO_tutorials/master/scnmt_mefisto_vignette.html).
# sessionInfo
```{r}
sessionInfo()
```