day2/MOFA_scNMT_tutorial_no_solutions.Rmd

---
title: "MOFA applied to the gastrulation scNMT-seq data set"
author: 
- name: "Ricard Argelaguet"
  affiliation: "European Bioinformatics Institute, Cambridge, UK"
  email: "ricard@ebi.ac.uk"
- name: "Britta Velten"
  affiliation: "German Cancer Research Center, Heidelberg, Germany"
  email: "b.velten@dkfz-heidelberg.de"
date: "`r Sys.Date()`"

output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 1
---

# Description

This tutorial demonstrates how to use MOFA for the integration of single-cell multi-omics data.   We consider a dataset where scNMT-seq was used to simultaneously profile RNA expression, DNA methylation and chromatin accessibility in 1,828 cells at multiple stages of mouse development. MOFA provides a method for delineating coordinated variation between the transcriptome and the epigenome.

The data set we use here is a simplified version of the original data set (only E7.5 cells) published in [Nature](https://www.nature.com/articles/s41586-019-1825-8). The full data set can be downloaded from [this FTP](http://ftp.ebi.ac.uk/pub/databases/scnmt_gastrulation).  

In this vignette we skip all the data processing details as well as the model training part. We focus on the downstream characterisation of a pretrained MOFA model. If you are interested in the code to train the model from the data linked above, have a look at the scripts in the [train_MOFA](https://github.com/PMBio/SingleCellCourse/tree/master/day2/MOFA/train_model) folder.

```{r global_options, include=FALSE}
knitr::opts_chunk$set(warning=FALSE, message=FALSE, fig.align="center")
```

# Load libraries

Load dependencies. Make sure that MOFA2 is imported last, to avoid collisions with functions from other packages
```{r, message=FALSE}
library(data.table)  # fast manipulation of data.frames
library(purrr)       # pipes to make the code more readable
library(ggplot2)
library(MOFA2)
```

Define cell type colors for the visualisations
```{r}
colors <- c(
  "Mesoderm" = "#CD3278",
  "Endoderm" = "#43CD80",
  "Ectoderm" = "steelblue"
)
```


# Load pre-computed MOFA model

As input to the model we quantified DNA methylation and chromatin accessibility values over two different sets of regulatory elements: gene promoters and enhancer elements. RNA expression was quantified over protein-coding genes. After data processing, separate views were defined for the RNA expression and for each combination of genomic context and epigenetic readout.

Note that for this tutorial we selected the MOFA Factors that explained at least 1% of variation in the RNA expression. The trained model can be found here [ftp://ftp.dkfz-heidelberg.de/outgoing/SCCourse2021/day_2_data/MOFAmodel.rds](ftp://ftp.dkfz-heidelberg.de/outgoing/SCCourse2021/day_2_data/MOFAmodel.rds).

```{r}
MOFAobject <- readRDS("gastrulation_data/MOFAmodel.rds")
# MOFAobject <- readRDS(url("ftp://ftp.dkfz-heidelberg.de/outgoing/SCCourse2021/day_2_data/MOFAmodel.rds"))
MOFAobject
```

Explore the cell metadata:  
- **sample**: cell ID  
- **stage**: developmental stage.  
- **lineage**: cell type annotation (derived from mapping the cells to the [10x reference atlas](https://www.nature.com/articles/s41586-019-0933-9)).  
- **pass_rnaQC**: did the cell pass QC for RNA expression?.  
- **pass_metQC**: did the cell pass QC for DNA methylation? `NA` if the cell was only profiled for RNA.  
- **pass_accQC**: did the cell pass QC for chromatin accessibility? `NA` if the cell was only profiled for RNA.  
- **group**: ignore this column

```{r}
head(samples_metadata(MOFAobject))
```

**(Q) How many cells from each lineages are in the object?**


Notice that there a lot of cells that only have RNA expression measurements. One of the major advantages of MOFA is that it handles missing values, so we don't have to remove these cells prior to model training

**(Q) How many cells do not have DNA methylation measurements?**


# Overview of training data
The function `plot_data_overview` can be used to obtain an overview of the input data. 
It shows how many views (rows) and how many cells (columns) exist, what are their corresponding dimensionalities are. It also shows which views each cell is missing.
```{r fig.align="center"}
view.colors <- c(
  "RNA" = "#3CB54E",
  "Enhancers accessibility" = "#00BFC4",
  "Promoters accessibility" = "#00BFC4",
  "Enhancers methylation" = "#F37A71",
  "Promoters methylation" = "#F37A71"
)
view.colors = view.colors[views_names(MOFAobject)]

plot_data_overview(MOFAobject, colors = view.colors)
```

## Visualise RNA expression data. 

Notice that RNA expression has already been normalised (using `scran`), but the distribution looks zero-inflated. This is a statistical challenge for *some* single-cell RNA-seq assays. To model this data we probably want to use a zero-inflated Gaussian distribution, but this not implemented in the MOFA framework. Instead, we used a Gaussian distribution, which should provide a decent approximation but it will likely underestimate the mean expression values.
```{r}
rna <- get_data(MOFAobject, views="RNA")[[1]][[1]]
hist(rna)
```

## Visualise DNA methylation and chromatin accessibility data. 

As seen in the previous session, we use M-values instead of Beta-values.  
However, when looking at the distribution of M-values we can notice that the epigenetic modalities are not well modeled by a Gaussian likelihood. Ideally this data modality should be modeled with a binomial distribution: for every cell $i$ and region $j$ the total number of CpGs correspond to the total number of trials and the number of methylated CpGs to the number of successes. Unfortunately, the binomial likelihood is not implemented in the MOFA framework and thus we need to calculate M-values that can be approximated with a Gaussian distribution.
```{r}
met <- get_data(MOFAobject, views="Promoters methylation")[[1]][[1]]
hist(met)
```

```{r}
acc <- get_data(MOFAobject, views="Promoters accessibility")[[1]][[1]]
hist(acc)
```

# Overview of the MOFA model

## Correlation between factors

A good sanity check is to verify that the Factors are largely uncorrelated. In MOFA there are no orthogonality constraints such as in Principal Component Analysis, but if there is a lot of correlation between Factors this suggests a poor model fit.
```{r}
plot_factor_cor(MOFAobject)
```


## Variance decomposition analysis
The most important insight that MOFA generates is the variance decomposition analysis using `plot_variance_explained`. This plot shows the percentage of variance explained by each factor in each data modality. It summarises the (latent) signal from a complex heterogeneous data set in a single figure.  

```{r}
plot_variance_explained(MOFAobject) +
  theme(
    axis.text.x = element_text(angle=25, hjust=1, vjust=1.05)
  )
```

**(Q) What insights from the data can we learn just from inspecting this plot?**


**(Q) What do you notice about promoters?**


# Characterisation of Factors

There are a few systematic strategies to characterise the molecular signal that underlies each MOFA factor:

- **Association analysis between the sample metadata and the Factor values**: function `correlate_factors_with_covariates`
- **Inspection of factor values**: functions `plot_factor` (one factor at a time) and `plot_factors` (combinations of factors)
- **Inspection of the feature weights**: functions `plot_weights` (all weights), `plot_top_weights` (only the top weights)
- **Gene set enrichment analysis on the mRNA weights**: functions `run_enrichment`, followed by `plot_enrichment`.

## Characterisation of Factor 1

### Visualisation of Factor values 

Plotting Factor 1 values and colouring cells by lineage assignment cleartly shows that this factor captures the variation that is associated with the separation between Mesoderm (positive Factor values) and non-Mesoderm cells (negative Factor values).
 
```{r}
plot_factor(MOFAobject,
  factor = 1,
  color_by = "lineage", 
  add_violin = TRUE,
  dodge = TRUE
) + scale_fill_manual(values=colors)
```

**How do we interpret the factor values?**  
Each factor captures a different source of variability in the data. Mathematically, each Factor is defined by a linear combination of the input features. Each Factor ordinates cells along a one-dimensional axis that is centered at zero. Samples with different signs manifest opposite phenotypes along the inferred axis of variation, with higher absolute value indicating a stronger effect.  
Note that the interpretation of MOFA factors is analogous to the interpretation of the principal components in PCA.


### Visualisation of RNA weights

The weights provide a score for each gene on each factor. Genes with no association with the factor are expected to have values close to zero, whereas genes with strong association with the factor are expected to have large absolute values. The sign of the weight indicates the direction of the effect: a positive weight indicates that the feature is more active in the cells with positive factor values, and viceversa.  

Let's plot the distribution of weights for Factor 1.
```{r, warnings=FALSE, message=FALSE}
plot_weights(MOFAobject,
  view = "RNA",
  factor = 1,
  nfeatures = 10,     # Top number of features to highlight
  scale = T           # Scale weights from -1 to 1
)
```

If you are not interested in the full distribution, but just on the top weights, you can instead do:
```{r}
plot_top_weights(MOFAobject, 
  view = "RNA", 
  factor = 1, 
  nfeatures = 10,
  scale = T, 
  abs = T
)
```

We expect that genes with large positive weights For Factor 1 to be highlighy expressed in the Mesoderm cells. If we plot Factor 1 colouring cells by gene expresion of the top genes **with positive weight**:
```{r}
genes <- c("Phlda2","Mesp1")

for (i in genes) {
  plot_factor(MOFAobject,
    factor = 1,
    dot_size = 2.5,
    group_by = "lineage",
    color_by = i
  ) %>% print
}
```


Similarly, we expect that genes with large negative weights For Factor 1 to be lowly expressed in the Mesoderm cells. If we plot Factor 1 colouring cells by gene expresion of the top genes **with negative weight**:
```{r}
genes <- c("Cldn6","Pim2")

for (i in genes) {
  plot_factor(MOFAobject,
    factor = 1,
    dot_size = 2.5,
    group_by = "lineage",
    color_by = i
  ) %>% print
}
```

## Visualisation of RNA expression patterns in the high-dimensional space

The weights are useful to identify which genes are driving each factors. After inspecting the weights it is good practice to go back to the high-dimensional space and check if the variability that MOFA captures is real.  
For example, one could generate a heatmap plot of the RNA expression for the top genes, where samples are sorted by the corresponding factor values. This is the aim of the `plot_data_heatmap` function:

```{r}
plot_data_heatmap(MOFAobject, 
  view = "RNA", 
  factor = 1, 
  features = 25,
  annotation_samples = "lineage",
  # extra arguments passed to `pheatmap`,
  show_colnames = F, cluster_cols = F, 
  annotation_colors = list("lineage"=colors),
  annotation_legend = FALSE
)
```

An interesting option of `plot_data_heatmap` is to plot "denoised" observations. This is obtained by reconstructing the data using the matrix factorisation equation from MOFA:  

$$\hat{\mathbf{Y}}^m = \mathbf{W}^m\mathbf{Z}$$
where $\mathbf{W}^m$ is the weight matrix for the $m$-th view, and $\mathbf{Z}$ is the (shared) factor matrix.  
This data reconstruction step essentially removes all the variation that is not captured by the model:

```{r}
plot_data_heatmap(MOFAobject, 
  view = "RNA", 
  factor = 1, 
  denoise = TRUE,
  features = 25,
  # extra arguments passed to `pheatmap`
  show_colnames = F, cluster_cols = F 
)
```

## Visualisation of DNA methylation weights

As we have done with RNA, we can also visualise the distribution of weights for the epigenetic modalities. The problem about this is that the large majority of enhancers are not well annotated and we only have the genomic coordinates for them...

```{r}
plot_weights(MOFAobject,
  view = c("Enhancers methylation"),
  factor = 1,
  nfeatures = 5,
  scale = F
)
```

## Visualisation of DNA methylation patterns in the high-dimensional space

As done with the RNA above, let's visualise in the high-dimensional space the DNA methylation variation that MOFA captures using the `plot_data_heatmap` function. Notice how noisy and sparse DNA methylation data is. 

```{r, out.width="130%"}
plot_data_heatmap(MOFAobject, 
  view = "Enhancers methylation", 
  factor = 1, 
  features = 25,
  annotation_samples = "lineage",
  # extra arguments passed to `pheatmap`
  show_colnames = F, cluster_cols = F, 
  annotation_colors = list("lineage"=colors),
  annotation_legend = FALSE,
  fontsize = 6
)
```

We will use MOFA to impute the missing values. This is based on the data reconstruction equation shown above.

```{r}
MOFAobject <- impute(MOFAobject)
```

Plot heatmap with `impute=TRUE` argument. Despite the gaussian likelihood model not being optimal for DNA methylation data, this heatmap looks much better!
```{r, out.width="130%"}
plot_data_heatmap(MOFAobject, 
  view = "Enhancers methylation", 
  factor = 1, 
  impute = TRUE,
  features = 25,
  annotation_samples = "lineage",
  # extra arguments passed to `pheatmap`
  show_colnames = F, cluster_cols = F, 
  annotation_colors = list("lineage"=colors),
  annotation_legend = FALSE,
  fontsize = 6
)
```

As we guessed from the variance decomposition analysis, the promoters do not display interesting signal during germ layer commitment
```{r, out.width="130%"}
plot_data_heatmap(MOFAobject, 
  view = "Promoters methylation", 
  factor = 1, 
  impute = TRUE,
  features = 25,
  annotation_samples = "lineage",
  # extra arguments passed to `pheatmap`
  show_colnames = F, cluster_cols = F, 
  annotation_colors = list("lineage"=colors),
  annotation_legend = FALSE,
  fontsize = 6
)
```


## Characterisation of Factor 2

**(Q) Your task is to provide a characterisation for Factor 2**.   

Try a similar pipeline as for Factor 1 and answer the following questions:  

- Which germ layer underlies Factor 2?  


- Can you identify mRNA markers? Validate them using the [scRNA-seq reference atlas](https://marionilab.cruk.cam.ac.uk/MouseGastrulation2018)  

- Generate a heatmap that displays the DNA methylation variation (try with and without imputation of missing values, see the function `impute`)  

## Plot combinations of Factors

This scatter plot captures in two dimensions all the variation that is required to separate the three germ layers! It corresponds to [Figure 2b in the paper](https://www.nature.com/articles/s41586-019-1825-8).  

```{r}
plot_factors(MOFAobject,
  factors = c(1,2), 
  color_by = "lineage",
  dot_size = 2,
  legend = TRUE
) + scale_fill_manual(values=colors)
```

We can colour this plot by the molecular measurements, for example gene expression:
```{r}
# Ectoderm marker
plot_factors(MOFAobject,
  factors = c(1,2), 
  color_by = "Pim2", 
  dot_size = 2
)

# Mesoderm marker
plot_factors(MOFAobject,
  factors = c(1,2), 
  color_by = "Mesp1", 
  dot_size = 2
)

# Endoderm marker
plot_factors(MOFAobject,
  factors = c(1,2), 
  color_by = "Foxa2", 
  dot_size = 2
)
```

# Zero-inflation in scRNA-seq data

**(Q) Extract (1) the RNA expression data from the mofa model using the `get_data` function, and (2) the MOFA RNA expression predictions using the `predict` function. Then, plot a histogram of the two matrices. What do you notice?**


# Are the changes in DNA methylation and chromatin accessibility correlated at the loci level?

Using MOFA we have identified coordinated axis of variation between the three omics (Factor1 and Factor2). However, this does not necessarily imply that the *same* features are driving the signal in all of the omics. It could be that DNA methylation changes associated with endoderm commitment (Factor 2) are driven by enhancers A,B,C whereas chromatin accessibility changes are driven by enhancers D,E,F.  
To explore this, we will correlate the weights for both epigenetc modalities. This cannot be done internally within MOFA so we have to extract the weights manually do our own downstream analysis This is done with the function `get_weights`:

```{r}
# Fetch weights
w.met <- get_weights(MOFAobject, 
  factors = c(1,2), 
  views = "Enhancers methylation", 
  scale = TRUE,
  as.data.frame = T
) %>% as.data.table

w.acc <- get_weights(MOFAobject, 
  factors = c(1,2), 
  views = "Enhancers accessibility", 
  scale = TRUE,
  as.data.frame = T
) %>% as.data.table

# Remove the met_ and acc_ prefix from the feature names
w.met[,feature:=substr(feature,5,nchar(as.character(feature)))]
w.acc[,feature:=substr(feature,5,nchar(as.character(feature)))]

# Merge weights
w.dt <- merge(
  w.met[,c("feature","factor","value")], 
  w.acc[,c("feature","factor","value")], 
  by=c("feature","factor")
)
```

Notice how the DNA methylation and chromatin accessibility weighs are highly correlated, suggesting that the variability in both epigenetic layers is highly coupled not just globally but also at the individual loci level.
```{r}
ggplot(w.dt, aes(x=value.x, y=value.y)) +
  geom_point() + 
  stat_smooth(method="lm") +
  theme_classic() +
  facet_wrap(~factor, scales="free") +
  labs(x="DNA methylation weight", y="Chromatin accessibility weight")
```


# Optional: Using pseudo-time annotations of cells with MEFISTO
In the lectures you have also learnt about extensions of MOFA to account for temporal or spatial information on the samples. You will hear more about spatial data tomorrow. Time course data at a good temporal resolution is still rare for single cell studies. In the full scNMT-seq gastrulation data we have 4 different time points, which does not provide enough temporal resolution for MEFISTO but instead we can consider using a measure of *pseudotime* and use this in MEFISTO to distinguish patterns of variation that vary smoothly along pseudotime from other non-smooth sources of variation, e.g. cell cycle. 

For this we can train a similar model on the full data from all 4 stages and simply additionally pass the pseudo-time values for each cell to MOFA. If you are interested in how this works and what other types of down-stream analysis are possible have a look at
[this tutorial](https://raw.githack.com/bioFAM/MEFISTO_tutorials/master/scnmt_mefisto_vignette.html). 

# sessionInfo

```{r}
sessionInfo()
```