aCA_XY_Processing_Analysis.Rmd

---
title: "Sex-stratified linear modelling for differential X and Y chromosome DNAme"
author: "Amy M. Inkster"
subtitle: Using GSE115508 EPIC data from placentas with and without acute chorioamnionitis
output:
  pdf_document:
    toc: yes
  html_document:
    toc: yes
    toc_float: yes
    keep_md: yes
---

## 0.0 Intro & load packages

In this script I download the data associated with the acute chorioamnionitis (aCA) dataset (GSE115508) from the Gene Expression Omnibus (GEO) website, process and normalize the X and Y chromosome DNAme data from this cohort in the method of Inkster et al. 2022, and then run a set of linear models to test for differential DNAme associated with aCA status on the X and Y chromosome.

```{r, echo=T, message=F, warning=F,results='hold'}
library(tidyverse)
library(here)

library(biobroom)
library(ewastools)
library(GEOquery)
library(lumi)
library(minfi)
library(sesameData)
library(wateRmelon)
library(cowplot)
library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)

# illumina annotation 
probeInfo <- as.data.frame(cbind(IlluminaHumanMethylationEPICanno.ilm10b4.hg19::Locations, 
                                 IlluminaHumanMethylationEPICanno.ilm10b4.hg19::Other, 
                                 IlluminaHumanMethylationEPICanno.ilm10b4.hg19::Manifest)) 
probeInfo$probeID <- rownames(probeInfo)
chrXprobes <- probeInfo %>% filter(chr == "chrX") 
chrYprobes <- probeInfo %>% filter(chr == "chrY") 

# zhou annotation 
# https://zwdzwd.github.io/InfiniumAnnotation
# https://zhouserver.research.chop.edu/InfiniumAnnotation/20180909/EPIC/EPIC.hg19.manifest.tsv.gz
zhouAnno <- sesameDataGetAnno("EPIC/EPIC.hg19.manifest.tsv.gz")
```

## 1.0 DNAme data

### 1.1 Download from GEO

Metadata is stored on the series matrix and raw data in the form of IDAT files can be downloaded from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115508 

Data and metadata parsing code are not shown in this script.

```{r, eval=F, echo=FALSE,results='hold'}
untar(tarfile = "Z:/7_ExternalData/GSE115508/GSE115508_RAW.tar",
      exdir = "Z:/7_ExternalData/GSE115508/")

idats <- list.files(path="Z:/7_ExternalData/GSE115508/", pattern=".idat.gz$", full.names=TRUE)
length(idats) # should be 158 (79*2)
sapply(idats, unzip, overwrite = TRUE)

rgset <- read.metharray.exp("Z:/7_ExternalData/GSE115508/", extended=T)
```

```{r, eval=F, echo=FALSE,results='hold'}
pDat <- getGEO("GSE115508") # download the series matrix
pDat <- pData(pDat[[1]])
pDat <- pDat %>% dplyr::rename(
  Sample_Name = title,
  GEO_Accession = geo_accession,
  Case_ID = `patient id:ch1`,
  Tissue = `tissue:ch1`,
  Group = `pathology group:ch1`,
  Sex = `fetal sex:ch1`,
  GA = `gestational age:ch1`,
  Plate = `850k plate:ch1`,
  Sentrix_ID = `850k sentrix_id:ch1`,
  Sentrix_Position = `850k sentrix_position:ch1`
)

# create a Sample_ID column (to match rgset colnames)
pDat <- pDat %>% mutate(Sample_ID = paste0(GEO_Accession, "_", Sentrix_ID, "_", Sentrix_Position))

# now select only those columns we will need in future
pDat <- pDat %>% dplyr::select(
  Sample_Name,
  Case_ID,
  Sample_ID,
  GEO_Accession,
  Tissue,
  Group,
  Sex,
  GA, 
  Plate, 
  Sentrix_ID,
  Sentrix_Position
)
```


### 1.2 Load objects

```{r,results='hold'}
pDat <- read.csv(here("GSE115508/GSE115508_pData.csv"))
rgset <- readRDS(here("GSE115508/rgset_ext.rds"))

pDat <- pDat %>% mutate(Sample_ID = paste0(Sentrix_ID, "_", Sentrix_Position)) # to match names of rgset
```

### 1.3 Subset to chorionic villous samples

In this section we are going to select only data from chorionic villi. This dataset is comprised of matched tissue (chorion, amnion, and chorionic villi) from multiple placentae.

```{r, warning=FALSE,results='hold'}
pDat <- pDat %>% filter(Tissue %in% "Chorionic villi") # 48 samps
rgset <- rgset[ , pDat$Sample_ID]
all.equal(pDat$Sample_ID, colnames(rgset)) # must be TRUE before proceeding

# now normalize dataset, also get raw betas for sex checks later with minfi
mset <- preprocessFunnorm(rgset, bgCorr = F, dyeCorr = F) 
mset_raw <- preprocessRaw(rgset) 
betas <- minfi::getBeta(mset)
```


## 2.0 Sample Quality Control

### 2.1 Check identity

Use the 56 rs built into the EPIC array to estimate sample genotypes at these loci and compare across all samples. 

```{r,results='hold'}
rs <- getSnpBeta(rgset)
genotypes <- call_genotypes(rs)

# label samples with donor ID and number of times present in data genetically
pDat <- pDat %>% mutate(Sample_Donors = enumerate_sample_donors(genotypes))
pDat <- pDat %>% group_by(Sample_Donors) %>% mutate(n_Duplicated = n()) %>% ungroup()

# index samples explicitly labelled "rvc" to easily remove prior to analysis
pDat <-  pDat %>%
  mutate(Replicate = case_when(grepl("_r", Sample_Name) ~ "Replicate")) 
```

### 2.2 Check sex (modified ewastools method)

In this section we evaluate the sex of all samples using a modified version of the ewastools check_sex() function. This code was modified to work with minfi objects to avoid parsing the IDATs into multiple data objects.

```{r,results='hold'}
# ALL CREDIT FOR CODE BELOW IS ATTRIBUTED TO EWASTOOLS PACKAGE AUTHORS (Heiss & Just)
# For ewastools package info please visit https://github.com/hhhh5/ewastools

# Original ewastools paper:
# Heiss, J., Just, A. Identifying mislabeled and contaminated DNA methylation
# microarray data: an extended quality control toolset with examples from GEO. 
# Clin Epigenet 10, 73 (2018). https://doi.org/10.1186/s13148-018-0504-1

# function equivalent to ewastools::check_sex(), pull sample XY intensity norm to auto
mset_check_sex = function(mset){
  
  if(!"tidyverse" %in% installed.packages()) {
    install.packages("tidyverse")
  }
  
  
  if(!"BiocManager" %in% installed.packages()) {
    install.packages("BiocManager")
  }
  
  if(!"minfi" %in% installed.packages()) {
    BiocManager::install("minfi")
  }
  
  require(tidyverse)
  require(minfi)
  
  # define M, U, and probe anno objects
  M = minfi::getMeth(mset) %>% as.data.frame()
  U = minfi::getUnmeth(mset) %>% as.data.frame()
  anno = minfi::getAnnotation(mset) %>% as.data.frame() # annotation with "chr" column, entries chr1 - chrY, probeIDs as rownames
	
  
  # select allosomal probes
  chrX = dimnames(anno[anno$chr=='chrX',])[[1]]
  chrY = dimnames(anno[anno$chr=='chrY',])[[1]]
  
  # compute the total intensities
  chrX = colMeans(M[chrX,,drop=FALSE]+U[chrX,,drop=FALSE],na.rm=TRUE)
  chrY = colMeans(M[chrY,,drop=FALSE]+U[chrY,,drop=FALSE],na.rm=TRUE)
  
  # compute the average total intensity across all autosomal probes
  autosomes = dimnames(anno[!anno$chr %in% c("chrX","chrY"),])[[1]]
  autosomes = colMeans(M[autosomes,,drop=FALSE]+U[autosomes,,drop=FALSE],na.rm=TRUE)
  
  # normalize total intensities
  chrX_nonorm = chrX
  chrY_nonorm = chrY
  
  chrX = chrX/autosomes
  chrY = chrY/autosomes
  
  # define a df with the output
  xy_int = data.frame(Sample_ID = dimnames(mset)[[2]],
                      Array = annotation(mset)[[1]],
                      X = chrX,
                      Y = chrY)
  
	return((xy_int))
}


# equivalent to ewastools::predict_sex(), calls sex based on XY intensity values 
# by calculating robust hodges-lehmann estimator based on known male/female samples
mset_predict_sex = function(sex_int,male,female){

	# compute the robust Hodges-Lehmann estimator for the total intensity for X chr probes
	cutX = outer(sex_int$X[male],sex_int$X[female],"+")
	cutX = median(cutX)/2

	# ... likewise for Y chr probes
	cutY = outer(sex_int$Y[male],sex_int$Y[female],"+")
	cutY = median(cutY)/2

	# Prediction based on in which quadrant (cutX/cutY) samples fall
	DNAme_Sex = rep(NA,times=length(sex_int$X))
	DNAme_Sex[sex_int$X>=cutX & sex_int$Y<=cutY] =  "F"
	DNAme_Sex[sex_int$X<=cutX & sex_int$Y>=cutY] =  "M"
	factor(DNAme_Sex,levels=c("M","F"),labels=c("M","F"))
	
	sex_preds = cbind(sex_int, DNAme_Sex)
	
	results = list("Sex_Preds" = sex_preds, "cutX" = cutX, "cutY" = cutY) 
	
	return(results)
	
}
```


```{r,results='hold'}
# calculate normalized X and Y fluo intensity per sample & predict sexes
# need logical vectors of male and females in dataset
male <- pDat$Sex == "M"
female <- pDat$Sex == "F"

sex_list <- mset_check_sex(mset_raw)
sex_list <- mset_predict_sex(sex_list, male, female)
head(sex_list$Sex_Preds) # result: sex estimation for each sample with chrX & chrY norm'd fluo intensity columns attached

# attach X and Y normalized intensities plus predicted sexes to metadata
pDat <- pDat %>% left_join(sex_list$Sex_Preds, by="Sample_ID")

# plot coloring by REPORTED SEX to see how accurate the metadata is
# no samples of concern for sex chr complement
pDat %>%
  ggplot(aes(x=X, y=Y, color=Sex)) +
  geom_point() +
  geom_vline(xintercept = sex_list$cutX) +
  geom_hline(yintercept = sex_list$cutY) +
  labs(title="chrX vs chrY normalized fluorescence intensity",
       color="Reported Sex",
       x="Mean chrX / mean Autosomal Intensity",
       y="Mean chrY / mean Autosomal Intensity") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5),
        plot.subtitle = element_text(hjust=0.5, face="italic"), 
        legend.position = "none",
        strip.text = element_text(size=12),
        strip.background = element_rect(fill="#DDDDDD", color= "#DDDDDD"),
        panel.border = element_rect(color= "#EDEDED", fill=NA))
```

## 3.0 Probe filtering

### 3.1 Detection p value & beadcount

Defining poor quality probes herein those with a detection P value > 0.01 in >5% of samples.
*This is a step that has to be done differently for the Y chromosome - i.e. don't evaluate detP in XX cases!* 

Also, remove probes with a beadcount < 3 in >5% of samples. No need to sex-stratify beadcount failure calling.

```{r,results='hold'}
# create a 'failed probe' matrix to filter
detp <- minfi::detectionP(rgset)
bc <- beadcount(rgset)

# for female Y chromosome, set detp to 0 (they are not failing so not > 0.05)
detp[rownames(detp) %in% chrYprobes$probeID, female] <- 0

# create a failed probes matrix
 # if detP > 1 OR bc<3 (NA) OR missing value, sum per cell will be > 0
rawbeta <- getBeta(rgset)

fp <- (detp>0.01) + is.na(bc) + is.na(rawbeta)
fp <- fp > 0

fail_probes <- fp[ rowSums(fp) > ncol(fp)*0.05 ,]
dim(fail_probes) 

# remove failed probes in > 5% samples
dim(betas) # 865859        
betas <- betas[!(rownames(betas) %in% rownames(fail_probes)),]
dim(betas) # 855837    

# identify (& remove) samples with > 5% failed probes
table(colSums(fp) > nrow(fp)*0.05) # 0 failed
```

### 3.2 Polymorphic & cross-hybridizing probes

No separate treatment required for X and Y data versus autosomes. 

```{r,results='hold'}
# remove probes
zhou_maskgeneral <- zhouAnno %>% filter(MASK_general == TRUE)
dim(zhou_maskgeneral) # 99360 that need to be removed

table(chrXprobes$probeID %in% zhou_maskgeneral$probeID) # 2223 X probes to remove
table(chrYprobes$probeID %in% zhou_maskgeneral$probeID) # 217 Y probes to remove

dim(betas) # 855837     
betas <- betas[!(rownames(betas) %in% zhou_maskgeneral$probeID), ]
dim(betas) # 758142     
```

### 3.4 Non-variable probes

In the context of most processing pipelines, now would be the time to remove non-variable probes. For statistical validity with later multiple test correction, we need to do this with reference to an external database or a priori biological knowledge. 

Unfortunately, a resource for EPIC placental data that includes the X and Y does not exist yet, so we are limited in our ability to do so. Thus we will leave in the non-variable due to technical limitations, but if we were going to do this Y variability should be assessed in males only! (Call variability and then set female Y values to 0 as we did in the detP section).

## 4.0 Analysis

### 4.1 Remove replicates

```{r,results='hold'}
pDat_lm <- pDat %>% filter(is.na(Replicate)) # keep everything that is not a replicate
betas <- betas[,pDat_lm$Sample_ID]

pDat_f <- pDat_lm %>% filter(Sex == "F")
pDat_m <- pDat_lm %>% filter(Sex == "M")

table(pDat_f$Group) # 11aCA vs 9not
table(pDat_m$Group) # 11aCA vs 13not
```

### 4.2 Select XY probes

We are going to run 3 linear models, Female X, Male X, and Male Y. The data and metadata will need to be separated into these subsets and then subjected to linear modelling.

```{r,results='hold'}
mVals <- beta2m(betas)

m_xf <- mVals[rownames(mVals) %in% chrXprobes$probeID, pDat_f$Sample_ID]
m_xm <- mVals[rownames(mVals) %in% chrXprobes$probeID, pDat_m$Sample_ID]
m_ym <- mVals[rownames(mVals) %in% chrYprobes$probeID, pDat_m$Sample_ID]

dim(m_xf) # 16344 20
dim(m_xm) # 16344 24
dim(m_ym) # 314 24
```

### 4.3 Linear model

```{r,results='hold'}
# specify design matrix (one for each sex)
mod_f <- model.matrix(~ Group + GA, pDat_f, row.names=T)
mod_m <- model.matrix(~ Group + GA, pDat_m, row.names=T)

# female X
fit_xf <- lmFit(m_xf, mod_f) %>% eBayes() 
td_xf <- tidy.MArrayLM(fit_xf) %>%
  dplyr::rename(probeID = gene) %>% 
  group_by(term) %>%
  mutate(fdr = p.adjust(p.value, method="fdr")) %>%
  ungroup() %>%
  as.data.frame()

td_xf %>% filter(fdr < 0.05) # one hit

# male X
fit_xm <- lmFit(m_xm, mod_m) %>% eBayes() 
td_xm <- tidy.MArrayLM(fit_xm) %>%
  dplyr::rename(probeID = gene) %>% 
  group_by(term) %>%
  mutate(fdr = p.adjust(p.value, method="fdr")) %>%
  ungroup() %>%
  as.data.frame()

td_xm %>% filter(fdr < 0.05) # 0


# male Y
fit_ym <- lmFit(m_ym, mod_m) %>% eBayes() 
td_ym <- tidy.MArrayLM(fit_ym) %>%
  dplyr::rename(probeID = gene) %>% 
  group_by(term) %>%
  mutate(fdr = p.adjust(p.value, method="fdr")) %>%
  ungroup() %>%
  as.data.frame()

td_ym %>% filter(fdr < 0.05) # 0
```


## 5.0 Plot hit

One CpG was differentially methylated in females on the X chromosome by aCA status. Boxplot to visualize this below, this CpG is in the gene body of the NKAP gene (NFkB activating protein), and is more highly methylated in non-chorioamnionitis placentae.

### 5.1 Boxplot

```{r,results='hold'}
td_xf %>% filter(fdr < 0.05)

cg22880666 <- betas[rownames(betas) %in% c("cg22880666"), pDat_f$Sample_ID, drop=F] %>% 
  t() %>%
  as.data.frame() %>%
  rownames_to_column(., var="Sample_ID") %>%
  left_join(pDat_f)

cg22880666 %>% 
  group_by(Group) %>%
  summarize(mean=mean(cg22880666))

plot_xHit <- cg22880666 %>%
  mutate(Group = case_when(Group == "chorioamnionitis" ~ "aCA",
                           Group == "non_chorioamnionitis" ~ "non-aCA")) %>%
  
  ggplot(aes(x=Group, y=cg22880666, color=Group, fill=Group)) +
  geom_boxplot(outlier.shape=NA, alpha=0.5) +
  geom_point(position=position_jitter(width=0.05)) + # position_jitterdodge(dodge.width = 0.2)) +
  ggtitle("Female X: cg22880666") +
  ylab("DNAme \u03b2 value") +
  theme_classic(base_size=16) +

  theme(plot.title = element_text(hjust=0.5),
        plot.subtitle = element_text(hjust=0.5, face="italic"), 
        legend.position = "none",
        strip.text = element_text(face="bold", size=12),
        panel.grid = element_blank()) +
  
  scale_fill_manual(values=c("#882255", "#cc6677")) +
  scale_color_manual(values=c("#882255", "#cc6677"))
  

chrXprobes %>% filter(probeID %in% c("cg22880666"))
```

### 5.2 Manhattan plots

```{r,results='hold'}
# manhattan plot
plot_manF <- td_xf %>% 
  left_join(probeInfo %>% dplyr::select(probeID, pos)) %>%
  ggplot(aes(x=pos, y=-log(p.value))) +
  geom_point(alpha=0.2) +
  geom_hline(yintercept=-log(0.05/nrow(td_xf)), color="#BBBBBB") +
  ylab("-log(p Female)") +
  ggtitle("X Chromosome") +
  ylim(c(0,15)) +
  theme_classic(base_size=16) +

 #  theme_light() +
  theme(plot.title = element_text(hjust=0.5),
        plot.subtitle = element_text(hjust=0.5, face="italic"), 
        legend.position = "none",
        strip.text = element_text(face="bold", size=12),
        panel.grid = element_blank(),
        axis.text.x = element_blank(),
        axis.title.x = element_blank(),
        axis.line.x = element_blank(),
        axis.ticks.x = element_blank())
  
plot_manM <- td_xm %>%
  left_join(probeInfo %>% dplyr::select(probeID, pos)) %>%
  ggplot(aes(x=pos, y=-log(p.value))) +
  geom_point(alpha=0.2) +
  geom_hline(yintercept=-log(0.05/nrow(td_xm)),  color="#BBBBBB") +
  scale_y_reverse(lim=c(15,0)) +
  xlab("hg19 coordinate") +
  ylab("-log(p Male)") +
  theme_classic(base_size=16) +

 #  theme_light() +
  theme(plot.title = element_text(hjust=0.5),
        plot.subtitle = element_text(hjust=0.5, face="italic"), 
        legend.position = "none",
        strip.text = element_text(face="bold", size=12),
        panel.grid = element_blank())

plot_manhattan_x <- plot_grid(plot_manF, plot_manM, ncol=1, nrow=2)


plot_manhattan_y <- td_ym %>%
  left_join(probeInfo %>% dplyr::select(probeID, pos)) %>%

  ggplot(aes(x=pos, y=-log(p.value))) +
  geom_point(alpha=0.2) +
  geom_hline(yintercept=-log(0.05/nrow(td_ym)), color="#BBBBBB") +
  ylab("-log(p Male)") +
  xlab("hg19 coordinate") +
  ggtitle("Y Chromosome") +
  ylim(c(0,15)) +
  theme_classic(base_size=16) +

 #  theme_light() +
  theme(plot.title = element_text(hjust=0.5),
        plot.subtitle = element_text(hjust=0.5, face="italic"), 
        legend.position = "none",
        strip.text = element_text(face="bold", size=12),
        panel.grid = element_blank())

bottom <- plot_grid(plot_manhattan_y, plot_xHit, ncol=2, nrow=1, axis="b", align="v", labels=c("B.", "C."), label_size = 12)
```

### 5.3 Manuscript figure

```{r, fig.height=8, fig.width=10,results='hold'}
fig_aCA_composite <- plot_grid(plot_manhattan_x, bottom, nrow=2, ncol=1, labels=c("A."), rel_heights=c(1,0.8), label_size=12)
fig_aCA_composite
```



```{r,echo=FALSE,results='hide'}

# MANUSCRIPT FIGURE 
grDevices::cairo_pdf("Z:/Amy/Project_XY/03_Figures/fig4_aCA.pdf", width = 10, height=8)
fig_aCA_composite
dev.off()

```





```{r,echo=F, results='hide'}
# demographics table - hide this from markdown script
pDat_lm %>%
  tableone::CreateTableOne(
    vars=c(
      "Group",
      "GA"),
    
    strata = c("Sex"),
    data=.)
```

```{r parse local rgset to get extended, echo=F, eval=F,results='hold'}

masterSS <- readxl::read_xlsx("Z:/ROBLAB6 InfiniumSequenom/Master_Sample_Sheet.xlsx", col_names = T)
batch7 <-  masterSS %>% filter(Batch == "Batch 7", Platform == "850k")

pDat <- read.csv("Z:/7_ExternalData/GSE115508/GSE115508_pData.csv")
batch7[!(batch7$Sample_Name %in% pDat$Sample_Name), ] # PL31_amc is not in final GEO dataset
batch7 <- batch7 %>% filter(!(Sample_Name == "PL31_amc")) # remove PL31_amc

batch7 <- batch7 %>% mutate(Basename_2 = paste0("Z:/ROBLAB6 InfiniumSequenom/EPIC Raw data/Batch7_rescan/",
                                               Sentrix_ID, "/",
                                               Sentrix_ID, "_", Sentrix_Position))

length(unique(batch7$Basename)) # 78, i.e. 1/79 is duplicate :(
length(unique(batch7$Basename_2)) # 79 - fixed!

batch7[duplicated(batch7$Basename),] # PL29_amc sample is the affected one
batch7 <- batch7 %>% filter(!(Sample_Name == "PL29_amc")) # remove PL29_amc, it is likely chorion & not on geo

rgset_ext <- read.metharray(basenames=batch7$Basename_2, extended = T)
saveRDS(rgset_ext, "Z:/Amy/Project_XY/XX_Data_aCA/rgset_ext.rds")

```


## 6.0 Session Info

```{r,results='hold'}
sessionInfo()
```