---
title: "Smith et al. 2018 PNAS 115 (49) E11465-E11474"
output: github_document
---
Data analysis for TurboID samples from the Scott Lab. Two AKAP18 mutants were fused to TurboID: P, a nuclear retention mutant that blocks translocation from the nucleus to the cytoplasm, and an NLS mutant that stays in the cytoplasm. The fusion proteins were overexpressed in cultured cells in the presence of biotin for proximity labeling of potential interacting proteins. TurboID alone was also overexpressed as a control.
For biological replicates, there are 3 control samples, 4 NLS samples, and 4 P samples.
```{r, message=FALSE}
library(tidyverse)
library(stringr)
library(ggthemes)
# for ridgeline (joy) plots
library(ggridges)
# for labelling geom_point; repels labels when they overlap
library(ggrepel)
# for Venn diagrams
library(VennDiagram)
```
Before analyzing the results, I need to tidy up the data, i.e., select the columns needed for the analysis and filter out contaminants (unwanted proteins).
The table contains 2494 rows and 138 columns. Only the columns containing gene names, protein expression, and contamination flags are required. There are three columns containing contamination information; they can be combined to filter out contaminants and then dropped, because they will not be used again.
```{r}
raw.PG <- read.delim("../../proteinGroups.txt", header = TRUE, stringsAsFactors = FALSE)
clean.PG <- raw.PG %>%
dplyr::select(Protein.names,
Gene.names,
Fasta.headers,
Majority.protein.IDs,
starts_with("LFQ"),
Only.identified.by.site,
Reverse,
Potential.contaminant) %>%
# unite the contamination columns for easier filtering
unite(con,
Only.identified.by.site, Reverse, Potential.contaminant, sep = "") %>%
filter(con == "") %>%
dplyr::select(-con)
```
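A quick optional check: comparing the row counts before and after filtering shows how many entries were flagged as contaminant, reverse, or site-only hits.
```{r}
# number of entries removed by the contamination filter
nrow(raw.PG) - nrow(clean.PG)
```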
For the analysis, I need to log-transform the data, check the distributions, and normalize. Normalization is done using a Z-score transformation.
```{r}
LFQ <- clean.PG %>% dplyr::select(starts_with("LFQ")) # extract the numeric columns
# log transform
log.LFQ <- apply(LFQ, 2, log2)
# replace the -Inf
log.LFQ <- as.data.frame(apply(log.LFQ, 2, function(x) ifelse(is.infinite(x), NA, x)), stringsAsFactors = FALSE)
# rename the columns
n <- str_locate(names(log.LFQ)[1], "LFQ.intensity.")[2]
column.names <- names(log.LFQ)
column.names <- substr(column.names, n + 1, nchar(column.names))
colnames(log.LFQ) <- column.names
# check distribution
log.LFQ.ridge <- log.LFQ %>% gather(key = "Samples", value = "Log2.LFQ")
ggplot(log.LFQ.ridge) +
geom_density_ridges(mapping = aes(x = Log2.LFQ, y = Samples))
```
```{r}
# Normalization. I also compared the peak shapes between z-score and MAD normalization,
# but I did not see any obvious difference between the two methods, so z-scores are used
log.LFQ <- apply(log.LFQ, 2, scale, center = TRUE, scale = TRUE)
log.LFQ <- data.frame(log.LFQ, stringsAsFactors = FALSE)
# check distribution after transformation
log.LFQ.ridge <- log.LFQ %>% gather(key = "Samples", value = "Log2.LFQ")
ggplot(log.LFQ.ridge) +
geom_density_ridges(mapping = aes(x = Log2.LFQ, y = Samples)) +
xlab("Normalized LFQ")
```
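The chunk above mentions a comparison with MAD-based normalization. A minimal sketch of that alternative, left commented out because `log.LFQ` has already been overwritten with the z-scored values at this point, would be:
```{r}
# sketch only: robust scaling by median/MAD instead of mean/sd
mad.scale <- function(x) (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
# log.LFQ.mad <- as.data.frame(apply(log.LFQ, 2, mad.scale))
```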
Based on the following scatterplots, reproducibility is reasonable. The controls look clearly different from the fusion proteins, which is expected. Counting the number of quantified proteins per experiment revealed that p.2 has a lower number than the other replicates, so I will exclude it from further analysis.
```{r}
# check how well the replicates agree with each other
pairs(log.LFQ, pch = ".")
# check the number of quantified proteins for each sample
apply(log.LFQ, 2, function(x) sum(!is.na(x)))
# remove p.2
log.LFQ$p.2 <- NULL
```
I count the missing values per gene for each treatment, i.e., control, P mutant, and NLS mutant. This will affect the filtering and p-value adjustment downstream. Missing values are imputed with random numbers drawn from a distribution with mean -2 and standard deviation 0.1. This assumes that the missing data are due to signals falling below the instrument's detection threshold.
```{r}
# Counting NAs
sample.names <- c("ctrl", "nls", "p")
count.NAs <- function(x) {
# separate the samples here
sep.sample <- log.LFQ[, grepl(pattern = x, x = names(log.LFQ)), drop = FALSE]
# get length for calculating %
len.sample <- length(sep.sample)
# count NAs and calculate the percentage of missing data
sep.sample$NAs <- apply(sep.sample, 1,
function(x) sum(is.na(x)))
sep.sample$percent.NAs <- sep.sample$NAs / len.sample
# rename the column
colnames(sep.sample)[as.integer(length(sep.sample))] <- str_c(x, ".percent.NAs")
# keep the last column only
sep.sample <- sep.sample[, as.integer(length(sep.sample)), drop = FALSE]
return(sep.sample)
}
NA.counts <- lapply(sample.names, count.NAs)
NA.counts <- Reduce(x = NA.counts,
f = function(x,y) cbind(x,y))
```
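A quick look at the resulting per-treatment fractions of missing values:
```{r}
# fraction of missing values per treatment for the first few proteins
head(NA.counts)
```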
```{r}
# Imputation of missing values
# because the data are z-score transformed, the mean is 0 and the sd is 1
data.size <- dim(log.LFQ)
# generate random values for imputation: downshifted by 2 (mean = -2) with sd = 0.1
set.seed(576)
impute.values <- rnorm(n = data.size[1] * data.size[2],
mean = -2, sd = 0.1)
# impute
set.seed(576)
log.LFQ.imputed <- data.frame(apply(log.LFQ, 2,
function(x) ifelse(is.na(x), sample(impute.values), x)),
stringsAsFactors = FALSE)
# rename columns
colnames(log.LFQ.imputed) <- str_c("Imputed.", names(log.LFQ.imputed))
```
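As an optional check, the distributions before and after imputation can be overlaid; the imputed values should show up as a small peak around -2.
```{r}
# compare value distributions before and after imputation
check.imputation <- data.frame(
  value = c(unlist(log.LFQ), unlist(log.LFQ.imputed)),
  source = rep(c("Before imputation", "After imputation"),
               each = nrow(log.LFQ) * ncol(log.LFQ)))
ggplot(check.imputation) +
  geom_density(mapping = aes(x = value, colour = source), na.rm = TRUE) +
  theme_classic()
```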
A t-test is used to identify the proteins that are significantly enriched by the fusion proteins (P or NLS) compared to the control. The p-values are adjusted with the "fdr" method after removing rows with too many missing values. Results are visualized with volcano plots and a Venn diagram.
```{r}
# write an automated function for t-test, in case more samples are added in the future
# the sample names
treatment <- c("nls",
"\\.p\\.")
ctrl <- "ctrl"
auto.ttest <- function(x) { # x is one element of the treatment vector defined above
# extract the treatment samples matching the pattern
treatment <- log.LFQ.imputed[, grepl(pattern = x, x = names(log.LFQ.imputed)), drop = FALSE]
# the control columns, matched by the ctrl pattern
ctrl <- log.LFQ.imputed[, grepl(pattern = ctrl, x = names(log.LFQ.imputed)), drop = FALSE]
# get the length of the sample and ctrl, this will be used for indexing later
len.t <- length(treatment) # number of columns in the treatment
len.c <- length(ctrl) # number of columns in the control
# do the t.test
df.for.ttest <- cbind(ctrl, treatment)
len.df <- as.integer(length(df.for.ttest))
df.for.ttest$Test.results <- NA
df.for.ttest$Sample.mean <- NA
df.for.ttest$Ctrl.mean <- NA
df.for.ttest$Difference <- NA
df.for.ttest$pvalue <- NA
df.for.ttest$Test.results <-
apply(df.for.ttest,
1,
function(x) t.test(x = x[as.integer(len.c+1):as.integer(len.c+len.t)], # x is treatment
y = x[1:as.integer(len.c)], var.equal = TRUE)[c(3,5)])
df.for.ttest$pvalue <- apply(df.for.ttest[,"Test.results", drop = FALSE], 1,
function(x) unlist(x[[1]][1]))
df.for.ttest$Sample.mean <- apply(df.for.ttest[,"Test.results", drop = FALSE], 1,
function(x) unlist(x[[1]][[2]][1]))
df.for.ttest$Ctrl.mean <- apply(df.for.ttest[,"Test.results", drop = FALSE], 1,
function(x) unlist(x[[1]][[2]][2]))
df.for.ttest$Difference <- df.for.ttest$Sample.mean - df.for.ttest$Ctrl.mean
results <- df.for.ttest[, c("Sample.mean", "Ctrl.mean", "Difference", "pvalue")]
colnames(results) <- str_c(names(results), x)
return(results)
}
# apply the function
ttest.result.list <- lapply(treatment, auto.ttest)
ttest.results <- Reduce(x = ttest.result.list, f = function(x,y) cbind(x,y))
ttest.results <- ttest.results[, c(2,5,7,8,1,3,4)]
colnames(ttest.results) <- c("Ctrl.means", "p.means", "p.differences", "p.pvalues",
"nls.means", "nls.differences", "nls.pvalues")
```
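As a sanity check of the automated function, the p-value for the first protein can be recomputed with a direct `t.test()` call; the two values should agree.
```{r}
# manual t-test on the first row of the NLS vs. control comparison
manual.p <- t.test(x = unlist(log.LFQ.imputed[1, grepl("nls", names(log.LFQ.imputed))]),
                   y = unlist(log.LFQ.imputed[1, grepl("ctrl", names(log.LFQ.imputed))]),
                   var.equal = TRUE)$p.value
c(manual = manual.p, automated = ttest.results$nls.pvalues[1])
```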
```{r}
# tidy up the resulting data from the t-test
p.data <- cbind(clean.PG[,c(2,3)], ttest.results[, c(2:4)], NA.counts)
p.data <- p.data[p.data$p.percent.NAs < 0.6, ]
p.data$p.adjust.p <- p.adjust(p.data$p.pvalues, method = "fdr")
nls.data <- cbind(clean.PG[,c(2,3)], ttest.results[, c(5:7)], NA.counts)
nls.data <- nls.data[nls.data$nls.percent.NAs < 0.6, ]
nls.data$nls.adjust.p <- p.adjust(nls.data$nls.pvalues, method = "fdr")
```
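For reference, a quick count of how many proteins fall below the 5% adjusted p-value cut-off in each comparison:
```{r}
# number of proteins passing 5% FDR in each comparison
c(P = sum(p.data$p.adjust.p <= 0.05),
  NLS = sum(nls.data$nls.adjust.p <= 0.05))
```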
```{r}
# find the raw p-value threshold corresponding to 5% FDR for each mutant
p.05 <- p.data[p.data$p.adjust.p > 0.05,]
p.05 <- min(p.05$p.pvalues)
nls.05 <- nls.data[nls.data$nls.adjust.p > 0.05,]
nls.05 <- min(nls.05$nls.pvalues)
both.plot <- ggplot() +
# p mutant, below fdr
geom_point(data = p.data[p.data$p.pvalues > p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(p.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = p.data[p.data$p.pvalues <= p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.3, colour = "Red") +
# nls mutant, below fdr
geom_point(data = nls.data[nls.data$nls.pvalues > nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(nls.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = nls.data[nls.data$nls.pvalues <= nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.3, colour = "blue") +
#coord_cartesian(xlim = c(0.5, 4.5)) +
theme_classic()
both.plot
```
```{r}
# separate volcano plot for the P mutant only
p.plot <- ggplot() +
# p mutant, below fdr
geom_point(data = p.data[p.data$p.pvalues > p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(p.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = p.data[p.data$p.pvalues <= p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.3, colour = "Red") +
coord_cartesian(xlim = c(0.3, 4.6), ylim = c(0, 18)) +
xlab("P Mutant Difference (Z-score normalized log2, P - Ctrl)") +
theme_classic()
p.plot
```
```{r}
nls.plot <- ggplot() +
# nls mutant, below fdr
geom_point(data = nls.data[nls.data$nls.pvalues > nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(nls.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = nls.data[nls.data$nls.pvalues <= nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.3, colour = "blue") +
coord_cartesian(xlim = c(-0.3, -4.6), ylim = c(0, 18)) +
xlab("NLS Mutant Difference (Z-score normalized log2, Ctrl - NLS)") +
theme_classic()
#nls.plot
nls.plot2 <- ggplot() +
# nls mutant, below fdr
geom_point(data = nls.data[nls.data$nls.pvalues > nls.05,],
mapping = aes(x = nls.differences, y = -log(nls.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(nls.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = nls.data[nls.data$nls.pvalues <= nls.05,],
mapping = aes(x = nls.differences, y = -log(nls.pvalues)), alpha = 0.3, colour = "blue") +
coord_cartesian(xlim = c(0.3, 4.6), ylim = c(0, 18)) +
xlab("NLS Mutant Difference (Z-score normalized log2, NLS - Ctrl)") +
theme_classic()
nls.plot2
```
```{r}
venn <- draw.pairwise.venn(57 + 24, 40 + 24, 24,
fill = c("blue", "red"),
alpha = c(0.5, 0.5),
category = c("NLS", "P"))
venn
```
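The counts above were entered manually. A sketch of deriving them from the hit lists instead, assuming a hit is defined as in the volcano plots (raw p-value at or below the 5% FDR cut-off):
```{r}
# derive the Venn counts from the data instead of hard-coding them
nls.hits <- nls.data$Fasta.headers[nls.data$nls.pvalues <= nls.05]
p.hits <- p.data$Fasta.headers[p.data$p.pvalues <= p.05]
c(NLS.only = length(setdiff(nls.hits, p.hits)),
  P.only = length(setdiff(p.hits, nls.hits)),
  shared = length(intersect(nls.hits, p.hits)))
```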
```{r}
#pdf(file = "fdr_05.pdf")
#grid.draw(venn)
#dev.off()
venn
```
Finally, I assemble the full output table (annotation, t-test results, adjusted p-values, and the raw and imputed LFQ values) and preview it with `head()`.
```{r}
output <- NULL
output <- cbind(clean.PG[, 1:3], ttest.results[, 1:4])
output <- left_join(output, p.data[, c(2,9)], by = "Fasta.headers")
output <- cbind(output, ttest.results[, 5:7])
output <- left_join(output, nls.data[, c(2,9)], by = "Fasta.headers")
output <- cbind(output, clean.PG[,5:15])
output <- cbind(output, log.LFQ)
output <- cbind(output, log.LFQ.imputed)
output <- data.frame(apply(output, 2, function(x) ifelse(is.na(x), "", x)))
#write.table(output, "AKAP18_p_nls_BioID.txt",
# append = FALSE, sep = "\t", row.names = FALSE, quote = FALSE)
head(output)
```