---
title: "Smith et al. 2018 PNAS 115 (49) E11465-E11474"
output: github_document
---
Data analysis for TurboID samples from the Scott Lab. Two AKAP18 mutants were fused to TurboID: P, a nuclear retention mutant that blocks translocation from the nucleus to the cytoplasm, and an NLS mutant that stays in the cytoplasm. The fusion proteins were overexpressed in cultured cells in the presence of biotin for proximity labeling of potential interacting proteins. TurboID alone was also overexpressed as a control.
For biological replicates, there are 3 control samples, 4 NLS samples, and 4 P samples.
```{r, message=FALSE}
library(tidyverse)
library(stringr)
library(ggthemes)
# for ridgeline (joy) plots
library(ggridges)
# for labelling geom_point; repels labels when they overlap
library(ggrepel)
# for Venn diagrams
library(VennDiagram)
```
Before analyzing the results, I need to tidy up the data, i.e., select the columns needed for the analysis and filter out contaminants (unwanted proteins).
The table contains 2494 rows and 138 columns. Only the columns containing gene names, protein expression, and contamination flags are required. There are three columns containing contamination information; they can be combined to filter out contaminants and then dropped, because they will not be used again.
```{r}
raw.PG <- read.delim("../../proteinGroups.txt", header = TRUE, stringsAsFactors = FALSE)
clean.PG <- raw.PG %>%
dplyr::select(Protein.names,
Gene.names,
Fasta.headers,
Majority.protein.IDs,
starts_with("LFQ"),
Only.identified.by.site,
Reverse,
Potential.contaminant) %>%
# unite the contamination columns for easier filtering
unite(con,
Only.identified.by.site, Reverse, Potential.contaminant, sep = "") %>%
filter(con == "") %>%
dplyr::select(-con)
```
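A quick optional check: comparing the row counts before and after filtering shows how many entries were flagged as contaminant, reverse, or site-only hits.
```{r}
# number of entries removed by the contamination filter
nrow(raw.PG) - nrow(clean.PG)
```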
For the analysis, I need to log-transform the data, check the distributions, and normalize. Normalization is done using a Z-score transformation.
```{r}
LFQ <- clean.PG %>% dplyr::select(starts_with("LFQ")) # extract the numeric columns
# log transform
log.LFQ <- apply(LFQ, 2, log2)
# replace the -Inf
log.LFQ <- as.data.frame(apply(log.LFQ, 2, function(x) ifelse(is.infinite(x), NA, x)), stringsAsFactors = FALSE)
# rename the columns
n <- str_locate(names(log.LFQ)[1], "LFQ.intensity.")[2]
column.names <- names(log.LFQ)
column.names <- substr(column.names, n + 1, nchar(column.names))
colnames(log.LFQ) <- column.names
# check distribution
log.LFQ.ridge <- log.LFQ %>% gather(key = "Samples", value = "Log2.LFQ")
ggplot(log.LFQ.ridge) +
geom_density_ridges(mapping = aes(x = Log2.LFQ, y = Samples))
```
```{r}
# Normalization. I also compared the peak shapes between z-score and MAD normalization,
# but I did not see any obvious difference between the two methods, so z-scores are used
log.LFQ <- apply(log.LFQ, 2, scale, center = TRUE, scale = TRUE)
log.LFQ <- data.frame(log.LFQ, stringsAsFactors = FALSE)
# check distribution after transformation
log.LFQ.ridge <- log.LFQ %>% gather(key = "Samples", value = "Log2.LFQ")
ggplot(log.LFQ.ridge) +
geom_density_ridges(mapping = aes(x = Log2.LFQ, y = Samples)) +
xlab("Normalized LFQ")
```
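The chunk above mentions a comparison with MAD-based normalization. A minimal sketch of that alternative, left commented out because `log.LFQ` has already been overwritten with the z-scored values at this point, would be:
```{r}
# sketch only: robust scaling by median/MAD instead of mean/sd
mad.scale <- function(x) (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
# log.LFQ.mad <- as.data.frame(apply(log.LFQ, 2, mad.scale))
```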
Based on the following scatterplots, reproducibility is reasonable. The controls look clearly different from the fusion proteins, which is expected. Counting the number of quantified proteins per experiment revealed that p.2 has a lower number than the other replicates, so I will exclude it from further analysis.
```{r}
# check how well the replicates agree with each other
pairs(log.LFQ, pch = ".")
# check the number of quantified proteins for each sample
apply(log.LFQ, 2, function(x) sum(!is.na(x)))
# remove p.2
log.LFQ$p.2 <- NULL
```
I count the missing values per gene for each treatment, i.e., control, P mutant, and NLS mutant. This will affect the filtering and p-value adjustment downstream. Missing values are imputed with random numbers drawn from a distribution with mean -2 and standard deviation 0.1. This assumes that the missing data are due to signals falling below the instrument's detection threshold.
```{r}
# Counting NAs
sample.names <- c("ctrl", "nls", "p")
count.NAs <- function(x) {
# separate the samples here
sep.sample <- log.LFQ[, grepl(pattern = x, x = names(log.LFQ)), drop = FALSE]
# get length for calculating %
len.sample <- length(sep.sample)
# count NAs and calculate the percentage of missing data
sep.sample$NAs <- apply(sep.sample, 1,
function(x) sum(is.na(x)))
sep.sample$percent.NAs <- sep.sample$NAs / len.sample
# rename the column
colnames(sep.sample)[as.integer(length(sep.sample))] <- str_c(x, ".percent.NAs")
# keep the last column only
sep.sample <- sep.sample[, as.integer(length(sep.sample)), drop = FALSE]
return(sep.sample)
}
NA.counts <- lapply(sample.names, count.NAs)
NA.counts <- Reduce(x = NA.counts,
f = function(x,y) cbind(x,y))
```
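A quick look at the resulting per-treatment fractions of missing values:
```{r}
# fraction of missing values per treatment for the first few proteins
head(NA.counts)
```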
```{r}
# Imputation of missing values
# because the data are z-score transformed, the mean is 0 and the sd is 1
data.size <- dim(log.LFQ)
# generate random values for imputation: downshifted by 2 (mean = -2) with sd = 0.1
set.seed(576)
impute.values <- rnorm(n = data.size[1] * data.size[2],
mean = -2, sd = 0.1)
# impute
set.seed(576)
log.LFQ.imputed <- data.frame(apply(log.LFQ, 2,
function(x) ifelse(is.na(x), sample(impute.values), x)),
stringsAsFactors = FALSE)
# rename columns
colnames(log.LFQ.imputed) <- str_c("Imputed.", names(log.LFQ.imputed))
```
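As an optional check, the distributions before and after imputation can be overlaid; the imputed values should show up as a small peak around -2.
```{r}
# compare value distributions before and after imputation
check.imputation <- data.frame(
  value = c(unlist(log.LFQ), unlist(log.LFQ.imputed)),
  source = rep(c("Before imputation", "After imputation"),
               each = nrow(log.LFQ) * ncol(log.LFQ)))
ggplot(check.imputation) +
  geom_density(mapping = aes(x = value, colour = source), na.rm = TRUE) +
  theme_classic()
```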
A t-test is used to identify the proteins that are significantly enriched by the fusion proteins (P or NLS) compared to the control. The p-values are adjusted with the "fdr" method after removing rows with too many missing values. Results are visualized with volcano plots and a Venn diagram.
```{r}
# write an automated function for t-test, in case more samples are added in the future
# the sample names
treatment <- c("nls",
"\\.p\\.")
ctrl <- "ctrl"
auto.ttest <- function(x) { # x is one element of the treatment vector defined above
# extract the treatment samples matching the pattern
treatment <- log.LFQ.imputed[, grepl(pattern = x, x = names(log.LFQ.imputed)), drop = FALSE]
# the control columns, matched by the ctrl pattern
ctrl <- log.LFQ.imputed[, grepl(pattern = ctrl, x = names(log.LFQ.imputed)), drop = FALSE]
# get the length of the sample and ctrl, this will be used for indexing later
len.t <- length(treatment) # number of columns in the treatment
len.c <- length(ctrl) # number of columns in the control
# do the t.test
df.for.ttest <- cbind(ctrl, treatment)
len.df <- as.integer(length(df.for.ttest))
df.for.ttest$Test.results <- NA
df.for.ttest$Sample.mean <- NA
df.for.ttest$Ctrl.mean <- NA
df.for.ttest$Difference <- NA
df.for.ttest$pvalue <- NA
df.for.ttest$Test.results <-
apply(df.for.ttest,
1,
function(x) t.test(x = x[as.integer(len.c+1):as.integer(len.c+len.t)], # x is treatment
y = x[1:as.integer(len.c)], var.equal = TRUE)[c(3,5)])
df.for.ttest$pvalue <- apply(df.for.ttest[,"Test.results", drop = FALSE], 1,
function(x) unlist(x[[1]][1]))
df.for.ttest$Sample.mean <- apply(df.for.ttest[,"Test.results", drop = FALSE], 1,
function(x) unlist(x[[1]][[2]][1]))
df.for.ttest$Ctrl.mean <- apply(df.for.ttest[,"Test.results", drop = FALSE], 1,
function(x) unlist(x[[1]][[2]][2]))
df.for.ttest$Difference <- df.for.ttest$Sample.mean - df.for.ttest$Ctrl.mean
results <- df.for.ttest[, c("Sample.mean", "Ctrl.mean", "Difference", "pvalue")]
colnames(results) <- str_c(names(results), x)
return(results)
}
# apply the function
ttest.result.list <- lapply(treatment, auto.ttest)
ttest.results <- Reduce(x = ttest.result.list, f = function(x,y) cbind(x,y))
ttest.results <- ttest.results[, c(2,5,7,8,1,3,4)]
colnames(ttest.results) <- c("Ctrl.means", "p.means", "p.differences", "p.pvalues",
"nls.means", "nls.differences", "nls.pvalues")
```
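As a sanity check of the automated function, the p-value for the first protein can be recomputed with a direct `t.test()` call; the two values should agree.
```{r}
# manual t-test on the first row of the NLS vs. control comparison
manual.p <- t.test(x = unlist(log.LFQ.imputed[1, grepl("nls", names(log.LFQ.imputed))]),
                   y = unlist(log.LFQ.imputed[1, grepl("ctrl", names(log.LFQ.imputed))]),
                   var.equal = TRUE)$p.value
c(manual = manual.p, automated = ttest.results$nls.pvalues[1])
```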
```{r}
# tidy up the resulting data from the t-test
p.data <- cbind(clean.PG[,c(2,3)], ttest.results[, c(2:4)], NA.counts)
p.data <- p.data[p.data$p.percent.NAs < 0.6, ]
p.data$p.adjust.p <- p.adjust(p.data$p.pvalues, method = "fdr")
nls.data <- cbind(clean.PG[,c(2,3)], ttest.results[, c(5:7)], NA.counts)
nls.data <- nls.data[nls.data$nls.percent.NAs < 0.6, ]
nls.data$nls.adjust.p <- p.adjust(nls.data$nls.pvalues, method = "fdr")
```
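For reference, a quick count of how many proteins fall below the 5% adjusted p-value cut-off in each comparison:
```{r}
# number of proteins passing 5% FDR in each comparison
c(P = sum(p.data$p.adjust.p <= 0.05),
  NLS = sum(nls.data$nls.adjust.p <= 0.05))
```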
```{r}
# find the raw p-value threshold corresponding to 5% FDR for each mutant
p.05 <- p.data[p.data$p.adjust.p > 0.05,]
p.05 <- min(p.05$p.pvalues)
nls.05 <- nls.data[nls.data$nls.adjust.p > 0.05,]
nls.05 <- min(nls.05$nls.pvalues)
both.plot <- ggplot() +
# p mutant, below fdr
geom_point(data = p.data[p.data$p.pvalues > p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(p.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = p.data[p.data$p.pvalues <= p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.3, colour = "Red") +
# nls mutant, below fdr
geom_point(data = nls.data[nls.data$nls.pvalues > nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(nls.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = nls.data[nls.data$nls.pvalues <= nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.3, colour = "blue") +
#coord_cartesian(xlim = c(0.5, 4.5)) +
theme_classic()
both.plot
```
```{r}
# separate volcano plot for the P mutant only
p.plot <- ggplot() +
# p mutant, below fdr
geom_point(data = p.data[p.data$p.pvalues > p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(p.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = p.data[p.data$p.pvalues <= p.05,],
mapping = aes(x = p.differences, y = -log(p.pvalues)), alpha = 0.3, colour = "Red") +
coord_cartesian(xlim = c(0.3, 4.6), ylim = c(0, 18)) +
xlab("P Mutant Difference (Z-score normalized log2, P - Ctrl)") +
theme_classic()
p.plot
```
```{r}
nls.plot <- ggplot() +
# nls mutant, below fdr
geom_point(data = nls.data[nls.data$nls.pvalues > nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(nls.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = nls.data[nls.data$nls.pvalues <= nls.05,],
mapping = aes(x = -(nls.differences), y = -log(nls.pvalues)), alpha = 0.3, colour = "blue") +
coord_cartesian(xlim = c(-0.3, -4.6), ylim = c(0, 18)) +
xlab("NLS Mutant Difference (Z-score normalized log2, Ctrl - NLS)") +
theme_classic()
#nls.plot
nls.plot2 <- ggplot() +
# nls mutant, below fdr
geom_point(data = nls.data[nls.data$nls.pvalues > nls.05,],
mapping = aes(x = nls.differences, y = -log(nls.pvalues)), alpha = 0.1) +
geom_hline(yintercept = -log(nls.05), alpha = 0.5, lty = 2) +
# above fdr
geom_point(data = nls.data[nls.data$nls.pvalues <= nls.05,],
mapping = aes(x = nls.differences, y = -log(nls.pvalues)), alpha = 0.3, colour = "blue") +
coord_cartesian(xlim = c(0.3, 4.6), ylim = c(0, 18)) +
xlab("NLS Mutant Difference (Z-score normalized log2, NLS - Ctrl)") +
theme_classic()
nls.plot2
```
```{r}
venn <- draw.pairwise.venn(57 + 24, 40 + 24, 24,
fill = c("blue", "red"),
alpha = c(0.5, 0.5),
category = c("NLS", "P"))
venn
```
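The counts above were entered manually. A sketch of deriving them from the hit lists instead, assuming a hit is defined as in the volcano plots (raw p-value at or below the 5% FDR cut-off):
```{r}
# derive the Venn counts from the data instead of hard-coding them
nls.hits <- nls.data$Fasta.headers[nls.data$nls.pvalues <= nls.05]
p.hits <- p.data$Fasta.headers[p.data$p.pvalues <= p.05]
c(NLS.only = length(setdiff(nls.hits, p.hits)),
  P.only = length(setdiff(p.hits, nls.hits)),
  shared = length(intersect(nls.hits, p.hits)))
```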
```{r}
#pdf(file = "fdr_05.pdf")
#grid.draw(venn)
#dev.off()
venn
```
Finally, I assemble the full output table (annotation, t-test results, adjusted p-values, and the raw and imputed LFQ values) and preview it with `head()`.
```{r}
output <- NULL
output <- cbind(clean.PG[, 1:3], ttest.results[, 1:4])
output <- left_join(output, p.data[, c(2,9)], by = "Fasta.headers")
output <- cbind(output, ttest.results[, 5:7])
output <- left_join(output, nls.data[, c(2,9)], by = "Fasta.headers")
output <- cbind(output, clean.PG[,5:15])
output <- cbind(output, log.LFQ)
output <- cbind(output, log.LFQ.imputed)
output <- data.frame(apply(output, 2, function(x) ifelse(is.na(x), "", x)))
#write.table(output, "AKAP18_p_nls_BioID.txt",
# append = FALSE, sep = "\t", row.names = FALSE, quote = FALSE)
head(output)
```