---
title: "Bayesian Shrinkage Models with Applications to Preclinical Research"
author: "Erina Paul (erina.paul@merck.com), Arinjita Bhattacharyya, Richard Baumgartner, Himel Mallick"
output:
  pdf_document:
    toc: TRUE
    highlight: null
---
```{R include=F}
knitr::opts_chunk$set(cache = FALSE, warning = FALSE, message = FALSE,
                      fig.align = 'center',
                      fig.width = 6,
                      fig.height = 4,
                      tidy = TRUE,
                      tidy.opts = list(width.cutoff = 50))
```
# Introduction
This tutorial walks through the example applications described in the paper "Bayesian Shrinkage Models with Applications to Preclinical Research". Bayesian Shrinkage Models (BSMs) are particularly useful for improving prediction and facilitating uncertainty quantification in high-dimensional regression and classification, and serve as an alternative to classical model selection and frequentist penalized regression methods. Methodological details of BSMs are provided in the accompanying manuscript.
In this tutorial, we provide an overview of BSMs for continuous, binary, count, and survival outcomes and demonstrate some use cases of BSMs in R, with a particular focus on preclinical applications using one TCGA and one microbiome dataset. These two datasets are briefly discussed in later sections as well as in the associated manuscript. This tutorial has been led by members of the *DIA/ASA-BIOP Nonclinical Bayesian Working Group*.
# Preparing the R workspace
First, users should install and load the following libraries (available from CRAN at the time of writing this tutorial).
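If any of these packages are missing, they can first be installed from CRAN (a one-time step):
```{R eval = FALSE}
# One-time installation from CRAN
install.packages(c("tidyverse", "bayesreg", "plotrix", "coda",
                   "psych", "survC1", "pROC", "Rfssa"))
```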
```{R include = T}
# Clear workspace
rm(list = ls())
# Load libraries
library(tidyverse) # Data Manipulation
library(bayesreg) # Fitting BSMs
library(plotrix) # Credible Interval Plot
library(coda) # MCMC Diagnostics
library(psych) # Posterior Histograms
library(survC1) # C-index for Survival Data
library(pROC) # AUC Calculation
library(Rfssa) # Load Data from GitHub
```
# Loading and previewing the TCGA data
This first preclinical dataset (TCGA) is obtained from [Huang et al. (2020)](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007607), where the predictors are gene expression measurements (log-transformed FPKM values with a pseudo count of 0.1) from 979 samples (cancer cell lines from 55 different tissues). The processed dataset, which consists of 13,941 genes, is available from <https://github.com/emad2/TG-LASSO>. For illustration purposes, we further pre-process and filter the dataset by first removing genes with low variability and then selecting the 100 most variable genes across all samples.
The outcome variable of interest is the clinical drug response (CDR) to a panel of 23 drugs. The primary goal of the analysis is to assess drug response prediction performance based on preclinical molecular profiles and to identify biomarkers of drug sensitivity and survival. Among the 23 potential continuous response variables (the logarithm of the half-maximal inhibitory concentration, log(IC50), for the 979 cancer cell lines), we select the one with the fewest missing values (**tamoxifen**) as the outcome throughout the tutorial. We also consider transformed versions of the same variable to showcase the application of BSMs to binary and count outcomes in subsequent sections.
Analysis codes to perform the above pre-processing steps are available from <https://github.com/himelmallick/bsmTutorial>.
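For intuition, the variability filter can be sketched as follows (a minimal illustration with a hypothetical expression matrix `expr`, samples in rows and genes in columns; the repository code is the authoritative version):
```{R eval = FALSE}
# Keep the 100 most variable genes across samples (hypothetical 'expr' matrix)
gene_vars = apply(expr, 2, var)
expr = expr[, order(gene_vars, decreasing = TRUE)[1:100]]
```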
In the code snippet below, we first load the dataset into R and extract the gene expression data (X) and the outcome data (Y) for further analysis.
```{R include = T}
# Load data
load_github_data("https://github.com/himelmallick/bspTutorial/blob/master/data/TGLASSO_filtered.RData")
trainX = pcl$trainX
trainY = pcl$trainY
```
# Remove missing values
Once we have loaded the data, we select the response variable of interest among the 23 available. We choose the one with the fewest missing values, which is **tamoxifen**.
```{R include = T}
# Count missing values per drug and sort
sort(colSums(is.na(trainY)), decreasing = FALSE)
# Remove missing values
missing_index = which(is.na(trainY$tamoxifen))
trainY = trainY[-missing_index, ]
trainX = trainX[-missing_index, ]
```
# Sanity Check
Having removed the missing values, the two data matrices should now have aligned rows, which we validate as follows.
```{R include = T}
# Sanity Check
all(rownames(trainX)==rownames(trainY))
```
# Shrinkage priors with continuous response
## Standardizing the data
An important first step before fitting a regression model is to standardize the predictors. Some packages handle this internally, while others do not; we therefore recommend standardizing the predictors explicitly prior to analysis. Here, we also center the response variable so that the intercept term of the linear model is zero.
```{R include = T}
# Standardization
y.train = trainY$tamoxifen - mean(trainY$tamoxifen)
x.train = scale(as.matrix(trainX))
df.train = cbind.data.frame(y.train, x.train)
```
Now, we fit BSMs with four shrinkage priors, all implemented in the `bayesreg` package: Horseshoe, Horseshoe+, Bayesian Ridge (BR), and Bayesian LASSO (BL). In all models, we retain 1,000 Gibbs samples (the package default) after 10,000 burn-in iterations, with all other parameters fixed at their default values.
```{R include = T}
# Horseshoe
set.seed(1234)
fit.hs = bayesreg(y.train ~ ., df.train, prior = "hs", burnin = 10000)
mse.hs = fit.hs$rootmse
# Horseshoe plus
fit.hsplus = bayesreg(y.train ~ ., df.train, prior = "horseshoe+", burnin = 10000)
mse.hsplus = fit.hsplus$rootmse
# Bayesian Ridge
fit.br = bayesreg(y.train ~ ., df.train, prior = "ridge", burnin = 10000)
mse.br = fit.br$rootmse
# Bayesian LASSO
fit.bl = bayesreg(y.train ~ ., df.train, prior = "lasso", burnin = 10000)
mse.bl = fit.bl$rootmse
```
For the continuous response model, we consider the root mean squared error (RMSE), as returned in the `rootmse` element of the fitted object, for performance assessment.
```{R include = T}
# RMSE (bayesreg reports the root mean squared error in 'rootmse')
mse.val = rbind(mse.hs, mse.hsplus, mse.br, mse.bl)
colnames(mse.val) = "RMSE"
row.names(mse.val) = c("Horseshoe", "Horseshoe+", "Bayesian Ridge", "Bayesian LASSO")
mse.val
```
For visualization, we consider four types of plots: autocorrelation function (ACF), trace, histogram, and credible interval (CI) plots for the top 10 $\beta$'s, ranked by absolute posterior median estimates. With the exception of the CI plot, we only show the plots for the Bayesian LASSO for brevity; the same plots can be generated for the other shrinkage priors and regression models.
```{R include = T}
# Visualization (Top 10 Beta)
# Find the top coefficients by absolute posterior median
top_n = 10
# Helper: extract the posterior samples of the top_n coefficients,
# ranked by absolute posterior median
get_top_beta = function(fit, top_n) {
  topbeta = apply(abs(t(fit$beta)), 2, median)
  index_top = order(topbeta, decreasing = TRUE)[1:top_n]
  t(fit$beta[index_top, ])  # samples (rows) by top_n coefficients (columns)
}
topBeta_hs = get_top_beta(fit.hs, top_n)          # HS
topBeta_hsplus = get_top_beta(fit.hsplus, top_n)  # HS+
topBeta_br = get_top_beta(fit.br, top_n)          # BR
topBeta_bl = get_top_beta(fit.bl, top_n)          # BL
#############
# ACF plots #
#############
beta.bl = as.mcmc(topBeta_bl)
par(mfrow = c(3, 4))
# ACF is computed per chain, i.e., per column of the sample matrix
for (j in 1:top_n) {
  acf(beta.bl[, j], main = colnames(beta.bl)[j])
}
```
```{R include = T}
# Trace plots
par(mfrow = c(3, 4))
traceplot(beta.bl, smooth = TRUE)
```
```{R include = T}
# Histogram
multi.hist(beta.bl, density=TRUE, main = "", ncol = 5)
```
```{R include = T}
# CI plot
# Each topBeta_* matrix holds the posterior samples (rows) of the top 10 coefficients (columns)
cred1 = apply(topBeta_hs, 2, quantile, prob = c(0.025, 0.5, 0.975)) # HS
cred2 = apply(topBeta_hsplus, 2, quantile, prob = c(0.025, 0.5, 0.975)) # HS+
cred3 = apply(topBeta_br, 2, quantile, prob = c(0.025, 0.5, 0.975)) # BR
cred4 = apply(topBeta_bl, 2, quantile, prob = c(0.025, 0.5, 0.975)) # LASSO
L1 = cred1[1,]; L2 = cred2[1,]; L3 = cred3[1,]; L4 = cred4[1,]
m.cre1 = cred1[2,]; m.cre2 = cred2[2,]; m.cre3 = cred3[2,]; m.cre4 =cred4[2,]
U1 = cred1[3,]; U2 = cred2[3,]; U3 = cred3[3,]; U4 = cred4[3,]
xOff = 0.1
Q = 10
plotCI(c(m.cre1, m.cre2, m.cre3, m.cre4),
x = c((1:Q)-2*xOff, (1:Q), (1:Q)+2*xOff, (1:Q)+4*xOff),
ui = c(U1, U2, U3, U4),
li = c(L1, L2, L3, L4),
xlab = "betas",
ylab = "Estimates",
axes = FALSE,
lwd = 1,
ylim = c(-0.2, 0.3),
font = 1,
cex.lab = 1,
scol = "orange3",
col = rep(c("violet", "brown", "purple", "deeppink"), each = Q),
cex = 1,
cex.axis = 1,
pch = rep(15:18, each = Q))
axis(1, 1:Q, font = 1, cex.axis = 1, adj = 0, las = 1)
axis(2)
box()
abline(h = 0, lty = 2, lwd = 2, col = "gray60")
legend(8, 0.3, c("HS", "HS+", "BR", "BL"), pch = c(15, 16, 17, 18), pt.cex = 1,
text.width = 1, col = c("violet", "brown", "purple", "deeppink"))
```
Several observations are in order. First, for this particular example, the Bayesian LASSO has the best prediction performance based on the RMSE; ideally, this assessment would be based on a held-out test dataset, which is not available for this example. Second, the ACF plots reveal that the autocorrelations decay to zero rapidly, and the trace plots show that the MCMC chain traverses the posterior space quickly, indicating that the sampler mixes well enough to support meaningful inference. Third, the histograms of the 1,000 posterior samples show that the posterior distributions are approximately normal, as expected. Finally, the CI plots highlight that BSMs provide a valid measure of uncertainty for each coefficient, which is not easily obtained with frequentist penalized methods.
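As a complementary numeric check of mixing, the `coda` package (already loaded) can summarize the same chains; a minimal sketch using the `beta.bl` object created above:
```{R include = T}
# Numeric MCMC diagnostics for the top-10 Bayesian LASSO chains
effectiveSize(beta.bl)  # effective sample size per coefficient
geweke.diag(beta.bl)    # z-scores comparing early vs. late chain segments
```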
# Save the output
This tutorial is just a starting point for what can be done with a default Bayesian analysis using BSMs. For further downstream analyses, we can save the model output using the following code (shown here for the Horseshoe regression):
`save(fit.hs, file = 'fit.HSContinuous.RData')`
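The saved fit can be restored in a later R session with `load('fit.HSContinuous.RData')`.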
# Shrinkage priors with binary response
Next, we showcase the same four shrinkage priors for a binary response. We use the same TCGA dataset as before and dichotomize the continuous response (**tamoxifen**), coding samples with drug response at or above the median as 1 and the rest as 0.
```{R include = T}
# Creating binary response variable: Y
y.train.binary = ifelse(trainY$tamoxifen >= median(trainY$tamoxifen), 1, 0)
y.train.binary = as.factor(y.train.binary)
df.train = cbind.data.frame(y.train.binary, x.train)
# Horseshoe
set.seed(1234)
fit.hs = bayesreg(y.train.binary ~ ., df.train, model = "logistic", prior = "hs", burnin = 10000)
# Horseshoe plus
fit.hsplus = bayesreg(y.train.binary ~ ., df.train, model = "logistic", prior = "horseshoe+", burnin = 10000)
# Bayesian Ridge
fit.br = bayesreg(y.train.binary ~ ., df.train, model = "logistic", prior = "ridge", burnin = 10000)
# Bayesian LASSO
fit.bl = bayesreg(y.train.binary ~ ., df.train, model = "logistic", prior = "lasso", burnin = 10000)
```
For the binary response, we report the area under the ROC curve (AUC) as the performance metric, computed here from the predicted class labels.
```{R include = T}
# AUC
# Helper: AUC from predicted class labels (uses df.train and y.train.binary
# defined above); predicted probabilities, where supported, would give a
# threshold-free ROC curve instead
get_auc = function(fit) {
  y.pred = predict(fit, df.train, type = 'class')
  roc.obj = roc(as.numeric(y.train.binary), as.numeric(y.pred))
  as.numeric(auc(roc.obj))
}
measure.valhs = get_auc(fit.hs)          # Horseshoe
measure.valhsplus = get_auc(fit.hsplus)  # Horseshoe+
measure.valbr = get_auc(fit.br)          # Bayesian Ridge
measure.valbl = get_auc(fit.bl)          # Bayesian LASSO
measure.val = rbind(measure.valhs, measure.valhsplus, measure.valbr, measure.valbl)
colnames(measure.val) = c("AUC")
row.names(measure.val) = c("Horseshoe", "Horseshoe+", "Bayesian Ridge", "Bayesian LASSO")
measure.val
```
# Shrinkage priors with count response
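To illustrate count outcomes, we back-transform the log(IC50) response to the original IC50 scale and round to the nearest integer, then refit the same four shrinkage priors under a Poisson model.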
```{R include = T}
# Convert the response variable to a count outcome (back-transform
# log(IC50) to the IC50 scale and round to the nearest integer)
trainY = round(exp(trainY))
# Remove missing values; guard against an empty index, since subsetting
# with a negative empty index would drop all rows
missing_index = which(is.na(trainY$tamoxifen))
if (length(missing_index) > 0) {
  trainY = trainY[-missing_index, ]
  trainX = trainX[-missing_index, ]
}
# Standardization of the predictors (the count response is left untransformed)
y.train = trainY$tamoxifen
x.train = scale(as.matrix(trainX))
df.train = cbind.data.frame(y.train, x.train)
# Horseshoe
set.seed(1234)
fit.hs = bayesreg(y.train ~ ., df.train, model = "poisson", prior = "hs", burnin = 10000)
mse.hs = fit.hs$rootmse
# Horseshoe plus
fit.hsplus = bayesreg(y.train ~ ., df.train, model = "poisson", prior = "horseshoe+", burnin = 10000)
mse.hsplus = fit.hsplus$rootmse
# Bayesian Ridge
fit.br = bayesreg(y.train ~ ., df.train, model = "poisson", prior = "ridge", burnin = 10000)
mse.br = fit.br$rootmse
# Bayesian LASSO
fit.bl = bayesreg(y.train ~ ., df.train, model = "poisson", prior = "lasso", burnin = 10000)
mse.bl = fit.bl$rootmse
mse.val = rbind(mse.hs, mse.hsplus, mse.br, mse.bl)
colnames(mse.val) = "RMSE"
row.names(mse.val) = c("Horseshoe", "Horseshoe+", "Bayesian Ridge", "Bayesian LASSO")
mse.val
```
# Shrinkage prior with survival response
For the survival outcome illustration, we make use of a synthetic microbiome dataset of 100 samples and 353 features available from [Koh et al. (2018)](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4599-8), in which the real microbial composition of non-obese diabetic (NOD) mice is used as a template to generate synthetic counts along with a survival outcome. To reduce the effect of zero-inflation in the microbiome data, features with no variance or with more than 50% zeros are removed, leaving 50 features for the final analysis.
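A sketch of this filtering step (a minimal illustration with a hypothetical count matrix `otu`, samples in rows and features in columns; the actual pre-processing follows the original publication):
```{R eval = FALSE}
# Drop zero-variance features and features with more than 50% zeros
keep = apply(otu, 2, var) > 0 & colMeans(otu == 0) <= 0.5
otu = otu[, keep]
```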
We consider the least squares approximation (LSA) transformation to convert the survival response into a continuous outcome. This is because, as of this writing, no publicly available implementation exists for fitting high-dimensional Cox regression with BSMs. In our example, LSA is feasible due to the large sample size, but it should be used with caution when the sample size is limited, as the method relies heavily on large-sample theory.
In the illustration below, we start from an already LSA-transformed response (analysis codes to perform the LSA pre-processing step are available from <https://github.com/himelmallick/bsmTutorial>). We quantify prediction accuracy with Uno's C-statistic, a C-index appropriate for right-censored survival data, computed here from the posterior-mean predictions. We use the `Est.Cval` function from the R package `survC1` to carry out this calculation.
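For intuition, the LSA step just described can be sketched as follows: fit an unpenalized Cox model, then convert its estimate and covariance into an equivalent least-squares problem that any continuous-response BSM can consume. This is a minimal sketch with a hypothetical data frame `surv_df` (columns `time`, `status`, and covariates), and it assumes the sample size is large enough to fit the Cox model; the repository code linked above is the authoritative version.
```{R eval = FALSE}
# Minimal LSA sketch: ||y.lsa - x.lsa %*% beta||^2 approximates
# (beta - beta.hat)' Sigma^{-1} (beta - beta.hat) from the Cox fit
library(survival)
fit.cox = coxph(Surv(time, status) ~ ., data = surv_df)  # hypothetical surv_df
beta.hat = coef(fit.cox)
R = chol(solve(vcov(fit.cox)))     # so that t(R) %*% R = Sigma^{-1}
x.lsa = R                          # pseudo design matrix
y.lsa = as.vector(R %*% beta.hat)  # pseudo continuous response for the BSM fit
```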
```{R include = T}
# Load data
load_github_data("https://github.com/himelmallick/bspTutorial/blob/master/data/Survival.RData")
trainX = pcl$trainX
trainY = pcl$trainY
# Sanity Check
all(rownames(trainX) == rownames(trainY))
# Standardization
y.train = trainY$trainY - mean(trainY$trainY)
x.train = scale(as.matrix(trainX))
df.train = cbind.data.frame(y.train, x.train)
# Helper: Uno's C-index from posterior-mean predictions
# (Est.Cval with nofit = TRUE takes a single risk-score column, so we
# summarize the posterior draws of beta by their mean before predicting)
get_cindex = function(fit) {
  yhat = as.matrix(pcl$original_X_standardized) %*% rowMeans(fit$beta)
  mydata = data.frame(as.matrix(pcl$original_Y), yhat)
  Est.Cval(mydata, tau = 2000, nofit = TRUE)$Dhat
}
# Horseshoe
set.seed(1234)
fit.hs = bayesreg(y.train ~ ., df.train, prior = "hs", burnin = 10000)
cindex.valhs = get_cindex(fit.hs)
# Horseshoe+
fit.hsplus = bayesreg(y.train ~ ., df.train, prior = "horseshoe+", burnin = 10000)
cindex.valhsplus = get_cindex(fit.hsplus)
# Bayesian Ridge
fit.br = bayesreg(y.train ~ ., df.train, prior = "ridge", burnin = 10000)
cindex.valbr = get_cindex(fit.br)
# Bayesian LASSO
fit.bl = bayesreg(y.train ~ ., df.train, prior = "lasso", burnin = 10000)
cindex.valbl = get_cindex(fit.bl)
cindex.val = rbind(cindex.valhs, cindex.valhsplus, cindex.valbr, cindex.valbl)
colnames(cindex.val) = "C-index"
row.names(cindex.val) = c("Horseshoe", "Horseshoe+", "Bayesian Ridge", "Bayesian LASSO")
cindex.val
```
# Conclusions
In summary, BSMs have steadily gained popularity over the last two decades, especially as a flexible and powerful alternative to frequentist penalized regression approaches. We hope that this tutorial serves as a quick-start reference for preclinical researchers and practitioners in related fields interested in applying BSMs to a variety of problems.
# References
Huang EW, Bhope A, Lim J, Sinha S, Emad A (2020) [Tissue-guided LASSO for prediction of clinical drug response using preclinical samples](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007607). *PLoS Computational Biology* 16(1): e1007607. <https://doi.org/10.1371/journal.pcbi.1007607>
Koh H, Livanos AE, Blaser MJ, Li H (2018). [A highly adaptive microbiome-based association test for survival traits](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-018-4599-8#citeas). *BMC Genomics* 19, 210. <https://doi.org/10.1186/s12864-018-4599-8>
# Citation
Paul E, Bhattacharyya A, Baumgartner R, Mallick H (2023+). [Bayesian Shrinkage Models with Applications to Preclinical Research](https://github.com/himelmallick/bsmTutorial). *Statistics in Medicine* (In Submission).