-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathAssignment4work.Rmd
479 lines (385 loc) · 19 KB
/
Assignment4work.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
---
title: "Smoothing splines and modelling: tricep skin thickness and cancer mortality case studies"
author: "Cimmaron Yeoman"
date: "2023-03-17"
output:
pdf_document: default
word_document: default
header-includes:
- \usepackage{geometry} \geometry{top=0.75in,left=1in,bottom=1in,right=1in}
- \usepackage{sectsty}
- \allsectionsfont{\color{blue}}
html_document: default
---
## Part 1: Smoothing and Triceps Data
The first section of this report evaluated the tricep skinfold thickness of 892 Gambian women in three West African villages, over 50 years. The *triceps brachii* is a muscle situated on the back of the the arm, and includes a lateral, long, and medial head (see Figure 1). The data set includes the tricep skinfold thickness measurements of the Gambian women, the log of the tricep skinfold thickness measurements, and the age of the women. The tricep skinfold thickness was predicted as a function of age, using a smoothing model.
**Figure 1**
*Triceps brachii anterior and posterior view*
```{r, out.width= "150px", out.height= "200px", echo = FALSE}
library(knitr)
library(Matrix)
knitr::include_graphics(c("Triceps_ant.png", "Triceps_post.png"))
```
*Note.* This figure highlights the heads of the *triceps brachii* on the human body.
The long head was highlighted in red, the lateral head in yellow, and the medial head in green.
These images provide an anterior and posterior view of the *triceps brachii* muscles on the
human skeleton.
```{r, echo = FALSE, results = 'hide', message = FALSE, warning = FALSE}
library(readr)
triceps_ds <- read_csv("triceps.csv")
```
## *Triceps brachii* data
### Data set and missing observations
The triceps data set (**triceps2**) was evaluated for any missing variables. There were no
missing observations and all cases were complete. The data set was renamed to **triceps2**.
This will be the working data set for Part 1 of this report.
```{r}
triceps2 <- triceps_ds
triceps2 <- triceps2[complete.cases(triceps2), ]
```
## Visualizing the data set: scatter plots
**Figure 2**
\vspace*{-15mm}
```{r, echo = FALSE, out.width = "75%", out.height= "75%"}
plot(triceps2$age, triceps2$triceps, xlab = "Age in years",
ylab = "Triceps skinfold thickness", col = "gray20", cex = 0.7)
abline(lm(triceps2$triceps ~ triceps2$age), col = "red", lwd = 2)
```
**Figure 3**
\vspace*{-15mm}
```{r, echo = FALSE, out.width = "75%", out.height= "75%"}
plot(triceps2$age, triceps2$lntriceps, xlab = "Age in years",
ylab = "ln of triceps skinfold thickness", col = "gray20", cex = 0.7)
abline(lm(triceps2$lntriceps ~ triceps2$age), col = "orange", lwd = 2)
```
Both the **triceps** and **lntriceps** (log of tricep skinfold thickness) did not fit a
simple linear regression well. When plotted with **age** on the x-axis, both variables
showed distinct curves and a non-linear pattern.
## Smoothing spline models: tricep skin thickness and age
A model was made using the smooth.spline function for both the tricep skinfold thickness and
the log of tricep skinfold thickness. Cross-validation was used to determine the equivalent
degrees of freedom.
```{r, warning = FALSE, results = 'hide'}
library(splines)
mod_ss <- smooth.spline(triceps2$age, triceps2$triceps, cv = TRUE)
mod_ss_ln <- smooth.spline(triceps2$age, triceps2$lntriceps, cv = TRUE)
```
```{r, include = FALSE}
mod_ss
mod_ss_ln
```
## Smooth spline plots: tricep skin thickness or log of tricep thickness and age
**Figure 4**
\vspace*{-15mm}
```{r, echo = FALSE, out.width = "75%", out.height= "75%"}
plot(triceps2$age, triceps2$triceps, xlab = "Age in years",
ylab = "Tricep skinfold thickness", col = "gray20", cex = 0.7)
lines(mod_ss, col = "red", lwd = 2)
legend("topleft", legend = c("7.9 DF"), col = "red",
lty = 1, lwd = 2 , cex = 0.8)
```
**Figure 5**
\vspace*{-15mm}
```{r, echo = FALSE, out.width="75%", out.height = "75%"}
plot(triceps2$age, triceps2$lntriceps, xlab = "Age in years",
ylab = "ln of tricep skinfold thickness", col = "gray20", cex = 0.7)
lines(mod_ss_ln, col = "orange", lwd = 2)
legend("topleft", legend = c("10.6 DF"), col = "orange",
lty = 1, lwd = 2 , cex = 0.8)
```
Using the smooth.spline function with cross-validation included, the smoothing parameter for the
**triceps** ~ **age** model had an equivalent degrees of freedom of **7.9**, and the **lntriceps** ~ **age**
model had an equivalent degrees of freedom of **10.6**. Lambda was about **0.005** for the first model and
**0.001** for the second model; the curve for the first model was slightly smoother. Neither of the
smoothing splines appeared to over or underfit the data.
## Smooth splines with different *df* (log of tricep skin thickness)
**Figure 6**
\vspace*{-15mm}
```{r echo=FALSE, warning=FALSE, out.width = "75%", out.height= "75%"}
mod_ss_ln <- smooth.spline(triceps2$age, triceps2$lntriceps, cv = TRUE)
mod_ln2 <- smooth.spline(triceps2$age, triceps2$lntriceps, df = 14)
mod_ln3 <- smooth.spline(triceps2$age, triceps2$lntriceps, df = 40)
plot(triceps2$age, triceps2$lntriceps, xlab = "Age in years",
ylab = "ln of tricep skin thickness", col = "gray20", cex = 0.7)
lines(mod_ss_ln, col = "orangered2", lwd = 2)
lines(mod_ln2, col = "steelblue", lwd = 2)
lines(mod_ln3, col = "olivedrab", lwd = 2)
legend("topleft", legend = c("10.6 DF", "14 DF", "40 DF"), col = c("orangered2",
"steelblue", "olivedrab"),lty = 1 , lwd = 2 , cex = 0.8)
```
Using the same smooth.spline function, different degrees of freedom were tested with the **lntriceps** ~ **age**
model. As the degrees of freedom increased, the smoothing spline became more bumpy and the lambda values
decreased to much smaller values. The Figure 6 *df* = **14** smoothing spline looked visually similar to
the *df* = **10.6** smoothing spline in Figure 5. The *df* = **40** smoothing spline of Figure 6 was very
rough and clearly overfitting the data.
```{r, include = FALSE }
mod_ln2
mod_ln3
```
## Smoothing splines with different *df* (tricep skin thickness)
**Figure 7**
\vspace*{-15mm}
```{r, warning = FALSE, echo = FALSE, out.width = "75%", out.height = "75%"}
library(splines)
mod_ss <- smooth.spline(triceps2$age, triceps2$triceps, cv = TRUE)
mod_2 <- smooth.spline(triceps2$age, triceps2$triceps, df = 5)
plot(triceps2$age, triceps2$triceps, xlab = "Age in years",
ylab = "Tricep skin thickness", col = "gray20", cex = 0.7)
lines(mod_ss, col = "red", lwd = 2)
lines(mod_2, col = "mediumblue", lwd = 2)
legend("topleft", legend = c("7.9 DF", "5 DF"), col = c("red", "mediumblue"),
lty = 1 , lwd = 2 , cex = 0.8)
```
**Figure 8**
\vspace*{-15mm}
```{r, warning = FALSE, echo = FALSE, out.width = "75%", out.height = "75%" }
mod_3 <- smooth.spline(triceps2$age, triceps2$triceps, df = 40)
mod_4 <- smooth.spline(triceps2$age, triceps2$triceps, df = 18)
plot(triceps2$age, triceps2$triceps, xlab = "Age in years",
ylab = "Tricep skin thickness", col = "gray20", cex = 0.7)
lines(mod_3, col = "seagreen4", lwd = 2)
lines(mod_4, col = "mediumpurple4", lwd = 2)
legend("topleft", legend = c("40 DF", "18 DF"), col = c ("seagreen4",
"mediumpurple4"), lty = 1 , lwd = 2 , cex = 0.8)
```
Similar to the plots with **lntriceps**, the higher the *df*, the rougher the smoothing splines
for **triceps** ~ **age** model appeared, with both splines in Figure 8 overfitting the data. In
Figure 7, the blue *df* = 5 smoothing splines appeared to slightly underfit the data while the
*df* = 7.9 smoothing spline had a better fit.
## Conclusion
The smoothing splines selected that fit the **triceps** ~ **age** or **lntriceps** ~ **age**
models showed the practicality of fitting splines to data sets, and the potential for their
use in predictive models. Very high **df** smoothing splines seem to overfit data sets in most
cases, even when they have clear s-shaped curves and bends.
## Part 2: Cancer data set and lasso regression
Cancer is one of the leading causes of death in the United States and across the globe. The disease
exists in various forms, and millions of people, both young and old, receive a cancer diagnosis
each year. This report contains a United States focused data set, consisting of health and socioeconomic
factors from 3047 counties. The response variable or variable of interest was **TARGET_deathRate**,
the mean per capita (100,000) cancer mortalities.
The goal of this analysis was to identify relevant predictor variables that could be
candidates for a smooth relationship. A model with smoothing was compared to a basic
model. Both natural cubic splines and the smoothing.spline package were used. Variable selection
was performed using the lasso regression method.
```{r, include = FALSE}
library(glmnet)
library(readr)
```
```{r}
cancer_L <- read_csv("C:/Users/cimmy/Documents/2023WI-3561H/Assignment4/cancer_reg.csv",
col_types = cols(Geography = col_skip()))
cancer_L$binnedInc <- as.factor(cancer_L$binnedInc)
```
The data set was imported with the **Geography** or county name variable removed and the
**binnedInc** or binned income variable included as a factor.
## Removing outliers
```{r, include = FALSE}
summary(cancer_L)
```
The **MedianAge** variable had extreme values in the hundreds; to avoid some level of error in
the data analysis, any values over 100 were removed. A median age of 100 would still be considered
extreme and reviewing variable observations in future research is advised.
```{r}
cancer_L$MedianAge[cancer_L$MedianAge > 100] <- NA
```
**Figure 9**
\vspace*{-15mm}
```{r, echo = FALSE, out.width = "75%", out.height="75%"}
hist(cancer_L$MedianAge, main = NULL, xlab = "Median age", col = "gray24")
```
*Note*. This histogram shows the corrected distribution of **MedianAge** values after outlier values
were removed.
### Working data set
The final step before performing lasso regression was evaluation for missing observations.
There were no missing observations detected, all cases were complete. The data set was renamed
to **can_L**. This will be the working data set for Part 2 of this report.
```{r}
can_L <- cancer_L[complete.cases(cancer_L), ]
```
## Lasso regression
A test and train set were created from the **can_L** data set.
```{r}
set.seed(56)
trainC <- sample(x = 1:584, size = 292, replace = FALSE)
testC <- (-trainC)
x <- model.matrix ( TARGET_deathRate ~ ., data = can_L) [, -1]
y <- can_L$TARGET_deathRate
```
### Testing a simple linear regression model
```{r, results = 'hide'}
modC1 <- lm(TARGET_deathRate ~ ., data = can_L, subset = trainC)
summary(modC1)
modC1 <- summary(modC1)$adj.r.squared * 100
```
A basic model using the **trainC** subset had an adjusted $R^2$ of `r round(modC1, 1)`% variation explained with 32
predictor variables included. A majority of the variables did not appear to have a significant relationship with the
response variable. Lasso regression will eliminate the less relevant variables before the smooth relationships are
explored.
### Creating a grid of lambdas
**Figure 10**
\vspace*{-5mm}
```{r, warning=FALSE, echo = FALSE, out.width = "75%", out.height ="75%"}
grid <- 10^seq(from = 6, to = -2, length.out = 100)
modC_lasso <- glmnet(x, y, alpha = 1, lambda = grid)
plot(modC_lasso)
```
As the lambda values on the x-axis increase, the larger penalty forces the coefficients of
the less relevant predictor to zero. This is the variable selection action of lasso, which will
help identify the predictors variables for a model under a particular penalty value.
### Selecting lambda
```{r, results = 'hide'}
set.seed(76)
cv_outL <- cv.glmnet(x[trainC, ],
y[trainC],
alpha = 1)
lambda <- cv_outL$lambda.min
lambda
```
The minimum lambda was **1.52** after rounding.
**Figure 11**
\vspace*{-5mm}
```{r, echo = FALSE, out.width = "75%", out.height = "75%"}
plot(cv_outL)
```
The lambda fell between the the two dashed lines on the plot above: this is where the
MSE of prediction was minimized.
### Testing a model and calculating the MSE
```{r, results = 'hide'}
modC_lasso <- glmnet(x[trainC, ], y[trainC], alpha = 1,
lambda = grid, thresh = 1e-10)
best_predL <- predict(modC_lasso, s = lambda, newx = x[testC, ])
mean((best_predL - y[testC])^2) #MSE calculation
```
When lambda was at **1.52** the MSE was **492.1**.
### Eliminating variables
```{r, echo = TRUE, results = 'hide'}
modC_lasso$lambda
coefficients(modC_lasso)[, 73]
```
The 73rd term of **1.519** was the most similar to the minimum lambda value of **1.52**. Many of the
variables were zeroed and therefore could be eliminated from the subset selected for modelling.
### Variable subset
```{r, results = 'hide'}
which(abs(coefficients(modC_lasso)[, 73]) >= 1e-10)
```
The lasso regression identified 13 variables for the subset which could be used in modelling.
From this list, **PctBlack**, **PctPrivateCoverage**, and **medIncome** were selected for the
smooths and models.
## Selecting variables to smooth
A scatter plot matrix showed that many of the predictor variables had a significant positive or
negative linear relationship with the **TARGET_deathRate** response variable. Smoothing did not
appear to be necessary for many variables, and doing so would likely overfit the data. The variables
**PctBlack**, **PctPrivateCoverage**, and **medIncome** were selected, as the pattern they formed when
plotted against the response variable showed slight bends or curves, potentially suitable for smoothing.
```{r, warning = FALSE, include = FALSE}
canF <- can_L[, c("TARGET_deathRate", "PctBlack", "PctPrivateCoverage", "medIncome")]
#extracted variables
canF <- canF[complete.cases(canF), ]
```
```{r, include = FALSE}
library(splines)
```
```{r, warning = FALSE}
A <- seq(from = 0, to = 85)
x1 <- canF$PctBlack
y1 <- canF$TARGET_deathRate
mod_SSA <- smooth.spline(canF$PctBlack, canF$TARGET_deathRate, cv = TRUE)
smooth1 <- glm(y1 ~ ns(x1, df = 5))
```
This code chunk provides an example of the formulas used to create the models and splines. For
each variable, a natural cubic spline and smoothing spline model was made.
```{r, warning = FALSE, include = FALSE}
B <- seq(from = 20, to = 90)
x2 <- canF$PctPrivateCoverage
y2 <- canF$TARGET_deathRate
mod_SSB <- smooth.spline(canF$PctPrivateCoverage, canF$TARGET_deathRate,
cv = TRUE)
smooth2 <- glm(y2 ~ ns(x2, df = 3))
```
```{r, warning = FALSE, include = FALSE}
C <- seq(from = 23000, to = 97279)
x3 <- canF$medIncome
y3 <- canF$TARGET_deathRate
mod_SSC <- smooth.spline(canF$medIncome, canF$TARGET_deathRate, cv = TRUE)
smooth3 <- glm(y3 ~ ns(x3, df = 5))
```
## Plotted natural cubic splines and smoothing splines
**Figure 12**
\vspace*{-15mm}
```{r, echo = FALSE, out.width = "75%", out.width="75%"}
plot(canF$PctBlack, canF$TARGET_deathRate, col = "gray20", cex = 0.7,
xlab = "% of Black population", ylab = "Mean cancer deaths (per capita)",
ylim = c(0,350))
lines(A, predict(smooth1, newdata = data.frame(x1=A)), col = "red", lwd=2)
lines(mod_SSA, col = "blue", lwd = 2)
legend("top", legend = c("DF = 5", "DF = 5.1"), col = c("red", col = "blue"),
lty = 1, lwd = 2 , cex = 0.8)
```
Figure 12 visually had a lot of scatter and a curved, s-shaped pattern that continued along
the x-axis. A majority of the data was grouped between 0-10% on the x-axis.
Fitting a spline with a higher degrees of freedom or specified knots along the s-curve would
look more aesthetically pleasing, but this would overfit the data. The s-shaped data points
did not appear dense enough to justify a rough smoothing spline. The grouping of the data was
somewhat unusual; perhaps there were some missing observations for the **PctBlack** variable in
this data set.
**Figure 13**
\vspace*{-15mm}
```{r, echo = FALSE, out.width = "75%", out.width="75%"}
plot(canF$PctPrivateCoverage, canF$TARGET_deathRate, col = "gray20", cex = 0.7,
xlab = "% of population with private health insurance",
ylab = "Mean cancer deaths (per capita)", ylim = c(0, 350))
lines(B, predict(smooth2, newdata = data.frame(x2=B)), col = "red", lwd=2)
lines(mod_SSB, col = "blue", lwd = 2)
legend("topright", legend = c("DF = 3", "DF = 5.1"), col = c("red", "blue"),
lty = 1, lwd = 2 , cex = 0.8)
```
Figure 13 appeared to have an l-shaped bend where the cancer deaths peaked when the x-value percentages
were between the 55-65% range. After this maximum point, the cancer deaths droped as more of the population
reported having private health insurance. The natural cubic spline selected a *df* of 3, and the smoothing spline
had a *df* of 5. Either of the degrees of freedom seemed suitable, neither under or overfitting the data.
**Figure 14**
\vspace*{-15mm}
```{r, echo = FALSE, out.width = "75%", out.width="75%"}
plot(canF$medIncome, canF$TARGET_deathRate, col = "gray20", cex = 0.7,
xlab = "Median income", ylab = "Mean cancer deaths (per capita)",
ylim = c(0, 350))
lines(C, predict(smooth3, newdata = data.frame(x3=C)), col = "red", lwd=2)
lines(mod_SSC, col = "blue", lwd = 2)
legend("topright", legend = c("DF = 5", "DF = 3.4"), col = c("red", "blue"),
lty = 1, lwd = 2 , cex = 0.8)
```
In the case of Figure 14, I found the lower degrees of freedom selected by the cross-validated
smoothing spline underfit the data, considering the grouping of the data points in the bottom
left corner of the scatter plot. The *df* = 5 natural cubic spline reflected the slight curved
shape of the data.
## With and without smoothing
```{r}
Nsmooth <- glm(TARGET_deathRate ~ PctBlack + PctPrivateCoverage + medIncome, data = canF)
Ysmooth <- glm(TARGET_deathRate ~ ns(PctBlack, df = 5) + ns(PctPrivateCoverage, df = 3)
+ ns(medIncome, df = 5), data = canF)
```
The main differences between these models were the coefficient values and the standard error values.
The model without smoothing had smaller coefficients (mostly under 0.5), with standard error values only
slightly higher. The model with smoothing had a greater range in coefficient values with some in the
positive and negative tens, and one in the low hundreds. The smoothing spline values were not too
ridiculous, but the predictor variables may be correlated - a scatter plot matrix of all the variables
showed that many predictors were correlated with each other.
## Conclusion
The smoothing was unnecessary to some degree. If you were to compare the splines fitting the
tricep skin thickness models in Part 1 of this report, you would see how much better the splines
fit the data. Splines could place too much emphasis on outliers or fitting every single data point
if used improperly. This would result in a poor predictive model.
```{r, echo = FALSE, include = FALSE}
summary(Nsmooth)$coefficients
```
```{r, echo = FALSE, include = FALSE}
summary(Ysmooth)$coefficients
```
```{r, include = FALSE}
mod_SSA
summary(smooth1)
mod_SSB
summary(smooth2)
mod_SSC
summary(smooth3)
```