-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmetrics_khurana_A1_01202025.Rmd
188 lines (151 loc) · 7.03 KB
/
metrics_khurana_A1_01202025.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
---
title: "A1: Campus mandate"
author: "Havisha Khurana"
date: "`r Sys.Date()`"
output:
html_document:
code_folding: hide
theme: journal
highlight: tango
toc: true
---
## Task
**Assignment** Propose and simulate a data-generating process in which (i) students who live on campus in their freshman year tend to have better outcomes, but at the same time (ii) the causal effect of living on campus is negative.
- Up to you... whether you'd like to have the causal effect be negative for all students, or positive for those who do select into living on campus and negative for those who do not—this would actually make sense in a model where students selected into living accommodations with knowledge of what suited them best.
* Maybe allow for heterogeneous treatment effects if you have time? We will talk about heterogeneous treatment effects, so having a DGP at the ready would be useful.
- Complete this task in a markdown file and make the argument that's relevant to the policy makers visually (e.g., likely something that starts with `ggplot`).
Let's model this as an OVB problem, where the true relationship is captured in some $y = f(x_{1}, x_{2}, T),$ where $T_i=1$ if student $i$ lives on campus, and is zero otherwise. The range of $y$ should make sense to the audience—I'll leave it to you to decide what to do about that. (Maybe make $x_1$ continuous and $x_2$ discrete, just to have the potential for both?)
In order to introduce the potential for non-random treatment (i.e., on-campus living) let's assume that the linear combination $Z = 1 + a_1x_1 + a_2x_2$ determines the probability each individual lives on campus, such that $$Pr[T=1] = \frac{1}{1+e^{-Z}}~.$$ In that way, $\{a_1, a_2\} > 0$ will imply that "better" students live on-campus, on average—positive selection into treatment. (_While there are many reasonable ways to model this, we'll see some good intuition unfold following this setup._)
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE,
message = FALSE,
error = FALSE)
```
## Negative treatment effect for all
```{r load libraries}
library(tidyverse)
library(modelsummary)
library(patchwork)
```
### DGP
``` {r dgp, echo=T}
# negative causal effect of living on campus: -0.5
dgp_negative <- map(1:100, ~ {
set.seed(.x)
tibble(
# normally-distributed random continuous variable that affects gpa, such as the standardized SAT score
x_1 = rnorm(100),
# uniformly-distributed random discrete variable that affects gpa, such as low-risk for alcohol
x_2 = sample(c(1, 0), 100, replace = TRUE),
# a random composite variable that affects the probability of living on campus
z = 1 + x_1 + x_2,
# and scale z so we get a good coverage of t
z_scale = scale(z),
# probability of living on campus, as a logistic function
t = 1 / (1 + exp(-z_scale)),
# transforming probability into a binary variable with balanced sample
t_binary = ifelse(t > median(t), 1, 0),
# add random noise
e = rnorm(100, mean = 0, sd = 0.2),
#outcome as a function of x_1, x_2, and t - adding a constant so that the range of gpa makes sense
# scaling x_1 + x_2 so that the mean GPA is 3 with a standard deviation of 0.6
# the effect of living on campus is -0.5
y = ((x_1 + x_2 - mean(x_1 + x_2)) * 0.6) / sd(x_1 + x_2) + 3.3 - 0.5 *
t_binary + e
) |>
mutate(#ensuring that the re-scaled variable doesn't exceed 4.3
y = ifelse(y > 4.3, 4.3, y))
})
```
``` {r choose_iteration, echo=T}
# finding an iteration where students who live on campus in their freshman year tend to have better outcomes, but at the same time (ii) the estimated causal effect of living on campus is negative
max_positive_iter <- map_dbl(dgp_negative, ~ coefficients(lm(y ~ factor(t_binary), data = .x))[2]) |>
which.max()
# isolate the max_iter dataframe
dgp_max_positive <- dgp_negative[[max_positive_iter]]
```
### Descriptives
```{r}
gpa_plot <- ggplot(dgp_max_positive,
aes(x = factor(t_binary), y = y)) +
geom_boxplot(fill = "transparent") +
geom_jitter(color = "cornflowerblue", width = 0.1) +
stat_summary(
aes(group = factor(t_binary), label = round(..y.., 1)),
fun = "median", geom = "label", color = "grey20", vjust = -0.3
)+
scale_x_discrete(labels = c("No", "Yes")) +
theme_minimal()+
labs(
x = "On-campus Living in Freshmen year?",
y = "GPA",
) +
theme(
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "gray96", color = "transparent")
)
x1_plot <- ggplot(dgp_max_positive,
aes(x = factor(t_binary), y = (x_1*50 + 1200))) +
geom_boxplot(fill = "transparent") +
geom_jitter(color = "cornflowerblue", width = 0.1, size = 0.5) +
stat_summary(
aes(group = factor(t_binary), label = round(..y..)),
fun = "median", geom = "label", color = "grey20", vjust = -0.3, size = 2
)+
scale_x_discrete(labels = c("No", "Yes")) +
theme_minimal()+
labs(
#x = "Lived on-campus in Freshmen year?",
y = "x_1\n(SAT scores)",
) +
theme(
axis.title.x = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "gray96", color = "transparent")
)
x2_plot <- ggplot(filter(dgp_max_positive, x_2 == 0),
aes(x = factor(t_binary))) +
geom_bar(fill = "darkorange") +
scale_x_discrete(labels = c("No", "Yes")) +
theme_minimal()+
labs(
x = "On-campus Living?",
y = "#Count of x_2\n(High-risk for alcohol-addiction)",
) +
theme(
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "gray96", color = "transparent")
)
```
```{r}
gpa_plot + (x1_plot / x2_plot) +
plot_layout(widths = c(2, 1)) +
plot_annotation(
title = "Positive Selection into Living On-Campus",
subtitle = str_wrap("Students living on campus have higher undergraduate GPAs than students living off-camppus, BUT they also have higher SAT scores and a lower likelihood of alcohol addiction"),
caption = "Number represents medians.\nSimulated data with positive selection into treatment and negative treatment effect."
) &
theme(
plot.background = element_rect(fill = "gray96", color = "transparent")
)
```
### Regression analysis
```{r}
# model the problem? communicate results?
model_binary <- lm(y ~ t_binary,
data = dgp_max_positive)
model_multi <- lm(y ~ t_binary + x_1 + x_2,
data = dgp_max_positive)
modelsummary(
list(model_binary, model_multi),
stars = TRUE,
coef_rename = c(t_binary = "Lived on-campus",
x_1 = "Standardized SAT scores",
x_2 = "Low risk of alcohol-addiction"),
gof_omit = "AIC|BIC|R2 |Log.|RMSE",
title = "Relationship between Undergrad GPA and Living On-Campus",
notes = "Positive estimate of living on-campus turns negative after adjusting for SAT scores and alcohol-addiction risk."
)
```
Positive estimate of living on-campus turns negative after adjusting for SAT scores and alcohol-addiction risk.