-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathanalyze_income_final.Rmd
187 lines (122 loc) · 8.14 KB
/
analyze_income_final.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
---
title: "Income Analysis"
author: "Karl Stavem"
date: "3/15/2020"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
rm(list = ls())
library(gplots)
library(ggplot2)
source('clean_data_file.R')
```
# Income
With regard to income, we looked at three separate questions.
```{r}
#### Initial Set Up
#remove NAs from the relationship quality column
df <- df[is.na(df$RELATIONSHIP_QUALITY)==FALSE,]
# clean up the income column - grab the number associated with each category
df$income_val <- as.numeric(substr(df$PPINCIMP, 2, 3))
# clean up the relationship quality column - grab the number associated with each category
df$relationship_val <- as.numeric(substr(df$RELATIONSHIP_QUALITY, 2, 2))
# create a boolean to track if couples make $75,000 or more
df$income_greater_than_75k <- as.numeric(ifelse(df$income_val >= 14, 1, 0))
# split into two groups
more.than.75k.group <- df$relationship_val[df$income_greater_than_75k == 1]
less.than.75k.group <- df$relationship_val[df$income_greater_than_75k == 0]
```
_Income Model - One:_
Our first income model attempted to answer the basic question, “Does total household income play a significant role in the relationship quality that couples report?”
Rather than continuous dollar values, the income information from our dataset was reported in categorical ranges and binned into 19 separate buckets. Viewing the dataset as a whole we noted that income was approximately normally distributed, centered around the \$60,000-\$75,000 salary range.
```{r}
ggplot(df, aes(PPINCIMP)) + geom_bar() + theme_bw() + xlab("Household Income") + coord_flip()
```
The mean relationship quality for each of these income ranges appeared to suggest that increased income was associated with an increased relationship score.
```{r}
plotmeans(relationship_val ~ income_val,
data = df,
frame = FALSE,
xlab = "Income Level",
ylab = "Relationship Quality",
barcol = "lightgray",
mean.labels = FALSE,
connect = FALSE,
n.label = FALSE,
main="Mean Relationship Quality vs. Income Level")
```
In order to test this relationship, we set up an ANOVA model with total household income as the predictor variable and relationship quality as the outcome variable. The ANOVA model was a logical choice for this dataset given its approximate normality and nearly constant variance. Within our model, income was treated as a factor with a separate mean relationship quality score fitted to each of the income groups. Our null hypothesis asserted that mean relationship quality did not depend on household income. Written formally as,
$$H_0: \mu_1 = \mu_2 = \dots = \mu_{19} $$
where $\mu_1 \dots \mu_{19}$ represent the mean relationship quality in each of the 19 income groups. After running the model, it was not entirely surprising to see a very small p-value of, 0.000271. This showed strong evidence against the null hypothesis and would seem to suggest that total household income is a significant factor in relationship quality.
```{r}
model7 <- aov(relationship_val ~ factor(income_val), data = df)
summary.model7 <- summary(model7)
summary.model7
```
_Income MOdel - Two_
With our second income model we attempted to answer the question, “Do couples with income above a certain threshold report higher relationship quality than those that fall below?”
The rationale behind this question came from a study in the Huffington post. That study claimed that a point exists at which income no longer provides happiness in the same measure. For less wealthy individuals, every additional dollar of income tends to increase happiness. However, once income reaches a certain threshold, each additional dollar doesn’t produce as much joy in the earner. The happiness threshold cited in that article was roughly \$75,000 per year. We set our threshold here and examined the article’s theory.
We divided our respondents into two disjoint subgroups: Those couples that earned above \$75,000 per year and those that earned below \$75,000. After isolating our respondents, we saw approximately 1,000 couples in the above \$75k group and approximately 1,900 in the below \$75,000 group.
```{r}
# look at number in each group
length(more.than.75k.group)
length(less.than.75k.group)
# difference in size
length(less.than.75k.group) - length(more.than.75k.group)
# show the sizes
barplot(table(df$income_greater_than_75k),
main = 'Combined Income',
xlab = 'Income Above $75k',
ylim = c(0,2000),
ylab = 'Count'
)
```
When we analyzed the data, we noted that the sample mean for those above $75,000 was slightly higher than those below.
```{r}
mean(more.than.75k.group)
mean(less.than.75k.group)
```
Our null hypothesis asserted that mean relationship quality is no different between the two groups. Our alternative hypothesis stated that mean relationship quality would be greater for those in the higher income group. This is because individuals earning above $75k have essentially maximized potential happiness, while those below still have room to improve. These two statements can be written formally as,
$$ H_0: \mu_1 = \mu_0 \\ H_A: \mu_1 > \mu_0 $$
We examined these statements using a one-sided Welch T-Test. After running the model, we noted a small p-value of 0.0003428 demonstrating sufficient evidence to reject the null hypothesis. Evidence seems to suggest that mean relationship quality is greater for those earning above \$75,000.
```{r}
# look at the results of the test
model1 <- t.test(more.than.75k.group, less.than.75k.group, var.equal = FALSE, alternative = 'g')
model1
```
_Income Model - Three_
In our third model we looked at relationships in which both partners received significantly different salaries. We studied whether this income disparity affected the quality of the relationship.
Within the survey, respondents were asked to answer the following question: _“Between you and [partner_name], who earned more income in 2008?”_ Respondents answered with one of three options:
1. I earned more.
2. We earned about the same.
3. Partner earned more.
It is important to note that answers were based on feelings about income, rather than specific numbers. Using these subjective responses as the basis for this model, we divided the sample into two disjoint groups:
* No Income Disparity: We earned about the same amount.
* Income Disparity: I earned more. & Partner earned more.
Since all respondents fell naturally into one of these distinct groups, we created a simple binary variable to store this information _(1 = Income Disparity, 0 = No Income Disparity)_. This binary parameter instinctively lead to a logistic regression model; however, it required that we format our question in a particular way,
_“Does relationship quality tell us anything about the odds of having an income disparity between partners?”_
A logistic model holds three additional properties which were also necessary for this data set:
* Logistic regression does not require a linear relationship.
* The residuals of a logistic model do not need to be normally distributed.
* Homoscedasticity is not required.
Looking at the residuals of this model, we clearly do not have a normal distribution.
```{r}
model2 <- lm(relationship_val ~ diff_in_income, data=df)
model2.summary <- summary(model2)
x <- model2$residuals
h<-hist(x, xlab="Model 2 Residuals", ylim=c(0,1500),xlim=c(-3,2),
main="")
xfit<-seq(min(x),1,length=50)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="red", lwd=2)
```
After running the logistic model, we arrived at the following results.
```{r}
model5 <- summary(glm(diff_in_income ~ relationship_val, data=df, family = binomial))
model5$coef
model5.coeff <- exp(model5$coefficients['relationship_val', 'Estimate'])
model5.coeff
```
We calculated an exponentiated coefficient of 0.881, which would suggest that for each unit increase in relationship quality, the odds of having an income disparity in the relationship decrease by roughly 12%. However, the p-value in this model was quite high and suggested that this likely is not a highly significant result.