# Regression {#regression}
## Linear Regression modeling
- _Linear Regression_ is one of the oldest and best-known predictive methods. As its name suggests, it fits a linear equation relating a dependent variable to an independent, or explanatory, variable. The independent variable $x$ is something the experimenter controls, while the dependent variable $y$ is something the experimenter measures; the fitted line is then used to predict the value of $y$ for a known value of $x$. The variable $x$ is called the predictor variable and $y$ the response variable.
- _Multiple linear regression_ uses two or more independent variables to build the model. See <https://www.wikipedia.org/wiki/Linear_regression>.
- Though first proposed about two centuries ago, it is still very useful.

- The equation takes the form $\hat{y}=b_0+b_1 x$
- The method used to choose the values $b_0$ and $b_1$ is to minimize the sum of the squares of the residual errors.
### Regression: Galton Data
Not related to software engineering, but a classic example: Galton's data on the heights of parents and their children.
```{r warning=FALSE, message=FALSE}
library(UsingR)
data(galton)
par(mfrow=c(1,2))
# distributions of child and parent heights
hist(galton$child, col="blue", breaks=100)
hist(galton$parent, col="blue", breaks=100)
# scatter plot with the fitted regression line
plot(galton$parent, galton$child, pch=1, col="blue", cex=0.4)
lm1 <- lm(galton$child ~ galton$parent)
lines(galton$parent, lm1$fitted, col="red", lwd=3)
# residuals vs predictor: there should be no trend around zero
plot(galton$parent, lm1$residuals, col="blue", pch=1, cex=0.4)
abline(c(0,0), col="red", lwd=3)
qqnorm(galton$child)
```
### Simple Linear Regression
- Given two variables $Y$ (response) and $X$ (predictor), the assumption is that there is an approximate ($\approx$) *linear* relation between those variables.
- The mathematical model of the observed data is described as (for the case of simple linear regression):
$$ Y \approx \beta_0 + \beta_1 X$$
- the parameter $\beta_0$ is named the *intercept* and $\beta_1$ is the *slope*
- Each observation can be modeled as
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i;
\epsilon_i \sim N(0,\sigma^2)$$
- $\epsilon_i$ is the *error*
- This means that, given $x_i$, the observation $y_i$ is normally distributed:
$$ y_i \sim N( \beta_0 + \beta_1 x_i, \sigma^2) $$
- The *predictions* or *estimations* of this model are obtained by a linear equation of the form $\hat{Y}=\hat{\beta}_0 + \hat{\beta}_1 X$, that is, each new prediction is computed with
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
- The actual parameters $\beta_0$ and $\beta_1$ are unknown
- The parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ of the linear equation can be estimated with different methods.
### Least Squares
- One of the most widely used methods for computing $\hat{\beta}_0$ and $\hat{\beta}_1$ is the "least squares" minimization criterion.
- The data is composed of $n$ pairs of observations $(x_i, y_i)$
- Given an observation $y_i$ and its corresponding estimate $\hat{y}_i$, the *residual* $e_i$ is defined as $$e_i = y_i - \hat{y}_i$$
- the Residual Sum of Squares is defined as $$RSS=e_1^2+\dots + e_i^2+\dots+e_n^2$$
- the Least Squares Approach minimizes the RSS
- as a result of that minimization, calculus yields the estimates $$\hat{\beta}_1=\frac{\sum_{i=1}^{n}{(x_i-\bar{x})(y_i-\bar{y})}}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$ and $$\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}$$ where $\bar{y}$ and $\bar{x}$ are the sample means.
- the variance $\sigma^2$ is estimated by
$$\hat{\sigma}^2 = RSS/(n-2)$$ where $n$ is the number of observations
- The *Residual Standard Error* is defined as $$RSE = \sqrt{{RSS}/{(n-2)}}$$
- The equation $$ Y = \beta_0 + \beta_1 X + \epsilon$$ defines the linear model, i.e., the *population regression line*
- The *least squares line* is $\hat{Y}=\hat{\beta}_0 + \hat{\beta}_1 X$
- *Confidence intervals* are computed using the *standard errors* of the intercept and the slope.
- The $95\%$ confidence interval for the slope is computed as $$[\hat{\beta}_1 - 2 \cdot SE(\hat{\beta}_1),\ \hat{\beta}_1 + 2 \cdot SE(\hat{\beta}_1)]$$
- where $$ SE(\hat{\beta}_1) = \sqrt{\frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}$$
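These closed-form estimates are easy to check numerically. A minimal sketch on simulated data (the variable names are illustrative), comparing the formulas with `lm()`:
```{r}
set.seed(42)
xs <- runif(30, 0, 10)
ys <- 2 + 0.5 * xs + rnorm(30, sd = 1)
# slope and intercept from the least squares formulas
b1 <- sum((xs - mean(xs)) * (ys - mean(ys))) / sum((xs - mean(xs))^2)
b0 <- mean(ys) - b1 * mean(xs)
c(b0, b1)            # estimates from the formulas
coef(lm(ys ~ xs))    # the same estimates computed by lm()
# Residual Standard Error: sqrt(RSS / (n - 2))
rss <- sum((ys - (b0 + b1 * xs))^2)
sqrt(rss / (length(xs) - 2))
```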
### Linear regression in R
The following are the basic commands in R:
- The basic function is `lm()`, which returns an object containing the fitted model.
- Other commands: `summary` prints information about the regression, `coef` gives the coefficients of the linear model, `fitted` gives the predicted value of $y$ for each value of $x$, and `residuals` contains the differences between the observed and fitted values.
- [`predict`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/predict.lm.html) will generate predicted values of the response for the values of the explanatory variable.
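For example, with the built-in `cars` dataset (stopping distance as a function of speed):
```{r}
fit <- lm(dist ~ speed, data = cars)
summary(fit)                       # coefficients, RSE, R^2
coef(fit)                          # intercept and slope
head(fitted(fit))                  # fitted values of the response
head(residuals(fit))               # observed minus fitted values
predict(fit, newdata = data.frame(speed = c(10, 20)))  # predictions for new speeds
```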
## Linear Regression Diagnostics
- Several plots help to evaluate the suitability of the linear regression
+ *Residuals vs fitted*: The residuals should be randomly distributed around the horizontal line representing a residual error of zero; that is, there should not be a distinct trend in the distribution of points.
+ *Normal Q-Q plot*: checks whether the residual errors are normally distributed; the points should lie close to the diagonal line.
+ *Square root of the standardized residuals vs the fitted values*: there should be no obvious trend. This plot is similar to the residuals versus fitted values plot, but it uses the square root of the standardized residuals.
+ *Leverage*: measures the importance of each point in determining the regression result. Smaller values mean that removing the observation has little effect on the regression result.
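In R, these diagnostic plots are produced by calling `plot()` on the object returned by `lm()`; continuing with the `cars` example:
```{r}
fit <- lm(dist ~ speed, data = cars)
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in one figure
plot(fit)
```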
### Simulation example
#### Simulate a dataset
```{r}
set.seed(3456)
# true model: y = -6.6 + 0.13*x + e
# x ranges from 100 to 400
a <- -6.6
b <- 0.13
num_obs <- 60
xmin <- 100
xmax <- 400
x <- sample(seq(from=xmin, to = xmax, by =1), size= num_obs, replace=FALSE)
sderror <- 9 # sigma for the error term in the model
e <- rnorm(num_obs, 0, sderror)
y <- a + b * x + e
newlm <- lm(y~x)
summary(newlm)
cfa1 <- coef(newlm)[1]
cfb2 <- coef(newlm)[2]
plot(x,y, xlab="x axis", ylab= "y axis", xlim = c(xmin, xmax), ylim = c(0,60), sub = "Line in black is the actual model")
title(main = paste("Line in blue is the Regression Line for ", num_obs, " points."))
abline(a = cfa1, b = cfb2, col= "blue", lwd=3)
abline(a = a, b = b, col= "black", lwd=1) #original line
```
##### Subset a set of points from the same sample
```{r}
# sample from the same x to compare least squares lines
# change the denominator in newsample to see how the least squares line changes accordingly.
newsample <- as.integer(num_obs/8) # number of pairs x,y
idxs_x1 <- sample(1:num_obs, size = newsample, replace = FALSE) #sample indexes
x1 <- x[idxs_x1]
e1 <- e[idxs_x1]
y1 <- a + b * x1 + e1
xy_obs <- data.frame(x1, y1)
names(xy_obs) <- c("x_obs", "y_obs")
newlm1 <- lm(y1~x1)
summary(newlm1)
cfa21 <- coef(newlm1)[1]
cfb22 <- coef(newlm1)[2]
plot(x1,y1, xlab="x axis", ylab= "y axis", xlim = c(xmin, xmax), ylim = c(0,60))
title(main = paste("New line in red with ", newsample, " points in sample"))
abline(a = a, b = b, col= "black", lwd=1) # True line
abline(a = cfa1, b = cfb2, col= "blue", lwd=1) #sample
abline(a = cfa21, b = cfb22, col= "red", lwd=2) #new line
```
##### Compute a confidence interval on the original sample regression line
```{r}
newx <- seq(xmin, xmax)
ypredicted <- predict(newlm, newdata=data.frame(x=newx), interval= "confidence", level= 0.90, se.fit = TRUE)
plot(x,y, xlab="x axis", ylab= "y axis", xlim = c(xmin, xmax), ylim = c(0,60))
abline(newlm)
# columns 2 and 3 of $fit hold the lower and upper confidence bounds
lines(newx,ypredicted$fit[,2],col="red",lty=2)
lines(newx,ypredicted$fit[,3],col="red",lty=2)
# Plot the residuals or errors
ypredicted_x <- predict(newlm, newdata=data.frame(x=x))
plot(x,y, xlab="x axis", ylab= "y axis", xlim = c(xmin, xmax), ylim = c(0,60), sub = "", pch=19, cex=0.75)
title(main = paste("Residuals or errors", num_obs, " points."))
abline(newlm)
segments(x, y, x, ypredicted_x)
```
##### Take another sample from the model and explore
```{r}
# true model: y = -6.6 + 0.13*x + e
# x ranges from 100 to 400
num_obs <- 35
xmin <- 100
xmax <- 400
x3 <- sample(seq(from=xmin, to = xmax, by =1), size= num_obs, replace=FALSE)
sderror <- 14 # sigma for the error term in the model
e3 <- rnorm(num_obs, 0, sderror)
y3 <- a + b * x3 + e3
newlm3 <- lm(y3~x3)
summary(newlm3)
cfa31 <- coef(newlm3)[1]
cfb32 <- coef(newlm3)[2]
plot(x3,y3, xlab="x axis", ylab= "y axis", xlim = c(xmin, xmax), ylim = c(0,60))
title(main = paste("Line in red is the Regression Line for ", num_obs, " points."))
abline(a = cfa31, b = cfb32, col= "red", lwd=3)
abline(a = a, b = b, col= "black", lwd=2) #original line
abline(a = cfa1, b = cfb2, col= "blue", lwd=1) #first sample
# confidence intervals for the new sample
newx <- seq(xmin, xmax)
ypredicted <- predict(newlm3, newdata=data.frame(x3=newx), interval= "confidence", level= 0.90, se.fit = TRUE)
lines(newx,ypredicted$fit[,2],col="red",lty=2, lwd=2)
lines(newx,ypredicted$fit[,3],col="red",lty=2, lwd=2)
```
### Diagnostics for assessing the regression line
#### Residual Standard Error
- It gives us an idea of the typical or average error of the model. It is the estimated standard deviation of the residuals.
#### $R^2$ statistic
- This is the proportion of variability in the data that is explained by the model: $R^2 = 1 - RSS/TSS$, where $TSS=\sum_{i=1}^{n}(y_i-\bar{y})^2$ is the total sum of squares. Values close to 1 indicate that the model explains most of the variability in the response.
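Both quantities appear in the output of `summary()` and can also be computed directly from the residuals; again with the `cars` data:
```{r}
fit <- lm(dist ~ speed, data = cars)
rss <- sum(residuals(fit)^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
sqrt(rss / (nrow(cars) - 2))   # RSE, matches summary(fit)$sigma
1 - rss / tss                  # R^2, matches summary(fit)$r.squared
```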
## Multiple Linear Regression
### Partial Least Squares
- If several predictors are highly correlated, the least squares estimates have high variability.
- PLS finds linear combinations of the predictors, called *components* or *latent* variables, that are related to the response.
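As an illustration only, a minimal sketch with the `plsr()` function from the `pls` package (assumed to be installed; the `mtcars` data and the choice of three components are arbitrary here):
```{r eval=FALSE}
library(pls)
# PLS regression with up to 3 latent components, assessed by cross-validation
pls_fit <- plsr(mpg ~ ., data = mtcars, ncomp = 3, validation = "CV")
summary(pls_fit)             # variance explained per component
predict(pls_fit, ncomp = 2)  # predictions using the first two components
```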
## Linear regression in Software Effort estimation
Fitting a linear model to log-log transformed data:
- fitting $\log y = b_0 + b_1 \log x$ and exponentiating both sides gives the predictive power equation $y = e^{b_0} \cdot x^{b_1}$, ignoring the bias corrections. Note: depending on how the error term behaves we could try a generalized linear model (GLM) or another model that does not rely on the normality of the residuals (quantile regression, etc.).
- First, we fit the model to the whole dataset. This is not the right way to assess predictive performance, because evaluating a model on the same data used to fit it leads to overfitting.
```{r warning=FALSE, message=FALSE}
library(foreign)
china <- read.arff("./datasets/effortEstimation/china.arff")
china_size <- china$AFP
summary(china_size)
china_effort <- china$Effort
summary(china_effort)
par(mfrow=c(1,2))
hist(china_size, col="blue", xlab="Adjusted Function Points", main="Distribution of AFP")
hist(china_effort, col="blue",xlab="Effort", main="Distribution of Effort")
boxplot(china_size)
boxplot(china_effort)
qqnorm(china_size)
qqline(china_size)
qqnorm(china_effort)
qqline(china_effort)
```
Applying the `log` function (which computes natural logarithms, base $e$):
```{r, echo=FALSE}
par(mfrow=c(1,2))
logchina_size = log(china_size)
hist(logchina_size, col="blue", xlab="log Adjusted Function Points", main="Distribution of log AFP")
logchina_effort = log(china_effort)
hist(logchina_effort, col="blue",xlab="Effort", main="Distribution of log Effort")
qqnorm(logchina_size)
qqnorm(logchina_effort)
```
```{r}
linmodel_logchina <- lm(logchina_effort ~ logchina_size)
par(mfrow=c(1,1))
plot(logchina_size, logchina_effort)
abline(linmodel_logchina, lwd=3, col=3)
par(mfrow=c(1,2))
plot(linmodel_logchina, ask = FALSE)
linmodel_logchina
```
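To predict effort on the original scale, the fitted coefficients can be back-transformed with the power equation above. Note that exponentiating a log-scale prediction estimates the median rather than the mean of effort; the bias correction is ignored here, as noted earlier. The project size of 500 AFP is just an illustrative value:
```{r}
# back-transform: effort = exp(b0) * size^b1 (bias correction ignored)
b <- coef(linmodel_logchina)
new_size <- 500   # hypothetical project size in adjusted function points
exp(b[1]) * new_size^b[2]
# equivalently, via predict() on the log scale
exp(predict(linmodel_logchina, newdata = data.frame(logchina_size = log(new_size))))
```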
## References
- A. Hector, *The New Statistics with R*, 2015.
- W.N. Venables, D.M. Smith and the R Development Core Team, *An Introduction to R*.
- N. Zumel and J. Mount, *Practical Data Science with R*.
- G. James et al., *An Introduction to Statistical Learning with Applications in R*, Springer, 2013.